pyhanko.pdf_utils.writer module
Utilities for writing PDF files. Contains code from the PyPDF2 project; see here for the original license.
- class pyhanko.pdf_utils.writer.ObjectStream(compress=True)
Bases: object
Utility class to collect objects into a PDF object stream.
Object streams are mainly useful for space efficiency. They allow related objects to be grouped and compressed together in a more flexible manner.
Warning
Object streams can only be used in files with a cross-reference stream, as opposed to a classical XRef table. In particular, this means that incremental updates to files with a legacy XRef table cannot contain object streams either. See § 7.5.7 in ISO 32000-1 for further details.
Warning
The usefulness of object streams is somewhat stymied by the fact that PDF stream objects cannot be embedded into object streams for syntactical reasons.
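As a rough illustration of the layout described in § 7.5.7 of ISO 32000-1, the sketch below packs already-serialised objects into a header of (object number, offset) pairs followed by the object bodies, optionally Flate-compressing the whole payload. It uses only the Python standard library. `build_object_stream` is a hypothetical helper written for this example; it is not how pyHanko implements `ObjectStream` internally.

```python
import zlib


def build_object_stream(objects, compress=True):
    """Pack serialised objects into an object-stream payload (illustrative only).

    objects: dict mapping object ID -> serialised object bytes.
    Returns (stream dictionary as a plain dict, payload bytes).
    """
    header_parts = []
    body = b""
    for idnum, data in objects.items():
        # Each header pair is: object number, byte offset relative to /First.
        header_parts.append(b"%d %d" % (idnum, len(body)))
        body += data + b"\n"
    header = b" ".join(header_parts) + b"\n"
    payload = header + body

    stream_dict = {"/Type": "/ObjStm", "/N": len(objects), "/First": len(header)}
    if compress:
        # Header and bodies are compressed together as one unit.
        payload = zlib.compress(payload)
        stream_dict["/Filter"] = "/FlateDecode"
    stream_dict["/Length"] = len(payload)
    return stream_dict, payload


stream_dict, payload = build_object_stream(
    {4: b"<< /A 1 >>", 7: b"[1 2 3]"}, compress=False
)
```

Since the generation number is not stored anywhere in this layout, it is implicitly 0 for every embedded object.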
- add_object(idnum: int, obj: pyhanko.pdf_utils.generic.PdfObject)
Add an object to an object stream. Note that objects in object streams always have their generation number set to 0 by definition.
- Parameters
idnum – The object’s ID number.
obj – The object to embed into the object stream.
- Raises
TypeError – Raised if obj is an instance of StreamObject or IndirectObject.
- as_pdf_object() → pyhanko.pdf_utils.generic.StreamObject
Render the object stream to a PDF stream object.
- Returns
An instance of StreamObject.
- class pyhanko.pdf_utils.writer.BasePdfFileWriter(root, info, document_id, obj_id_start=0, stream_xrefs=True)
Bases: pyhanko.pdf_utils.rw_common.PdfHandler
Base class for PDF writers.
- output_version = (1, 7)
Output version to be declared in the output file.
- stream_xrefs: bool
Boolean controlling whether the output file will contain its cross-references in stream format or as a classical XRef table.
The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).
- set_info(info: Optional[Union[pyhanko.pdf_utils.generic.IndirectObject, pyhanko.pdf_utils.generic.DictionaryObject]])
Set the /Info entry of the document trailer.
- Parameters
info – The new /Info dictionary, either as an indirect reference or as a DictionaryObject.
- property document_id: Tuple[bytes, bytes]
- mark_update(obj_ref: Union[pyhanko.pdf_utils.generic.Reference, pyhanko.pdf_utils.generic.IndirectObject])
Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.
- Parameters
obj_ref – An indirect object instance or a reference.
- update_container(obj: pyhanko.pdf_utils.generic.PdfObject)
Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.
As with mark_update(), this only applies to incremental updates, but defaults to a no-op.
- Parameters
obj – The object whose top-level container needs to be rewritten.
- property root_ref: pyhanko.pdf_utils.generic.Reference
- Returns
A reference to the document catalog.
- update_root()
Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.
- get_object(ido)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters
ref – An instance of generic.Reference.
- Returns
A PDF object.
- allocate_placeholder() → pyhanko.pdf_utils.generic.IndirectObject
Allocate an object reference to populate later. Calls to get_object() for this reference will return NullObject until it is populated using add_object().
This method is only relevant in certain advanced contexts where an object ID needs to be known before the object it refers to can be built; chances are you’ll never need it.
- Returns
An IndirectObject instance referring to the object just allocated.
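One situation where an ID has to be known in advance is a pair of objects that refer to each other. The toy model below illustrates the allocate-then-populate pattern using plain dictionaries and integers. `ToyWriter` is a hypothetical stand-in written for this example, not pyHanko's writer, and it returns `None` where the real writer is documented to return `NullObject`.

```python
class ToyWriter:
    """Hypothetical stand-in for a writer's object table (illustration only)."""

    def __init__(self):
        self._objects = {}          # object ID -> object (None while unpopulated)
        self._placeholders = set()  # IDs handed out but not yet populated
        self._next_id = 1

    def allocate_placeholder(self):
        idnum = self._next_id
        self._next_id += 1
        self._objects[idnum] = None
        self._placeholders.add(idnum)
        return idnum

    def add_object(self, obj, idnum=None):
        if idnum is None:
            idnum = self.allocate_placeholder()
        elif idnum not in self._placeholders:
            # Mirrors the documented rule: explicit IDs must come from a placeholder.
            raise ValueError("idnum was not allocated as a placeholder")
        self._placeholders.discard(idnum)
        self._objects[idnum] = obj
        return idnum

    def get_object(self, idnum):
        return self._objects[idnum]


w = ToyWriter()
child_id = w.allocate_placeholder()              # ID is known before the child exists
parent_id = w.add_object({"/Child": child_id})   # parent can already point at it
w.add_object({"/Parent": parent_id}, idnum=child_id)
```

Without the placeholder, neither object could be built first, because each needs the other's ID.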
- add_object(obj, obj_stream: Optional[pyhanko.pdf_utils.writer.ObjectStream] = None, idnum=None) → pyhanko.pdf_utils.generic.IndirectObject
Add a new object to this writer.
- Parameters
obj – The object to add.
obj_stream – An object stream to add the object to.
idnum – Manually specify the object ID of the object to be added. This is only allowed for object IDs that have previously been allocated using allocate_placeholder().
- Returns
An IndirectObject instance referring to the object just added.
- prepare_object_stream(compress=True)
Prepare and return a new ObjectStream object.
- Parameters
compress – Indicates whether the resulting object stream should be compressed.
- Returns
An ObjectStream object.
- property trailer_view: pyhanko.pdf_utils.generic.DictionaryObject
Returns a view of the document trailer of the document represented by this PdfHandler instance.
The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).
- Returns
A generic.DictionaryObject representing the current state of the document trailer.
- write(stream)
Write the contents of this PDF writer to a stream.
- Parameters
stream – A writable output stream.
- register_annotation(page_ref, annot_ref)
Register an annotation to be added to a page. This convenience function takes care of calling mark_update() where necessary.
- Parameters
page_ref – Reference to the page object involved.
annot_ref – Reference to the annotation object to be added.
- insert_page(new_page, after=None)
Insert a page object into the tree.
- Parameters
new_page – Page object to insert.
after – Page number (zero-indexed) after which to insert the page.
- Returns
A reference to the newly inserted page.
- import_object(obj: pyhanko.pdf_utils.generic.PdfObject, obj_stream: Optional[pyhanko.pdf_utils.writer.ObjectStream] = None) → pyhanko.pdf_utils.generic.PdfObject
Deep-copy an object into this writer, dealing with resolving indirect references in the process.
Danger
The table mapping indirect references in the input to indirect references in the writer is not preserved between calls. Concretely, this means that invoking import_object() twice on the same input reader may cause object duplication.
- Parameters
obj – The object to import.
obj_stream – The object stream to import objects into.
Note
Stream objects and bare references will not be put into the object stream; the standard forbids this.
- Returns
The object as associated with this writer. If the input object was an indirect reference, a dictionary (incl. streams) or an array, the returned value will always be a new instance.
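The duplication hazard described for import_object() can be seen in a small model. The sketch below deep-copies a graph of references using a mapping table that lives only for the duration of one call, so an object reachable from two separate imports gets copied twice. `ToyImporter` and its data model (strings as source references, integers as new references) are assumptions made for this illustration and do not reflect pyHanko's internals.

```python
import itertools


class ToyImporter:
    """Hypothetical model of a per-call reference map (illustration only)."""

    def __init__(self):
        self.objects = {}               # new ref -> copied dictionary
        self._ids = itertools.count(1)

    def import_object(self, source, ref):
        memo = {}  # source ref -> new ref; thrown away when this call returns
        return self._copy(source, ref, memo)

    def _copy(self, source, ref, memo):
        if ref in memo:
            return memo[ref]
        new_ref = next(self._ids)
        memo[ref] = new_ref  # register before recursing so cycles terminate
        self.objects[new_ref] = {
            key: self._copy(source, target, memo)
            for key, target in source[ref].items()
        }
        return new_ref


# Two pages in the source document share a single font object.
source = {
    "font": {},
    "page1": {"/Font": "font"},
    "page2": {"/Font": "font"},
}

imp = ToyImporter()
p1 = imp.import_object(source, "page1")
p2 = imp.import_object(source, "page2")
# The shared font has now been copied twice: four objects instead of three.
```

In this model, importing one container that references both pages in a single call would keep the font shared, because the mapping table then covers the whole traversal.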
- import_page_as_xobject(other: pyhanko.pdf_utils.rw_common.PdfHandler, page_ix=0, content_stream=0, inherit_filters=True)
Import a page content stream from some other PdfHandler into the current one as a form XObject.
- Parameters
other – A PdfHandler.
page_ix – Index of the page to copy (default: 0).
content_stream – Index of the page’s content stream to copy, if multiple are present (default: 0).
inherit_filters – Inherit the content stream’s filters, if present.
- Returns
An IndirectObject referring to the form XObject as added to the current writer.
- class pyhanko.pdf_utils.writer.PageObject(contents, media_box, resources=None)
Bases: pyhanko.pdf_utils.generic.DictionaryObject
Subclass of DictionaryObject that handles some of the initialisation boilerplate for page objects.
- class pyhanko.pdf_utils.writer.PdfFileWriter(stream_xrefs=True, init_page_tree=True)
Bases: pyhanko.pdf_utils.writer.BasePdfFileWriter
Class to write new PDF files.
- stream_xrefs: bool
Boolean controlling whether the output file will contain its cross-references in stream format or as a classical XRef table.
The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).
- object_streams: List[pyhanko.pdf_utils.writer.ObjectStream]
- security_handler: Optional[pyhanko.pdf_utils.crypt.SecurityHandler]
- encrypt(owner_pass, user_pass=None)
Mark this document to be encrypted with PDF 2.0 encryption (AES-256).
Caution
While pyHanko supports legacy PDF encryption as well, the API to create new documents using outdated encryption is left largely undocumented on purpose to discourage its use.
This caveat does not apply to incremental updates added to existing documents.
Danger
The PDF 2.0 standard mandates AES-256 in CBC mode, and also includes 12 bytes of known plaintext by design. This implies that a sufficiently knowledgeable attacker can inject arbitrary content into your encrypted files without knowledge of the password.
Adding a digital signature to the encrypted document is not a foolproof way to deal with this either, since most viewers will still allow the document to be opened before signatures are validated, and therefore end users are still exposed to potentially malicious content.
Until the standard supports authenticated encryption schemes, you should never rely on its encryption provisions if tampering is a concern.
- Parameters
owner_pass – The desired owner password.
user_pass – The desired user password (defaults to the owner password if not specified)
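The tampering concern raised in the warning for encrypt() follows from a generic property of CBC mode: whoever knows a plaintext block can rewrite it by XOR-ing a chosen difference into the preceding ciphertext block (or the IV), without knowing the key. The sketch below demonstrates this with a deliberately trivial toy block cipher standing in for AES, since the Python standard library ships no AES implementation. It illustrates the principle only, is not a reproduction of an actual attack on PDF encryption, and every name in it is hypothetical.

```python
BLOCK = 16


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    # Toy stand-in for a block cipher (NOT AES, NOT secure): XOR, then rotate.
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]


def toy_decrypt_block(key: bytes, block: bytes) -> bytes:
    # Inverse of toy_encrypt_block: rotate back, then XOR.
    unrotated = block[-1:] + block[:-1]
    return xor(unrotated, key)


def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    out, prev = b"", iv
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_encrypt_block(key, xor(plaintext[i:i + BLOCK], prev))
        out += prev
    return out


def cbc_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    out, prev = b"", iv
    for i in range(0, len(ciphertext), BLOCK):
        block = ciphertext[i:i + BLOCK]
        out += xor(toy_decrypt_block(key, block), prev)
        prev = block
    return out


key = bytes(range(16))
iv = bytes(range(16, 32))
known = b"known plaintext!"   # first block, assumed known to the attacker
rest = b"rest of message."
ciphertext = cbc_encrypt(key, iv, known + rest)

# The attacker never touches the key: only the IV is modified.
target = b"attacker content"
forged_iv = xor(iv, xor(known, target))
tampered = cbc_decrypt(key, forged_iv, ciphertext)
```

Decryption raises no error here, which is why an unauthenticated mode cannot detect this kind of modification.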
- encrypt_pubkey(recipients: List[asn1crypto.x509.Certificate])
Mark this document to be encrypted with PDF 2.0 public key encryption. The certificates passed in should be RSA certificates.
PyHanko defaults to AES-256 to encrypt the actual file contents. The seed used to derive the file encryption key is also encrypted using AES-256 and bundled in a CMS EnvelopedData object. The envelope key is then encrypted separately for each recipient, using their respective public keys.
Caution
The caveats for encrypt() also apply here.
- Parameters
recipients – Certificates of the recipients that should be able to decrypt the document.
- pyhanko.pdf_utils.writer.init_xobject_dictionary(command_stream: bytes, box_width, box_height, resources: Optional[pyhanko.pdf_utils.generic.DictionaryObject] = None) → pyhanko.pdf_utils.generic.StreamObject
Helper function to initialise form XObject dictionaries.
Note
For utilities to handle image XObjects, see images.
- Parameters
command_stream – The XObject’s raw appearance stream.
box_width – The width of the XObject’s bounding box.
box_height – The height of the XObject’s bounding box.
resources – A resource dictionary to include with the form object.
- Returns
A StreamObject representation of the form XObject.
- pyhanko.pdf_utils.writer.copy_into_new_writer(input_handler: pyhanko.pdf_utils.rw_common.PdfHandler) → pyhanko.pdf_utils.writer.PdfFileWriter
Copy all objects in a given PDF handler into a new PdfFileWriter. This operation will attempt to preserve the document catalog of the original input_handler.
Very roughly, calling this function and then immediately invoking write() on the resulting writer should produce a document that is equivalent to the original as far as presentation is concerned. As a general rule, behaviour that is controlled from outside the document catalog (e.g. encryption) or that requires byte-for-byte equivalence with the original (e.g. digital signatures) will not survive this translation.
- Parameters
input_handler – PdfHandler to source objects from.
- Returns
New PdfFileWriter containing all objects from the input handler.