pyhanko.pdf_utils.writer module

Utilities for writing PDF files. Contains code from the PyPDF2 project; that code remains subject to the original PyPDF2 license.

class pyhanko.pdf_utils.writer.BasePdfFileWriter(root, info, document_id, obj_id_start=0, stream_xrefs=True)

Bases: PdfHandler

Base class for PDF writers.

output_version = (1, 7): Output version to be declared in the output file.

stream_xrefs: bool

Boolean controlling whether the output file stores its cross-references in a cross-reference stream (True) or in a classical XRef table (False).

The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).

get_subset_collection(base_postscript_name: str)

property document_meta: DocumentMetadata

property document_meta_view: DocumentMetadata

ensure_output_version(version)

set_info(info: Optional[Union[IndirectObject, DictionaryObject]])

Set the /Info entry of the document trailer.

Parameters: info – The new /Info dictionary, either as an indirect reference or as a DictionaryObject

set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters

key – Dictionary key to use in the trailer.
value – Value to set

property document_id: Tuple[bytes, bytes]

mark_update(obj_ref: Union[Reference, IndirectObject])

Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.

Parameters: obj_ref – An indirect object instance or a reference.

update_container(obj: PdfObject)

Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.

As with mark_update(), this only applies to incremental updates, but defaults to a no-op.

Parameters: obj – The object whose top-level container needs to be rewritten.

property root_ref: Reference

Returns: A reference to the document catalog.

update_root(): Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.

register_extension(ext: DeveloperExtension)

get_object(ido, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters

ido – An instance of generic.Reference.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns

A PDF object.

allocate_placeholder() → IndirectObject

Allocate an object reference to populate later. Calls to get_object() for this reference will return NullObject until it is populated using add_object().

This method is only relevant in certain advanced contexts where an object ID needs to be known before the object it refers to can be built; chances are you’ll never need it.

Returns: An IndirectObject instance referring to the object just allocated.

add_object(obj, obj_stream: Optional[ObjectStream] = None, idnum=None) → IndirectObject

Add a new object to this writer.

Parameters

obj – The object to add.
obj_stream – An object stream to add the object to.
idnum – Manually specify the object ID of the object to be added. This is only allowed for object IDs that have previously been allocated using allocate_placeholder().

Returns

An IndirectObject instance referring to the object just added.

prepare_object_stream(compress=True)

Prepare and return a new ObjectStream object.

Parameters: compress – Indicates whether the resulting object stream should be compressed.
Returns: An ObjectStream object.

property trailer_view: DictionaryObject

Returns a view of the trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns: A generic.DictionaryObject representing the current state of the document trailer.

write(stream)

Write the contents of this PDF writer to a stream.

Parameters: stream – A writable output stream.

register_annotation(page_ref, annot_ref)

Register an annotation to be added to a page. This convenience function takes care of calling mark_update() where necessary.

Parameters

page_ref – Reference to the page object involved.
annot_ref – Reference to the annotation object to be added.

insert_page(new_page, after=None)

Insert a page object into the tree.

Parameters

new_page – Page object to insert.
after – Page number (zero-indexed) after which to insert the page.

Returns

A reference to the newly inserted page.

import_object(obj: PdfObject, obj_stream: Optional[ObjectStream] = None) → PdfObject

Deep-copy an object into this writer, dealing with resolving indirect references in the process.

Danger

The table mapping indirect references in the input to indirect references in the writer is not preserved between calls. Concretely, this means that invoking import_object() twice on the same input reader may cause object duplication.

Parameters

obj – The object to import.
obj_stream – The object stream to import objects into.

Note

Stream objects and bare references will not be put into the object stream; the standard forbids this.

Returns

The object as associated with this writer. If the input object was an indirect reference, a dictionary (incl. streams) or an array, the returned value will always be a new instance.

import_page_as_xobject(other: PdfHandler, page_ix=0, inherit_filters=True)

Import a page content stream from some other PdfHandler into the current one as a form XObject.

Parameters

other – A PdfHandler
page_ix – Index of the page to copy (default: 0)
inherit_filters – Inherit the content stream’s filters, if present.

Returns

An IndirectObject referring to the form XObject created from the page, as added to the current writer.

add_stream_to_page(page_ix, stream_ref, resources=None, prepend=False)

Append an indirect stream object to a page in a PDF as a content stream.

Parameters

page_ix – Index of the page to modify. The first page has index 0.
stream_ref – IndirectObject reference to the stream object to add.
resources – Resource dictionary containing resources to add to the page’s existing resource dictionary.
prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns

An IndirectObject reference to the page object that was modified.

add_content_to_page(page_ix, pdf_content: PdfContent, prepend=False)

Convenience wrapper around add_stream_to_page() to turn a PdfContent instance into a page content stream.

Parameters

page_ix – Index of the page to modify. The first page has index 0.
pdf_content – An instance of PdfContent
prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns

An IndirectObject reference to the page object that was modified.

merge_resources(orig_dict, new_dict) → bool

Update an existing resource dictionary object with data from another one. Returns True if the original dict object was modified directly.

The caller is responsible for avoiding name conflicts with existing resources.

class pyhanko.pdf_utils.writer.PageObject(contents, media_box, resources=None)

Bases: DictionaryObject

Subclass of DictionaryObject that handles some of the initialisation boilerplate for page objects.

class pyhanko.pdf_utils.writer.PdfFileWriter(stream_xrefs=True, init_page_tree=True, info=None)

Bases: BasePdfFileWriter

Class to write new PDF files.

encrypt(owner_pass, user_pass=None, **kwargs)

Mark this document to be encrypted with PDF 2.0 encryption (AES-256).

Caution

While pyHanko supports legacy PDF encryption as well, the API to create new documents using outdated encryption is left largely undocumented on purpose to discourage its use.

This caveat does not apply to incremental updates added to existing documents.

Danger

The PDF 2.0 standard mandates AES-256 in CBC mode, and also includes 12 bytes of known plaintext by design. This implies that a sufficiently knowledgeable attacker can inject arbitrary content into your encrypted files without knowledge of the password.

Adding a digital signature to the encrypted document is not a foolproof way to deal with this either, since most viewers will still allow the document to be opened before signatures are validated, and therefore end users are still exposed to potentially malicious content.

Until the standard supports authenticated encryption schemes, you should never rely on its encryption provisions if tampering is a concern.

Parameters

owner_pass – The desired owner password.
user_pass – The desired user password (defaults to the owner password if not specified)
kwargs – Other keyword arguments to be passed to StandardSecurityHandler.build_from_pw().

encrypt_pubkey(recipients: List[Certificate], **kwargs)

Mark this document to be encrypted with PDF 2.0 public key encryption. The certificates passed in should be RSA certificates.

PyHanko defaults to AES-256 to encrypt the actual file contents. The seed used to derive the file encryption key is also encrypted using AES-256 and bundled in a CMS EnvelopedData object. The envelope key is then encrypted separately for each recipient, using their respective public keys.

Caution

The caveats for encrypt() also apply here.

Parameters

recipients – Certificates of the recipients that should be able to decrypt the document.
kwargs – Other keyword arguments to be passed to PubKeySecurityHandler.build_from_certs().

stream_xrefs: bool

Boolean controlling whether the output file stores its cross-references in a cross-reference stream (True) or in a classical XRef table (False).

The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).

object_streams: List[ObjectStream]

security_handler: Optional[SecurityHandler]

set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters

key – Dictionary key to use in the trailer.
value – Value to set

pyhanko.pdf_utils.writer.init_xobject_dictionary(command_stream: bytes, box_width, box_height, resources: Optional[DictionaryObject] = None) → StreamObject

Helper function to initialise form XObject dictionaries.

Note

For utilities to handle image XObjects, see images.

Parameters

command_stream – The XObject’s raw appearance stream.
box_width – The width of the XObject’s bounding box.
box_height – The height of the XObject’s bounding box.
resources – A resource dictionary to include with the form object.

Returns

A StreamObject representation of the form XObject.

pyhanko.pdf_utils.writer.copy_into_new_writer(input_handler: PdfHandler, writer_kwargs: Optional[dict] = None) → PdfFileWriter

Copy all objects in a given PDF handler into a new PdfFileWriter. This operation will attempt to preserve the document catalog of the original input_handler.

Very roughly, calling this function and then immediately invoking write() on the resulting writer should result in an equivalent document as far as presentation is concerned. As a general rule, behaviour that is controlled from outside the document catalog (e.g. encryption) or that requires byte-for-byte equivalence with the original (e.g. digital signatures) will not survive this translation.

Parameters

input_handler – PdfHandler to source objects from.
writer_kwargs – Keyword arguments to pass to the writer.

Returns

New PdfFileWriter containing all objects from the input handler.