pyhanko.pdf_utils.writer module

Utilities for writing PDF files. Contains code from the PyPDF2 project; refer to that project for the original license.

class pyhanko.pdf_utils.writer.BasePdfFileWriter(root, info, document_id, obj_id_start=0, stream_xrefs=True)

Bases: PdfHandler

Base class for PDF writers.

output_version = (1, 7)

Output version to be declared in the output file.

stream_xrefs: bool

Boolean controlling whether or not the output file will contain its cross-references in stream format, or as a classical XRef table.

The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).

get_subset_collection(base_postscript_name: str)
property document_meta: DocumentMetadata
property document_meta_view: DocumentMetadata
ensure_output_version(version)
set_info(info: Optional[Union[IndirectObject, DictionaryObject]])

Set the /Info entry of the document trailer.

Parameters

info – The new /Info dictionary, either as an indirect reference or as a DictionaryObject

set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters
  • key – Dictionary key to use in the trailer.

  • value – Value to set

property document_id: Tuple[bytes, bytes]
mark_update(obj_ref: Union[Reference, IndirectObject])

Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.

Parameters

obj_ref – An indirect object instance or a reference.

update_container(obj: PdfObject)

Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.

As with mark_update(), this only applies to incremental updates, but defaults to a no-op.

Parameters

obj – The object whose top-level container needs to be rewritten.

property root_ref: Reference
Returns

A reference to the document catalog.

update_root()

Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.

register_extension(ext: DeveloperExtension)
get_object(ido, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters
  • ido – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns

A PDF object.

allocate_placeholder() → IndirectObject

Allocate an object reference to populate later. Calls to get_object() for this reference will return NullObject until it is populated using add_object().

This method is only relevant in certain advanced contexts where an object ID needs to be known before the object it refers to can be built; chances are you’ll never need it.

Returns

An IndirectObject instance referring to the object just allocated.

add_object(obj, obj_stream: Optional[ObjectStream] = None, idnum=None) → IndirectObject

Add a new object to this writer.

Parameters
  • obj – The object to add.

  • obj_stream – An object stream to add the object to.

  • idnum – Manually specify the object ID of the object to be added. This is only allowed for object IDs that have previously been allocated using allocate_placeholder().

Returns

An IndirectObject instance referring to the object just added.

prepare_object_stream(compress=True)

Prepare and return a new ObjectStream object.

Parameters

compress – Indicates whether the resulting object stream should be compressed.

Returns

An ObjectStream object.

property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns

A generic.DictionaryObject representing the current state of the document trailer.

write(stream)

Write the contents of this PDF writer to a stream.

Parameters

stream – A writable output stream.

register_annotation(page_ref, annot_ref)

Register an annotation to be added to a page. This convenience function takes care of calling mark_update() where necessary.

Parameters
  • page_ref – Reference to the page object involved.

  • annot_ref – Reference to the annotation object to be added.

insert_page(new_page, after=None)

Insert a page object into the tree.

Parameters
  • new_page – Page object to insert.

  • after – Page number (zero-indexed) after which to insert the page.

Returns

A reference to the newly inserted page.

import_object(obj: PdfObject, obj_stream: Optional[ObjectStream] = None) → PdfObject

Deep-copy an object into this writer, dealing with resolving indirect references in the process.

Danger

The table mapping indirect references in the input to indirect references in the writer is not preserved between calls. Concretely, this means that invoking import_object() twice on the same input reader may cause object duplication.

Parameters
  • obj – The object to import.

  • obj_stream

    The object stream to import objects into.

    Note

    Stream objects and bare references will not be put into the object stream; the standard forbids this.

Returns

The object as associated with this writer. If the input object was an indirect reference, a dictionary (incl. streams) or an array, the returned value will always be a new instance.

import_page_as_xobject(other: PdfHandler, page_ix=0, inherit_filters=True)

Import a page content stream from some other PdfHandler into the current one as a form XObject.

Parameters
  • other – A PdfHandler

  • page_ix – Index of the page to copy (default: 0)

  • inherit_filters – Inherit the content stream’s filters, if present.

Returns

An IndirectObject referring to the form XObject as added to the current writer.

add_stream_to_page(page_ix, stream_ref, resources=None, prepend=False)

Append an indirect stream object to a page in a PDF as a content stream.

Parameters
  • page_ix – Index of the page to modify. The first page has index 0.

  • stream_ref – IndirectObject reference to the stream object to add.

  • resources – Resource dictionary containing resources to add to the page’s existing resource dictionary.

  • prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns

An IndirectObject reference to the page object that was modified.

add_content_to_page(page_ix, pdf_content: PdfContent, prepend=False)

Convenience wrapper around add_stream_to_page() to turn a PdfContent instance into a page content stream.

Parameters
  • page_ix – Index of the page to modify. The first page has index 0.

  • pdf_content – An instance of PdfContent

  • prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns

An IndirectObject reference to the page object that was modified.

merge_resources(orig_dict, new_dict) → bool

Update an existing resource dictionary object with data from another one. Returns True if the original dict object was modified directly.

The caller is responsible for avoiding name conflicts with existing resources.

class pyhanko.pdf_utils.writer.PageObject(contents, media_box, resources=None)

Bases: DictionaryObject

Subclass of DictionaryObject that handles some of the initialisation boilerplate for page objects.

class pyhanko.pdf_utils.writer.PdfFileWriter(stream_xrefs=True, init_page_tree=True, info=None)

Bases: BasePdfFileWriter

Class to write new PDF files.

encrypt(owner_pass, user_pass=None, **kwargs)

Mark this document to be encrypted with PDF 2.0 encryption (AES-256).

Caution

While pyHanko supports legacy PDF encryption as well, the API to create new documents using outdated encryption is left largely undocumented on purpose to discourage its use.

This caveat does not apply to incremental updates added to existing documents.

Danger

The PDF 2.0 standard mandates AES-256 in CBC mode, and also includes 12 bytes of known plaintext by design. This implies that a sufficiently knowledgeable attacker can inject arbitrary content into your encrypted files without knowledge of the password.

Adding a digital signature to the encrypted document is not a foolproof way to deal with this either, since most viewers will still allow the document to be opened before signatures are validated, and therefore end users are still exposed to potentially malicious content.

Until the standard supports authenticated encryption schemes, you should never rely on its encryption provisions if tampering is a concern.

Parameters
  • owner_pass – The desired owner password.

  • user_pass – The desired user password (defaults to the owner password if not specified)

  • kwargs – Other keyword arguments to be passed to StandardSecurityHandler.build_from_pw().

encrypt_pubkey(recipients: List[Certificate], **kwargs)

Mark this document to be encrypted with PDF 2.0 public key encryption. The certificates passed in should be RSA certificates.

PyHanko defaults to AES-256 to encrypt the actual file contents. The seed used to derive the file encryption key is also encrypted using AES-256 and bundled in a CMS EnvelopedData object. The envelope key is then encrypted separately for each recipient, using their respective public keys.

Caution

The caveats for encrypt() also apply here.

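Parameters
  • recipients – Certificates of the recipients who should be able to decrypt the document.

  • kwargs – Other keyword arguments, passed on to the underlying security handler.
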
stream_xrefs: bool

Boolean controlling whether or not the output file will contain its cross-references in stream format, or as a classical XRef table.

The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).

object_streams: List[ObjectStream]
security_handler: Optional[SecurityHandler]
set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters
  • key – Dictionary key to use in the trailer.

  • value – Value to set

pyhanko.pdf_utils.writer.init_xobject_dictionary(command_stream: bytes, box_width, box_height, resources: Optional[DictionaryObject] = None) → StreamObject

Helper function to initialise form XObject dictionaries.

Note

For utilities to handle image XObjects, see pyhanko.pdf_utils.images.

Parameters
  • command_stream – The XObject’s raw appearance stream.

  • box_width – The width of the XObject’s bounding box.

  • box_height – The height of the XObject’s bounding box.

  • resources – A resource dictionary to include with the form object.

Returns

A StreamObject representation of the form XObject.

pyhanko.pdf_utils.writer.copy_into_new_writer(input_handler: PdfHandler, writer_kwargs: Optional[dict] = None) → PdfFileWriter

Copy all objects in a given PDF handler into a new PdfFileWriter. This operation will attempt to preserve the document catalog of the original input_handler.

Very roughly, calling this function and then immediately invoking write() on the resulting writer should result in an equivalent document as far as presentation is concerned. As a general rule, behaviour that is controlled from outside the document catalog (e.g. encryption) or that requires byte-for-byte equivalence with the original (e.g. digital signatures) will not survive this translation.

Parameters
  • input_handler – PdfHandler to source objects from.

  • writer_kwargs – Keyword arguments to pass to the writer.

Returns

New PdfFileWriter containing all objects from the input handler.