pyhanko.pdf_utils package

Subpackages

Submodules

pyhanko.pdf_utils.barcodes module

class pyhanko.pdf_utils.barcodes.BarcodeBox(barcode_type, code)

Bases: PdfContent

Thin wrapper around python-barcode functionality.

This will render a barcode of the specified type as PDF graphics operators.

render() bytes

Compile the content to graphics operators.

class pyhanko.pdf_utils.barcodes.PdfStreamBarcodeWriter

Bases: BaseWriter

Implementation of writer class for the python-barcode library to output PDF graphics operators. Note: _paint_text is intentionally dummied out. Please use the functionality implemented in pyhanko.pdf_utils.text instead.

property command_stream: bytes

save(filename, output)

Saves the rendered output to filename.

Parameters:
  • filename (String) – Filename without extension.

  • output (String) – The rendered output.

Returns:

The full filename with extension.

Return type:

String

pyhanko.pdf_utils.content module

class pyhanko.pdf_utils.content.ResourceType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum listing resources that can be used as keys in a resource dictionary.

See ISO 32000-1, § 7.8.3 Table 34.

EXT_G_STATE = '/ExtGState'

External graphics state specifications. See ISO 32000-1, § 8.4.5.

COLOR_SPACE = '/ColorSpace'

Colour space definitions. See ISO 32000-1, § 8.6.

PATTERN = '/Pattern'

Pattern definitions. See ISO 32000-1, § 8.7.

SHADING = '/Shading'

Shading definitions. See ISO 32000-1, § 8.7.4.3.

XOBJECT = '/XObject'

External object definitions (images and form XObjects). See ISO 32000-1, § 8.8.

FONT = '/Font'

Font specifications. See ISO 32000-1, § 9.

PROPERTIES = '/Properties'

Marked content properties. See ISO 32000-1, § 14.6.2.

exception pyhanko.pdf_utils.content.ResourceManagementError

Bases: ValueError

Used to signal problems with resource dictionaries.

class pyhanko.pdf_utils.content.PdfResources

Bases: object

Representation of a PDF resource dictionary.

This class implements __getitem__() with ResourceType keys for dynamic access to its attributes. To merge one PdfResources instance into another, the class overrides __iadd__(), so you can write:

res1 += res2

Note: Merging two resource dictionaries with conflicting resource names will produce a ResourceManagementError.

Note: This class is currently only used for new resource dictionaries.
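The merge behaviour described above can be illustrated without pyHanko itself. The following is a minimal sketch built on plain dictionaries: ToyResources and ToyConflictError are hypothetical stand-ins, not pyHanko classes, and the sketch assumes that any repeated name within a category counts as a conflict.

```python
class ToyConflictError(ValueError):
    """Hypothetical stand-in for ResourceManagementError."""


class ToyResources:
    """Hypothetical stand-in for PdfResources: one dict per category."""

    CATEGORIES = ('/ExtGState', '/ColorSpace', '/Pattern', '/Shading',
                  '/XObject', '/Font', '/Properties')

    def __init__(self):
        self._data = {category: {} for category in self.CATEGORIES}

    def __getitem__(self, category):
        return self._data[category]

    def __iadd__(self, other):
        # Check for conflicts first, so a failed merge leaves self unchanged.
        for category in self.CATEGORIES:
            clashes = self._data[category].keys() & other[category].keys()
            if clashes:
                raise ToyConflictError(
                    f"Conflicting names in {category}: {sorted(clashes)}"
                )
        for category in self.CATEGORIES:
            self._data[category].update(other[category])
        return self


res1, res2 = ToyResources(), ToyResources()
res1['/Font']['/F1'] = 'font-one'
res2['/Font']['/F2'] = 'font-two'
res1 += res2  # no name clash, so both fonts end up in res1

res3 = ToyResources()
res3['/Font']['/F1'] = 'another-font'
try:
    res1 += res3  # '/F1' is already taken
    merge_failed = False
except ToyConflictError:
    merge_failed = True
```

Checking for clashes before copying anything is a design choice of this sketch; it keeps the left-hand operand intact when the merge is rejected.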

as_pdf_object() DictionaryObject

Render this instance of PdfResources to an actual resource dictionary.

class pyhanko.pdf_utils.content.PdfContent(resources: PdfResources | None = None, box: BoxConstraints | None = None, writer: BasePdfFileWriter | None = None)

Bases: object

Abstract representation of part of a PDF content stream.

Warning

Whether PdfContent instances can be reused or not is left up to the subclasses.

writer = None

The __init__() method comes with an optional writer parameter that can be used to let subclasses register external resources with the writer by themselves.

It can also be set after the fact by calling set_writer().

set_resource(category: ResourceType, name: NameObject, value: PdfObject)

Set a value in the resource dictionary associated with this content fragment.

Parameters:
  • category – The resource category to which the resource belongs.

  • name – The resource’s (internal) name.

  • value – The resource’s value.

import_resources(resources: PdfResources)

Import resources from another resource dictionary.

Parameters:

resources – An instance of PdfResources.

Raises:

ResourceManagementError – Raised when there is a resource name conflict.

property resources: PdfResources
Returns:

The PdfResources instance associated with this content fragment.

render() bytes

Compile the content to graphics operators.

as_form_xobject() StreamObject

Render the object to a form XObject to be referenced by another content stream. See ISO 32000-1, § 8.8.

Note: Even if writer is set, the resulting form XObject will not be registered. This is left up to the caller.

Returns:

A StreamObject instance representing the resulting form XObject.

set_writer(writer)

Override the currently registered writer object.

Parameters:

writer – An instance of BasePdfFileWriter.

add_to_page(writer: BasePdfFileWriter, page_ix: int, prepend: bool = False)

Convenience wrapper around BasePdfFileWriter.add_stream_to_page() to turn a PdfContent instance into a page content stream.

Parameters:
  • writer – A PDF file writer.

  • page_ix – Index of the page to modify. The first page has index 0.

  • prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns:

An IndirectObject reference to the page object that was modified.

class pyhanko.pdf_utils.content.RawContent(data: bytes, resources: PdfResources | None = None, box: BoxConstraints | None = None)

Bases: PdfContent

Raw byte sequence to be used as PDF content.

render() bytes

Compile the content to graphics operators.

class pyhanko.pdf_utils.content.ImportedPdfPage(file_name, page_ix=0)

Bases: PdfContent

Import a page from another PDF file (lazily).

render() bytes

Compile the content to graphics operators.

pyhanko.pdf_utils.embed module

Utility classes for handling embedded files in PDFs.

Added in version 0.7.0.

pyhanko.pdf_utils.embed.embed_file(pdf_writer: BasePdfFileWriter, spec: FileSpec)

Embed a file in the document-wide embedded file registry of a PDF writer.

Parameters:
  • pdf_writer – PDF writer to house the embedded file.

  • spec – File spec describing the embedded file.


class pyhanko.pdf_utils.embed.EmbeddedFileObject(pdf_writer: BasePdfFileWriter, dict_data=None, stream_data=None, encoded_data=None, params: EmbeddedFileParams | None = None, mime_type: str | None = None)

Bases: StreamObject

classmethod from_file_data(pdf_writer: BasePdfFileWriter, data: bytes, compress=True, params: EmbeddedFileParams | None = None, mime_type: str | None = None) EmbeddedFileObject

Construct an embedded file object from file data.

This is a very thin wrapper around the constructor, with a slightly less intimidating API.

Note

This method will not register the embedded file into the document’s embedded file namespace, see embed_file().

Parameters:
  • pdf_writer – PDF writer to use.

  • data – File contents, as a bytes object.

  • compress – Whether to compress the embedded file’s contents.

  • params – Optional embedded file parameters.

  • mime_type – Optional MIME type string.

Returns:

An embedded file object.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.embed.EmbeddedFileParams(embed_size: bool = True, embed_checksum: bool = True, creation_date: datetime.datetime | None = None, modification_date: datetime.datetime | None = None)

Bases: object

embed_size: bool = True

If true, record the file size of the embedded file.

Note

This value is computed over the file content before PDF filters are applied. This may have performance implications in cases where the file stream contents are presented in pre-encoded form.

embed_checksum: bool = True

If true, add an MD5 checksum of the file contents.

Note

This value is computed over the file content before PDF filters are applied. This may have performance implications in cases where the file stream contents are presented in pre-encoded form.

creation_date: datetime | None = None

Record the creation date of the embedded file.

modification_date: datetime | None = None

Record the modification date of the embedded file.
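The notes on embed_size and embed_checksum say that both values are computed over the file content before PDF filters are applied. A minimal standard-library sketch of that ordering follows; the variable names are illustrative, and zlib merely stands in for whatever filter ends up being applied to the stream.

```python
import hashlib
import zlib

file_contents = b"example attachment contents\n" * 100

# Both values describe the unfiltered file contents,
# so they are computed before any stream filter is applied.
size = len(file_contents)
md5_checksum = hashlib.md5(file_contents).digest()

# What would end up in the stream after a Flate filter.
encoded = zlib.compress(file_contents)
```

If the stream data is only available in pre-encoded form, obtaining these values means decoding it first, which is the performance cost the notes refer to.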

class pyhanko.pdf_utils.embed.FileSpec(file_spec_string: str, file_name: str | None = None, embedded_data: EmbeddedFileObject | None = None, description: str | None = None, af_relationship: NameObject | None = None, f_related_files: List[RelatedFileSpec] | None = None, uf_related_files: List[RelatedFileSpec] | None = None)

Bases: object

Dataclass modelling an embedded file description in a PDF.

file_spec_string: str

A path-like file specification string, or URL.

Note

For backwards compatibility, this string should be encodable in PDFDocEncoding. For names that require general Unicode support, refer to file_name.

file_name: str | None = None

A path-like Unicode file name.

embedded_data: EmbeddedFileObject | None = None

Reference to a stream object containing the file’s data, as embedded in the PDF file.

description: str | None = None

Textual description of the file.

af_relationship: NameObject | None = None

Associated file relationship specifier.

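f_related_files: List[RelatedFileSpec] | None = None

Related files with PDFDocEncoded names.

uf_related_files: List[RelatedFileSpec] | None = None

Related files with Unicode-encoded names.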

as_pdf_object() DictionaryObject

Represent the file spec as a PDF dictionary.

class pyhanko.pdf_utils.embed.RelatedFileSpec(name: str, embedded_data: EmbeddedFileObject)

Bases: object

Dataclass modelling a RelatedFile construct in PDF.

name: str

Name of the related file.

Note

The encoding requirements of this field depend on whether the related file is included via the /F or /UF key.

embedded_data: EmbeddedFileObject

Reference to a stream object containing the file’s data, as embedded in the PDF file.

pyhanko.pdf_utils.embed.wrap_encrypted_payload(plaintext_payload: bytes, *, password: str | None = None, certs: List[Certificate] | None = None, security_handler: SecurityHandler | None = None, file_spec_string: str = 'attachment.pdf', params: EmbeddedFileParams | None = None, file_name: str | None = None, description='Wrapped document', include_explanation_page=True) PdfFileWriter

Include a PDF document as an encrypted attachment in a wrapper document.

This function sets certain flags in the wrapper document’s collection dictionary to instruct compliant PDF viewers to display the attachment instead of the wrapping document. Viewers that do not fully support PDF collections will display a landing page instead, explaining how to open the attachment manually.

Using this method mitigates some weaknesses in the PDF standard’s encryption provisions, and makes it harder to manipulate the encrypted attachment without knowing the encryption key.

Danger

Until PDF supports authenticated encryption mechanisms, this is a mitigation strategy, not a foolproof defence mechanism.

Warning

While users of viewers that do not support PDF collections can still open the attached file manually, the viewer still has to support PDF files where only the attachments are encrypted.

Note

This is not quite the same as the “unencrypted wrapper document” pattern discussed in the PDF 2.0 specification. The latter is intended to support nonstandard security handlers. This function uses a standard security handler on the wrapping document to encrypt the attachment as a binary blob. Moreover, the functionality in this function is available in PDF 1.7 viewers as well.

Parameters:
  • plaintext_payload – The plaintext payload (a binary representation of a PDF document).

  • security_handler – The security handler to use on the wrapper document. If None, a security handler will be constructed based on the password or certs parameter.

  • password – Password to encrypt the attachment with. Will be ignored if security_handler is provided.

  • certs – Encrypt the file using PDF public-key encryption, targeting the keys in the provided certificates. Will be ignored if security_handler is provided.

  • file_spec_string – PDFDocEncoded file spec string for the attachment.

  • params – Embedded file parameters to use.

  • file_name – Unicode file name for the attachment.

  • description – Description for the attachment

  • include_explanation_page – If False, do not generate an explanation page in the wrapper document. This setting could be useful if you want to customise the wrapper document’s behaviour yourself.

Returns:

A PdfFileWriter representing the wrapper document.

pyhanko.pdf_utils.extensions module

class pyhanko.pdf_utils.extensions.DevExtensionMultivalued(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Setting indicating how an extension is expected to behave with respect to the new mechanism for multivalued extensions in ISO 32000-2:2020.

ALWAYS = 1

Always serialise this extension as a multivalued extension.

NEVER = 2

Never serialise this extension as a multivalued extension.

MAYBE = 3

Make this extension single-valued whenever possible, but allow multiple values as well, e.g. when a different but non-comparable extension with the same prefix is already present in the file.

class pyhanko.pdf_utils.extensions.DeveloperExtension(prefix_name: NameObject, base_version: NameObject, extension_level: int, url: str | None = None, extension_revision: str | None = None, compare_by_level: bool = False, subsumed_by: Iterable[int] = (), subsumes: Iterable[int] = (), multivalued: DevExtensionMultivalued = DevExtensionMultivalued.MAYBE)

Bases: object

PDF developer extension designation.

prefix_name: NameObject

Registered developer prefix.

base_version: NameObject

Base version to which the extension applies.

extension_level: int

Extension level.

url: str | None = None

Optional URL linking to the extension’s documentation.

extension_revision: str | None = None

Optional extra revision information. Not comparable.

compare_by_level: bool = False

Compare developer extensions by level number. If this value is True and a copy of this extension already exists in the target file with a higher level number, do not override it. If one exists with a lower level number, override it.

If this value is False, the decision is based on subsumed_by and subsumes.

Warning

It is generally not safe to assume that extension levels are used as a versioning system (i.e. that higher extension levels supersede lower ones), which is why the default is False.

subsumed_by: Iterable[int] = ()

List of extension levels that would subsume this one. If one of these is present in the extensions dictionary, attempting to register this extension will not override it.

Default value: empty.

Warning

This parameter is ignored if compare_by_level is True.

subsumes: Iterable[int] = ()

List of extensions explicitly subsumed by this one. If one of these is present in the extensions dictionary, attempting to register this extension will override it.

Default value: empty.

Warning

This parameter is ignored if compare_by_level is True.
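The interplay between compare_by_level, subsumed_by and subsumes can be summarised as a small decision function. This is a sketch of the rules as documented above, not pyHanko's registration code: should_override is a hypothetical helper, and it returns None for cases that the descriptions do not settle (equal levels, or an existing level that appears in neither list).

```python
from typing import Iterable, Optional


def should_override(new_level: int, existing_level: int, *,
                    compare_by_level: bool = False,
                    subsumed_by: Iterable[int] = (),
                    subsumes: Iterable[int] = ()) -> Optional[bool]:
    """Hypothetical helper: does a new extension replace an existing one?

    Returns True (override), False (keep the existing one), or None when
    the documented rules do not settle the question.
    """
    if compare_by_level:
        # subsumed_by and subsumes are ignored in this mode.
        if existing_level > new_level:
            return False
        if existing_level < new_level:
            return True
        return None  # equal levels: not covered by the description
    if existing_level in subsumed_by:
        return False
    if existing_level in subsumes:
        return True
    return None
```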

multivalued: DevExtensionMultivalued = 3

Setting indicating whether this extension is expected to behave well w.r.t. the new mechanism for multivalued extensions in ISO 32000-2:2020.

as_pdf_object() DictionaryObject

Format the data in this object into a PDF dictionary for registration into the /Extensions dictionary.

Returns:

A generic.DictionaryObject.

pyhanko.pdf_utils.filters module

Implementation of stream filters for PDF.

Taken from PyPDF2 with modifications. See here for the original license of the PyPDF2 project.

Note that not all decoders specified in the standard are supported. In particular /LZWDecode and the various JPEG-based decoders are missing.

class pyhanko.pdf_utils.filters.Decoder

Bases: object

General filter/decoder interface.

decode(data: bytes, decode_params: dict) bytes

Decode a stream.

Parameters:
  • data – Data to decode.

  • decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Decoded data.

encode(data: bytes, decode_params: dict) bytes

Encode a stream.

Parameters:
  • data – Data to encode.

  • decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Encoded data.

class pyhanko.pdf_utils.filters.ASCII85Decode

Bases: Decoder

Implementation of the base 85 encoding scheme specified in ISO 32000-1.

encode(data: bytes, decode_params=None) bytes

Encode a stream.

Parameters:
  • data – Data to encode.

  • decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Encoded data.

decode(data, decode_params=None)

Decode a stream.

Parameters:
  • data – Data to decode.

  • decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Decoded data.
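For readers unfamiliar with the base-85 scheme, here is a rough standard-library sketch of an encode/decode round trip. It relies on Python's base64 module rather than on pyHanko; a85_encode and a85_decode are hypothetical helpers, and pyHanko's own handling of whitespace and the end-of-data marker may differ.

```python
import base64

EOD = b'~>'  # end-of-data marker used by the PDF ASCII85 filter


def a85_encode(data: bytes) -> bytes:
    # Without adobe=True, base64.a85encode omits the '<~' prefix,
    # which PDF streams do not use; only the '~>' terminator is added.
    return base64.a85encode(data) + EOD


def a85_decode(encoded: bytes) -> bytes:
    body = encoded.strip()
    if body.endswith(EOD):
        body = body[:-len(EOD)]
    return base64.a85decode(body)
```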

class pyhanko.pdf_utils.filters.ASCIIHexDecode

Bases: Decoder

Wrapper around binascii.hexlify() that implements the Decoder interface.

encode(data: bytes, decode_params=None) bytes

Encode a stream.

Parameters:
  • data – Data to encode.

  • decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Encoded data.

decode(data, decode_params=None)

Decode a stream.

Parameters:
  • data – Data to decode.

  • decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Decoded data.
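A minimal sketch of the hexadecimal filter, using binascii directly. The helpers hex_encode and hex_decode are hypothetical; whether pyHanko appends the '>' end-of-data marker or tolerates whitespace in exactly this way is an assumption.

```python
import binascii

EOD = b'>'  # end-of-data marker of the ASCIIHexDecode filter


def hex_encode(data: bytes) -> bytes:
    return binascii.hexlify(data) + EOD


def hex_decode(encoded: bytes) -> bytes:
    # Drop everything from the EOD marker onwards, then remove
    # whitespace, which hex-encoded stream data may contain.
    body = encoded.split(EOD, 1)[0]
    digits = b''.join(body.split())
    if len(digits) % 2:
        digits += b'0'  # a missing final digit is treated as 0
    return binascii.unhexlify(digits)
```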

class pyhanko.pdf_utils.filters.FlateDecode

Bases: Decoder

Implementation of the /FlateDecode filter.

Warning

Currently not all predictor values are supported. This may cause problems when extracting image data from PDF files.

decode(data: bytes, decode_params)

Decode a stream.

Parameters:
  • data – Data to decode.

  • decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Decoded data.

encode(data, decode_params=None)

Encode a stream.

Parameters:
  • data – Data to encode.

  • decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.

Returns:

Encoded data.
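Flate compression in PDF uses the zlib/deflate format, so the core of the filter can be sketched with Python's zlib module. This sketch deliberately ignores predictors and any other decoder parameters, and says nothing about how pyHanko handles them.

```python
import zlib

raw = b'BT /F1 12 Tf (Hello) Tj ET\n' * 50

encoded = zlib.compress(raw)        # what an encoder would store in the stream
decoded = zlib.decompress(encoded)  # what a decoder hands back
```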

pyhanko.pdf_utils.filters.get_generic_decoder(name: str) Decoder

Instantiate a specific stream filter decoder type by (PDF) name.

The following names are recognised:

  • /FlateDecode or /Fl for the decoder implementing Flate compression.

  • /ASCIIHexDecode or /AHx for the decoder that converts bytes to their hexadecimal representations.

  • /ASCII85Decode or /A85 for the decoder that converts byte strings to a base-85 textual representation.

Warning

/Crypt is a special case because it requires access to the document’s security handler.

Warning

LZW compression is currently unsupported, as are most compression methods that are used specifically for image data.

Parameters:

name – Name of the decoder to instantiate.

pyhanko.pdf_utils.generic module

Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.

See here for the original license of the PyPDF2 project.

class pyhanko.pdf_utils.generic.Dereferenceable

Bases: object

Represents an opaque reference to a PDF object associated with a PDF Handler (see PdfHandler).

This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns:

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns:

a PdfHandler.

class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: object = None)

Bases: Dereferenceable

A reference to an object with a certain ID and generation number, with a PDF handler attached to it.

Warning

Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.

Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.

idnum: int

The object’s ID.

generation: int = 0

The object’s generation number (usually 0)

pdf: object = None

The PDF handler associated with this reference, an instance of PdfHandler.

Warning

This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.
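The consequence of ignoring the handler in comparisons can be shown with a small dataclass. ToyReference is a hypothetical stand-in, not the pyHanko class; it only mimics the documented behaviour of excluding the pdf field from equality and hashing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyReference:
    """Hypothetical stand-in for Reference; not the pyHanko class."""
    idnum: int
    generation: int = 0
    # Excluded from equality checks and hashing, as described above.
    pdf: object = field(default=None, compare=False, hash=False)


handler_a, handler_b = object(), object()
ref_a = ToyReference(4, 0, handler_a)
ref_b = ToyReference(4, 0, handler_b)

# Same ID and generation, unrelated handlers: still equal, same hash.
lookup = {ref_a: 'object from document A'}
```

Looking up ref_b in lookup finds the entry stored under ref_a, which is exactly the kind of mix-up the warning asks API users to guard against.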

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns:

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns:

a PdfHandler.

class pyhanko.pdf_utils.generic.TrailerReference(reader)

Bases: Dereferenceable

A reference to the trailer of a PDF document.

Warning

Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.

Parameters:

reader – a PdfFileReader

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns:

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns:

a PdfHandler.

class pyhanko.pdf_utils.generic.PdfObject

Bases: object

Superclass for all PDF objects.

container_ref: Dereferenceable | None = None

For objects read from a file, container_ref points to the unique addressable object containing this object.

Note

Consider the following object definition in a PDF file:

4 0 obj
<< /Foo (Bar) >>

This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.

If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.

get_container_ref() Dereferenceable

Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.

get_object()

Resolves indirect references.

Returns:

self, unless an instance of IndirectObject.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref: Reference | None = None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)

Bases: PdfObject, Dereferenceable

Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.

Warning

For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.

get_object()
Returns:

The PDF object this reference points to.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns:

a PdfHandler.

property idnum: int
Returns:

the object ID of this reference.

property generation
Returns:

the generation number of this reference.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: Dereferenceable)

class pyhanko.pdf_utils.generic.NullObject

Bases: PdfObject

PDF null object.

All instances are treated as equal and falsy.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.BooleanObject(value)

Bases: PdfObject

PDF boolean value.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.FloatObject(value='0')

Bases: Decimal, PdfObject

PDF Float object.

Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).

as_numeric()
Returns:

a Python float value for this object.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NumberObject(value)

Bases: int, PdfObject

PDF number object. This is the PDF type for integer values.

NumberPattern = re.compile(b'[^+-.0-9]')

ByteDot = b'.'

as_numeric()
Returns:

a Python int value for this object.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.ByteStringObject

Bases: bytes, PdfObject

PDF bytestring class.

property original_bytes

For compatibility with TextStringObject.original_bytes

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.TextStringObject

Bases: str, PdfObject

PDF text string object.

autodetected_encoding: TextStringEncoding | None = None

Autodetected encoding when parsing the file.

force_output_encoding: TextStringEncoding | None = None

Output encoding to use when serialising the string. The default is to try PDFDocEncoding first, and fall back to UTF-16BE.
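The default fallback order can be sketched as follows. Python's standard library has no PDFDocEncoding codec, so this sketch uses ASCII as a conservative stand-in (printable ASCII is encoded identically in PDFDocEncoding); encode_text_string is a hypothetical helper, not part of pyHanko.

```python
import codecs


def encode_text_string(text: str) -> bytes:
    """Hypothetical helper; ASCII stands in for PDFDocEncoding."""
    try:
        # Printable ASCII is encoded identically in PDFDocEncoding.
        return text.encode('ascii')
    except UnicodeEncodeError:
        # UTF-16BE text strings start with a byte order mark.
        return codecs.BOM_UTF16_BE + text.encode('utf-16-be')
```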

property original_bytes

Retrieve the original bytes of the string as specified in the source file.

This may be necessary if this string was misidentified as a text string.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NameObject

Bases: str, PdfObject

PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)

Bases: list, PdfObject

PDF array object. This class extends from Python’s list class, and supports its interface.

Warning

Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability & consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.

That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.

raw_get(index, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)

Changed in version 0.14.0: decrypt parameter is no longer boolean

Get a value from an array without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.

Parameters:
Returns:

A PdfObject.
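The difference between regular indexing and raw_get() can be illustrated with two toy classes. ToyIndirect and ToyArray are hypothetical stand-ins for IndirectObject and ArrayObject; the sketch leaves out the decrypt parameter entirely.

```python
class ToyIndirect:
    """Hypothetical stand-in for IndirectObject."""

    def __init__(self, target):
        self.target = target

    def get_object(self):
        return self.target


class ToyArray(list):
    """Hypothetical stand-in for ArrayObject."""

    def __getitem__(self, index):
        value = super().__getitem__(index)
        # Resolve indirect references transparently.
        return value.get_object() if isinstance(value, ToyIndirect) else value

    def raw_get(self, index):
        # Hand back whatever is stored, reference or not.
        return super().__getitem__(index)


ref = ToyIndirect('resolved value')
arr = ToyArray([1, ref])
```

Here arr[1] yields the resolved value, while arr.raw_get(1) returns the reference object itself.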

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref)

class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)

Bases: dict, PdfObject

A PDF dictionary object.

Keys in a PDF dictionary are PDF names, and values are PDF objects.

When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.

raw_get(key: NameObject | str, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)

Changed in version 0.14.0: decrypt parameter is no longer boolean

Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.

Parameters:
Returns:

A PdfObject.

setdefault(key, value=None)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

get_and_apply(key, function: Callable[[PdfObject], Any], *, raw=False, default=None)

get_value_as_reference(key, optional=False) Reference

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False)

class pyhanko.pdf_utils.generic.StreamObject(dict_data: dict | None = None, stream_data: bytes | None = None, encoded_data: bytes | None = None, handler: SecurityHandler | None = None)

Bases: DictionaryObject

PDF stream object.

Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).

A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.

Note

The StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.

Parameters:
  • dict_data – The dictionary data for this stream object.

  • stream_data – The (unencoded) stream data.

  • encoded_data

    The encoded stream data.

    Warning

    Ordinarily, a stream can be initialised either from decoded or from encoded data.

    If both stream_data and encoded_data are provided, the caller is responsible for making sure that both are compatible given the currently relevant filter configuration.

  • handler – A reference to the currently active pyhanko.pdf_utils.crypt.SecurityHandler. This is only necessary if the stream requires crypt filters.

add_crypt_filter(name='/Identity', params=None, handler: SecurityHandler | None = None)

strip_filters()

Ensure the stream is decoded, and remove any filters.

property data: bytes

Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.

Raises:

.misc.PdfStreamError – If the stream could not be decoded.

property encoded_data: bytes

Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.

Raises:

.misc.PdfStreamError – If the stream could not be encoded.

apply_filter(filter_name, params=None, allow_duplicates: bool | None = True)

Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that it is placed last in the encoding order, but first in the decoding order.

Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.

Parameters:
  • filter_name – Name of the filter (see DECODERS)

  • params – Parameters to the filter (will be written to /DecodeParms if not None)

  • allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.
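The ordering rule described for apply_filter() can be illustrated with a small standard-library sketch. It does not use pyHanko: the names SKETCH_FILTERS, prepend_filter, encode and decode are hypothetical, and the hex filter is simplified (a real /ASCIIHexDecode stream also carries an end-of-data marker).

```python
import binascii
import zlib

# Hypothetical stand-ins for two PDF filters, as (encode, decode) pairs.
# These are NOT pyHanko's filter implementations.
SKETCH_FILTERS = {
    "/FlateDecode": (zlib.compress, zlib.decompress),
    "/ASCIIHexDecode": (binascii.hexlify, binascii.unhexlify),
}


def prepend_filter(filter_chain, name):
    """Return a new chain with `name` placed in front of the others."""
    return [name] + list(filter_chain)


def encode(data, filter_chain):
    # Encoding walks the chain back to front, so the first entry runs last.
    for name in reversed(filter_chain):
        data = SKETCH_FILTERS[name][0](data)
    return data


def decode(data, filter_chain):
    # Decoding walks the chain front to back, so the first entry runs first.
    for name in filter_chain:
        data = SKETCH_FILTERS[name][1](data)
    return data


chain = prepend_filter([], "/FlateDecode")
chain = prepend_filter(chain, "/ASCIIHexDecode")
payload = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = encode(payload, chain)
```

Here the hex filter was added last, so it sits first in the chain: the data is compressed first and hex-encoded afterwards, and decoding undoes the hex layer before decompressing.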

compress()

Convenience method to add a /FlateDecode filter with default settings, if one is not already present.

Note: compression is not actually applied until the stream is written.

property is_embedded_file_stream
write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

pyhanko.pdf_utils.generic.read_object(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False) PdfObject

Read a PDF object from an input stream.

Note

The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here for further information.

Parameters:
  • stream – An input stream.

  • container_ref

    A reference to an object containing this one.

    Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PdfObject.

pyhanko.pdf_utils.generic.pdf_name

alias of NameObject

pyhanko.pdf_utils.generic.pdf_string(string: str | bytes | bytearray) ByteStringObject | TextStringObject

Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.

Parameters:

string – A Python string.

pyhanko.pdf_utils.generic.pdf_date(dt: datetime) TextStringObject

Convert a datetime object into a PDF string. This function supports both timezone-aware and naive datetime objects.

Parameters:

dt – The datetime object to convert.

Returns:

A TextStringObject representing the datetime passed in.
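As a rough illustration of what such a conversion involves, the sketch below formats a datetime using only the standard library. It assumes the common `D:YYYYMMDDHHmmSS` layout followed by `Z` or a `+HH'mm'` offset; the function name format_pdf_date is hypothetical, and the exact string produced by pdf_date() (for instance whether a trailing apostrophe is written) may differ.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt):
    """Format a datetime as a PDF-style date string (illustrative only)."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: no time zone information is appended.
        return base
    total_minutes = int(offset.total_seconds()) // 60
    if total_minutes == 0:
        return base + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return "%s%s%02d'%02d'" % (base, sign, hours, minutes)


naive = format_pdf_date(datetime(2024, 3, 5, 14, 30, 0))
utc = format_pdf_date(datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone.utc))
east = format_pdf_date(
    datetime(2024, 3, 5, 14, 30, 0, tzinfo=timezone(timedelta(hours=2)))
)
```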

class pyhanko.pdf_utils.generic.TextStringEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Encodings for PDF text strings.

PDF_DOC = None

PDFDocEncoding (one-byte character codes; PDF-specific).

UTF16BE = (b'\xfe\xff', 'utf-16be')

UTF-16BE encoding.

UTF8 = (b'\xef\xbb\xbf', 'utf-8')

UTF-8 encoding (PDF 2.0)

UTF16LE = (b'\xff\xfe', 'utf-16le')

UTF-16LE encoding.

Note

This is strictly speaking invalid in PDF 2.0, but some authoring tools output such strings anyway (presumably due to the fact that it’s the default wide character encoding on Windows).

encode(string: str) bytes

Encode a string with BOM.

Parameters:

string – The string to encode.

Returns:

The encoded string.

decode(string: bytes | bytearray) str

Decode a string with BOM.

Parameters:

string – The string to decode.

Returns:

The decoded string.

Raises:

UnicodeDecodeError – Raised if decoding fails.
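The byte order mark (BOM) convention behind these enum values can be sketched with the standard library alone. The helper names below are hypothetical and this is not pyHanko's code; strings without a recognised BOM are simply reported as None here, whereas a real processor would fall back to PDFDocEncoding.

```python
# (BOM, codec) pairs mirroring the enum values listed above.
BOM_CODECS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]


def encode_with_bom(text, bom, codec):
    """Encode text and put the marker in front of it."""
    return bom + text.encode(codec)


def decode_with_bom(raw):
    """Pick the codec from the leading marker; None if no marker matches."""
    for bom, codec in BOM_CODECS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec)
    return None


raw = encode_with_bom("caf\u00e9", b"\xfe\xff", "utf-16be")
round_trip = decode_with_bom(raw)
```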

class pyhanko.pdf_utils.generic.EncryptedObjAccess(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Defines what to do when an encrypted object is encountered when retrieving an object from a container.

PROXY = 0

Return the proxy object as-is, and leave further encryption/decryption handling to the caller.

TRANSPARENT = 1

Transparently decrypt the proxy’s content (similarly wrapping any sub-containers in DecryptedObjectProxy, so this applies recursively).

Note

This is the default in most situations, since it’s the least likely to get in the way of any APIs that are not explicitly aware of content encryption concerns.

RAW = 2

Return the underlying raw object as written, without attempting or deferring decryption.

class pyhanko.pdf_utils.generic.DecryptedObjectProxy(raw_object: PdfObject, handler)

Bases: PdfObject

Internal proxy class that allows transparent on-demand encryption of objects.

Warning

Most public-facing APIs won’t leave you to deal with these directly (that’s half the reason this class exists in the first place), and the API of this class is considered internal.

However, for reasons related to the historical PyPDF2 codebase from which pyHanko’s object handling code ultimately derives, there are some Python builtins that might cause these wrapper objects to inadvertently “leak”. Please tell us about such cases so we can make those types of access more convenient and robust.

Danger

The __eq__ implementation on this class is not safe for general use, due to the fact that certain structures in PDF are exempt from encryption. Only compare proxy objects with == in areas of the document where these exemptions don’t apply.

Parameters:
  • raw_object – A raw object, typically as-parsed from a PDF file.

  • handler – The security handler governing this object.

raw_object: PdfObject

The underlying raw object, in its encrypted state.

property decrypted: PdfObject

The decrypted PDF object exposed as a property.

If this object is a container object, its constituent parts will be wrapped in DecryptedObjectProxy as well, in order to defer further decryption until the values are requested through a getter method on the container.

write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

get_object()

Resolves indirect references.

Returns:

self, unless an instance of IndirectObject.

property container_ref

pyhanko.pdf_utils.images module

Utilities for embedding bitmap image data into PDF files.

The image data handling is done by Pillow.

Note

Note that here, too, we only support a subset of what the PDF standard provides for. Most RGB and grayscale images (with or without transparency) that can be read by PIL/Pillow can be used without issue. PNG images with an indexed palette backed by one of these colour spaces can also be used.

Currently there is no support for CMYK images or (direct) support for embedding JPEG-encoded image data as such, but these features may be added later.

pyhanko.pdf_utils.images.pil_image(img: Image, writer: BasePdfFileWriter)

This function writes a PIL/Pillow Image object to a PDF file writer, as an image XObject.

Parameters:
  • img – A Pillow Image object

  • writer – A PDF file writer

Returns:

A reference to the image XObject written.

class pyhanko.pdf_utils.images.PdfImage(image: Image | str, writer: BasePdfFileWriter | None = None, resources: PdfResources | None = None, name: str | None = None, opacity=None, box: BoxConstraints | None = None)

Bases: PdfContent

Wrapper class that implements the PdfContent interface for image objects.

Note

Instances of this class are reusable, in the sense that the implementation is aware of changes to the associated writer object. This allows the same image to be embedded into multiple files without instantiating a new PdfImage every time.

property image_ref: IndirectObject

Return a reference to the image XObject associated with this PdfImage instance. If no such reference is available, it will be created using pil_image(), and the result will be cached until the writer attribute changes (see set_writer()).

Returns:

An indirect reference to an image XObject.

render() bytes

Compile the content to graphics operators.

pyhanko.pdf_utils.incremental_writer module

Utility for writing incremental updates to existing PDF files.

class pyhanko.pdf_utils.incremental_writer.IncrementalPdfFileWriter(input_stream, prev: PdfFileReader | None = None, strict=True)

Bases: BasePdfFileWriter

Class to incrementally update existing files.

This BasePdfFileWriter subclass encapsulates a PdfFileReader instance in addition to exposing an interface to add and modify PDF objects.

Incremental updates to a PDF file append modifications to the end of the file. This is critical when the original file contents are not to be modified directly (e.g. when it contains digital signatures). It has the additional advantage of providing an automatic audit trail of sorts.

Parameters:
  • input_stream – Input stream to read current revision from.

  • strict – Ingest the source file in strict mode. The default is True.

  • prev – Explicitly pass in a PDF reader. This parameter is internal API.
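Why appending matters for signed files can be shown with a toy example that uses no pyHanko code at all. The byte strings below are placeholders rather than valid PDF data, and the variable names are made up for this sketch.

```python
import hashlib

# Placeholder bytes standing in for a signed revision; not a valid PDF.
original_revision = b"%PDF-1.7\n(original objects)\n%%EOF\n"
signed_digest = hashlib.sha256(original_revision).hexdigest()

# An incremental update only appends data after the existing revision.
update_section = b"(new or replaced objects)\n(xref and trailer)\n%%EOF\n"
updated_file = original_revision + update_section

# The bytes covered by the earlier digest are still present, unchanged.
prefix = updated_file[: len(original_revision)]
digest_after_update = hashlib.sha256(prefix).hexdigest()
```

Because the earlier revision remains a byte-for-byte prefix of the updated file, a digest computed over it before the update still matches afterwards.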

IO_CHUNK_SIZE = 4096
classmethod from_reader(reader: PdfFileReader) IncrementalPdfFileWriter

Instantiate an incremental writer from a PDF file reader.

Parameters:

reader – A PdfFileReader object with a PDF to extend.

ensure_output_version(version)
get_object(ido, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters:
  • ido – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PDF object.

mark_update(obj_ref: Reference | IndirectObject)

Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.

Parameters:

obj_ref – An indirect object instance or a reference.

update_container(obj: PdfObject)

Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.

As with mark_update(), this only applies to incremental updates, but defaults to a no-op.

Parameters:

obj – The object whose top-level container needs to be rewritten.

update_root()

Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.

set_info(info: IndirectObject | DictionaryObject | None)

Set the /Info entry of the document trailer.

Parameters:

info – The new /Info dictionary, as an indirect reference.

set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters:
  • key – Dictionary key to use in the trailer.

  • value – Value to set

write(stream)

Write the contents of this PDF writer to a stream.

Parameters:

stream – A writable output stream.

property document_meta_view: DocumentMetadata
write_in_place()

Write the updated file contents in-place to the same stream as the input stream. This obviously requires a stream supporting both reading and writing operations.

encrypt(user_pwd)

Method to handle updates to encrypted files.

This method handles decrypting of the original file, and makes sure the resulting updated file is encrypted in a compatible way. The standard mandates that updates to encrypted files be effected using the same encryption settings. In particular, incremental updates cannot remove file encryption.

Parameters:

user_pwd – The original file’s user password.

Raises:

PdfReadError – Raised when there is a problem decrypting the file.

encrypt_pubkey(credential: EnvelopeKeyDecrypter)

Method to handle updates to files encrypted using public-key encryption.

The same caveats as encrypt() apply here.

Parameters:

credential – The EnvelopeKeyDecrypter handling the recipient’s private key.

Raises:

PdfReadError – Raised when there is a problem decrypting the file.

pyhanko.pdf_utils.layout module

Layout utilities (to be expanded)

exception pyhanko.pdf_utils.layout.LayoutError(msg: str, *args)

Bases: ValueError

Indicates an error in a layout computation.

exception pyhanko.pdf_utils.layout.BoxSpecificationError(msg: str | None = None)

Bases: LayoutError

Raised when a box constraint is over/underspecified.

class pyhanko.pdf_utils.layout.BoxConstraints(width: int | float | None = None, height: int | float | None = None, aspect_ratio: Fraction | None = None)

Bases: object

Represents a box of potentially variable width and height. Among other uses, this can be leveraged to produce a variably sized box with a fixed aspect ratio.

If width/height are not defined yet, they can be set by assigning to the width and height attributes.

property width: int
Returns:

The width of the box.

Raises:

BoxSpecificationError – if the box’s width could not be determined.

property width_defined: bool
Returns:

True if the box currently has a well-defined width, False otherwise.

property height: int
Returns:

The height of the box.

Raises:

BoxSpecificationError – if the box’s height could not be determined.

property height_defined: bool
Returns:

True if the box currently has a well-defined height, False otherwise.

property aspect_ratio: Fraction
Returns:

The aspect ratio of the box.

Raises:

BoxSpecificationError – if the box’s aspect ratio could not be determined.

property aspect_ratio_defined: bool
Returns:

True if the box currently has a well-defined aspect ratio, False otherwise.
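The idea of a box whose missing dimension follows from the aspect ratio can be sketched as below. SketchBox is a hypothetical, standard-library-only class that assumes integer dimensions; it is not BoxConstraints and raises a plain ValueError where pyHanko would raise BoxSpecificationError.

```python
from fractions import Fraction


class SketchBox:
    """Hypothetical box where any two of width, height and ratio fix the third."""

    def __init__(self, width=None, height=None, aspect_ratio=None):
        if None not in (width, height, aspect_ratio):
            if Fraction(width, height) != aspect_ratio:
                raise ValueError("over-specified box with conflicting values")
        self._width = width
        self._height = height
        self._aspect_ratio = aspect_ratio

    @property
    def width(self):
        if self._width is not None:
            return self._width
        if self._height is not None and self._aspect_ratio is not None:
            return self._height * self._aspect_ratio
        raise ValueError("width cannot be determined")

    @property
    def height(self):
        if self._height is not None:
            return self._height
        if self._width is not None and self._aspect_ratio is not None:
            return self._width / self._aspect_ratio
        raise ValueError("height cannot be determined")

    @property
    def aspect_ratio(self):
        if self._aspect_ratio is not None:
            return self._aspect_ratio
        if self._width is not None and self._height is not None:
            return Fraction(self._width, self._height)
        raise ValueError("aspect ratio cannot be determined")


box = SketchBox(height=90, aspect_ratio=Fraction(16, 9))
derived_width = box.width
```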

class pyhanko.pdf_utils.layout.AxisAlignment(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Class representing one-dimensional alignment along an axis.

ALIGN_MIN = 1

Align maximally towards the negative end of the axis.

ALIGN_MID = 2

Center content along the axis.

ALIGN_MAX = 3

Align maximally towards the positive end of the axis.

classmethod from_x_align(align_str: str) AxisAlignment

Convert from a horizontal alignment config string.

Parameters:

align_str – A string: ‘left’, ‘mid’ or ‘right’.

Returns:

An AxisAlignment value.

Raises:

ConfigurationError – on unexpected string inputs.

classmethod from_y_align(align_str: str) AxisAlignment

Convert from a vertical alignment config string.

Parameters:

align_str – A string: ‘bottom’, ‘mid’ or ‘top’.

Returns:

An AxisAlignment value.

Raises:

ConfigurationError – on unexpected string inputs.

property flipped
align(container_len: int, inner_len: int, pre_margin, post_margin) int
class pyhanko.pdf_utils.layout.Margins(left: int = 0, right: int = 0, top: int = 0, bottom: int = 0)

Bases: ConfigurableMixin

Class describing a set of margins.

left: int = 0
right: int = 0
top: int = 0
bottom: int = 0
classmethod uniform(num)

Return a set of uniform margins.

Parameters:

num – The uniform margin to apply to all four sides.

Returns:

Margins(num, num, num, num)

static effective(dim_name, container_len, pre, post)

Internal helper method to compute effective margins.

effective_width(width)

Compute width without margins.

Parameters:

width – The container width.

Returns:

The width after subtracting the left and right margins.

Raises:

LayoutError – if the container width is too short to accommodate the margins.

effective_height(height)

Compute height without margins.

Parameters:

height – The container height.

Returns:

The height after subtracting the top and bottom margins.

Raises:

LayoutError – if the container height is too short to accommodate the margins.
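The margin arithmetic described for effective_width() and effective_height() amounts to a subtraction with a sanity check. The sketch below uses a hypothetical SketchMargins dataclass and a plain ValueError; treating only a negative remainder as an error is an assumption of this sketch, not a statement about pyHanko's exact threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchMargins:
    left: int = 0
    right: int = 0
    top: int = 0
    bottom: int = 0

    def effective_width(self, width):
        # Space left once the horizontal margins are taken out.
        remaining = width - self.left - self.right
        if remaining < 0:
            raise ValueError("container too narrow for the margins")
        return remaining

    def effective_height(self, height):
        # Space left once the vertical margins are taken out.
        remaining = height - self.top - self.bottom
        if remaining < 0:
            raise ValueError("container too short for the margins")
        return remaining


margins = SketchMargins(left=10, right=15, top=5, bottom=5)
usable = (margins.effective_width(100), margins.effective_height(50))
```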

classmethod from_config(config_dict)

Attempt to instantiate an object of the class on which it is called, by means of the configuration settings passed in.

First, we check that the keys supplied in the dictionary correspond to data fields on the current class. Then, the dictionary is processed using the process_entries() method. The resulting dictionary is passed to the initialiser of the current class as a kwargs dict.

Parameters:

config_dict – A dictionary containing configuration values.

Returns:

An instance of the class on which it is called.

Raises:

ConfigurationError – when an unexpected configuration key is encountered or left unfilled, or when there is a problem processing one of the config values.

class pyhanko.pdf_utils.layout.InnerScaling(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Class representing a scaling convention.

NO_SCALING = 1

Never scale content.

STRETCH_FILL = 2

Scale content to fill the entire container.

STRETCH_TO_FIT = 3

Scale content while preserving aspect ratio until either the maximal width or maximal height is reached.

SHRINK_TO_FIT = 4

Scale content down to fit in the container, while preserving the original aspect ratio.

classmethod from_config(config_str: str) InnerScaling

Convert from a configuration string.

Parameters:

config_str – A string: ‘none’, ‘stretch-fill’, ‘stretch-to-fit’, ‘shrink-to-fit’

Returns:

An InnerScaling value.

Raises:

ConfigurationError – on unexpected string inputs.
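One way to read the four scaling conventions is as rules for picking a pair of scale factors. The function below is a hypothetical, standard-library-only sketch keyed on the configuration strings listed above; it ignores margins and is not pyHanko's implementation.

```python
def sketch_scale(mode, container_w, container_h, inner_w, inner_h):
    """Return (x_scale, y_scale) for a scaling convention (illustrative)."""
    x_ratio = container_w / inner_w
    y_ratio = container_h / inner_h
    if mode == "none":
        return 1.0, 1.0
    if mode == "stretch-fill":
        # Each axis is scaled independently, so the aspect ratio may change.
        return x_ratio, y_ratio
    # Both remaining modes keep the aspect ratio: one factor for both axes.
    uniform = min(x_ratio, y_ratio)
    if mode == "stretch-to-fit":
        return uniform, uniform
    if mode == "shrink-to-fit":
        # Never enlarge content, only shrink it when it does not fit.
        uniform = min(uniform, 1.0)
        return uniform, uniform
    raise ValueError("unknown scaling mode: %r" % mode)


fill = sketch_scale("stretch-fill", 200, 100, 50, 50)
fit = sketch_scale("stretch-to-fit", 200, 100, 50, 50)
shrink_small = sketch_scale("shrink-to-fit", 200, 100, 50, 50)
shrink_large = sketch_scale("shrink-to-fit", 200, 100, 400, 100)
```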

class pyhanko.pdf_utils.layout.SimpleBoxLayoutRule(x_align: AxisAlignment, y_align: AxisAlignment, margins: Margins = Margins(left=0, right=0, top=0, bottom=0), inner_content_scaling: InnerScaling = InnerScaling.SHRINK_TO_FIT)

Bases: ConfigurableMixin

Class describing alignment, scaling and margin rules for a box positioned inside another box.

x_align: AxisAlignment

Horizontal alignment settings.

y_align: AxisAlignment

Vertical alignment settings.

margins: Margins = Margins(left=0, right=0, top=0, bottom=0)

Container (inner) margins. Defaults to all zeroes.

inner_content_scaling: InnerScaling = 4

Inner content scaling rule.

classmethod process_entries(config_dict)

Hook method that can modify the configuration dictionary to overwrite or tweak some of its values (e.g. to convert string parameters into more complex Python objects).

Subclasses that override this method should call super().process_entries(), and leave keys that they do not recognise untouched.

Parameters:

config_dict – A dictionary containing configuration values.

Raises:

ConfigurationError – when there is a problem processing a relevant entry.

substitute_margins(new_margins: Margins) SimpleBoxLayoutRule
fit(container_box: BoxConstraints, inner_nat_width: int, inner_nat_height: int) Positioning

Position and possibly scale a box within a container, according to this layout rule.

Parameters:
  • container_box – BoxConstraints describing the container.

  • inner_nat_width – The inner box’s natural width.

  • inner_nat_height – The inner box’s natural height.

Returns:

A Positioning describing the scaling & position of the lower left corner of the inner box.

class pyhanko.pdf_utils.layout.Positioning(x_pos: int, y_pos: int, x_scale: float, y_scale: float)

Bases: ConfigurableMixin

Class describing the position and scaling of an object in a container.

x_pos: int

Horizontal coordinate

y_pos: int

Vertical coordinate

x_scale: float

Horizontal scaling

y_scale: float

Vertical scaling

as_cm()

Convenience method to convert this Positioning into a PDF cm operator.

Returns:

A byte string representing the cm operator corresponding to this Positioning.
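For orientation, the PDF cm operator takes six numbers `a b c d e f`; a pure scale-and-translate transformation sets `b` and `c` to zero. The sketch below builds such an operator with a hypothetical helper, sketch_cm; the number formatting used by as_cm() itself may differ.

```python
def sketch_cm(x_pos, y_pos, x_scale, y_scale):
    """Build a scale-and-translate cm operator as bytes (illustrative)."""
    # Operand order is: a b c d e f, with a/d the scales and e/f the offsets.
    text = "%g 0 0 %g %g %g cm" % (x_scale, y_scale, x_pos, y_pos)
    return text.encode("ascii")


operator = sketch_cm(10, 20, 0.5, 0.5)
```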

pyhanko.pdf_utils.misc module

Utility functions for PDF library. Taken from PyPDF2 with modifications and additions, see here for the original license of the PyPDF2 project.

Generally, all of these constitute internal API, except for the exception classes.

exception pyhanko.pdf_utils.misc.PdfError(msg: str, *args)

Bases: Exception

exception pyhanko.pdf_utils.misc.PdfReadError(msg: str, *args)

Bases: PdfError

exception pyhanko.pdf_utils.misc.PdfStrictReadError(msg: str, *args)

Bases: PdfReadError

exception pyhanko.pdf_utils.misc.PdfWriteError(msg: str, *args)

Bases: PdfError

exception pyhanko.pdf_utils.misc.PdfStreamError(msg: str, *args)

Bases: PdfReadError

exception pyhanko.pdf_utils.misc.IndirectObjectExpected(msg: str | None = None)

Bases: PdfReadError

pyhanko.pdf_utils.misc.get_and_apply(dictionary: dict, key, function: Callable, *, default=None)
class pyhanko.pdf_utils.misc.OrderedEnum(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Ordered enum (from the Python documentation)

class pyhanko.pdf_utils.misc.StringWithLanguage(value: str, lang_code: str | None = None, country_code: str | None = None)

Bases: object

A string with a language attached to it.

value: str
lang_code: str | None = None
country_code: str | None = None
pyhanko.pdf_utils.misc.is_regular_character(byte_value: int)
pyhanko.pdf_utils.misc.read_non_whitespace(stream, seek_back=False, allow_eof=False)

Finds and reads the next non-whitespace character (ignores whitespace).

pyhanko.pdf_utils.misc.read_until_whitespace(stream, maxchars: int | None = None) bytes

Reads non-whitespace characters and returns them. Stops upon encountering whitespace, or, if maxchars is not None, when maxchars is reached.

Parameters:
  • stream – stream to read

  • maxchars – maximal number of bytes to read before returning

pyhanko.pdf_utils.misc.read_until_delimiter(stream) bytes

Read until a token delimiter (i.e. a delimiter character or a PDF whitespace character) is encountered, and rewind the stream to the previous character.

Parameters:

stream – A stream.

Returns:

The bytes read.
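The read-then-rewind behaviour can be sketched on an in-memory stream. The function sketch_read_token and the two constants are hypothetical; the character sets follow the usual PDF whitespace and delimiter definitions, and this is not the library's own code.

```python
import io

PDF_WHITESPACE = b"\x00\t\n\x0c\r "
PDF_DELIMITERS = b"()<>[]{}/%"


def sketch_read_token(stream):
    """Read bytes up to, but not including, the next delimiter or whitespace."""
    result = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: return whatever was collected.
            break
        if byte in PDF_WHITESPACE or byte in PDF_DELIMITERS:
            # Step back so the delimiter is the next byte to be read.
            stream.seek(-1, io.SEEK_CUR)
            break
        result += byte
    return bytes(result)


stream = io.BytesIO(b"obj<</Type /Page>>")
token = sketch_read_token(stream)
next_byte = stream.read(1)
```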

pyhanko.pdf_utils.misc.read_until_regex(stream, regex, ignore_eof: bool = False)

Reads until the regular expression pattern is matched (ignoring the match). Raises PdfStreamError on premature end-of-file.

Parameters:
  • stream – stream to search

  • regex – regex to match

  • ignore_eof – if true, ignore end-of-file and return immediately

Raises:

PdfStreamError – on premature EOF

pyhanko.pdf_utils.misc.skip_over_whitespace(stream, stop_after_eol=False) bool

Similar to read_non_whitespace(), but returns a bool indicating whether more than one whitespace character was read.

Will return the cursor to before the first non-whitespace character encountered, or after the first end-of-line sequence if one is encountered.

pyhanko.pdf_utils.misc.skip_over_comment(stream) bool

Skip over a comment and position the cursor at the first byte after the EOL sequence following the comment. If there is no comment under the cursor, do nothing.

Parameters:

stream – stream to read

Returns:

True if a comment was read.

pyhanko.pdf_utils.misc.instance_test(cls)
pyhanko.pdf_utils.misc.peek(itr)
pyhanko.pdf_utils.misc.assert_writable_and_random_access(output)

Raise an error if the buffer in question is not writable, and return a boolean to indicate whether it supports random-access reading.

Parameters:

output – The output buffer to check.

Returns:

True if the buffer supports random-access reading, False otherwise.

pyhanko.pdf_utils.misc.prepare_rw_output_stream(output)

Prepare an output stream that supports both reading and writing. Intended to be used for writing & updating signed files: when producing a signature, we render the PDF to a byte buffer with placeholder values for the signature data, or straight to the provided output stream if possible.

More precisely: this function will return the original output stream if it is writable, readable and seekable. If the output parameter is None, not readable or not seekable, this function will return a BytesIO instance instead. If the output parameter is not None and not writable, IOError will be raised.

Parameters:

output – A writable file-like object, or None.

Returns:

A file-like object that supports reading, writing and seeking.
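The decision rule spelled out above can be written down with the standard io module. This is a sketch of the described behaviour, not pyHanko's source; sketch_prepare_rw and the two tiny stream classes are hypothetical names introduced for the example.

```python
import io


def sketch_prepare_rw(output=None):
    """Return a stream that supports reading, writing and seeking."""
    if output is None:
        return io.BytesIO()
    if not output.writable():
        raise IOError("output stream is not writable")
    if not (output.readable() and output.seekable()):
        # Writable but not suitable for read-back: use a buffer instead.
        return io.BytesIO()
    return output


class WriteOnlyStream(io.RawIOBase):
    """Hypothetical stream that can only be written to."""

    def writable(self):
        return True

    def write(self, data):
        return len(data)


class ReadOnlyStream(io.RawIOBase):
    """Hypothetical stream that can only be read from."""

    def readable(self):
        return True


in_memory = io.BytesIO()
same_stream = sketch_prepare_rw(in_memory)
replacement = sketch_prepare_rw(WriteOnlyStream())
```

A caller that receives a replacement buffer still has to copy its contents to the original output afterwards.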

pyhanko.pdf_utils.misc.finalise_output(orig_output, returned_output)

Several internal APIs transparently replace non-readable/seekable buffers with BytesIO for signing operations, but we don’t want to expose that to the public API user. This internal API function handles the unwrapping.

pyhanko.pdf_utils.misc.DEFAULT_CHUNK_SIZE = 4096

Default chunk size for stream I/O.

pyhanko.pdf_utils.misc.chunked_write(temp_buffer: bytearray, stream, output, max_read=None)
pyhanko.pdf_utils.misc.chunked_digest(temp_buffer: bytearray, stream, md, max_read=None)
pyhanko.pdf_utils.misc.chunk_stream(temp_buffer: bytearray | memoryview, stream, max_read=None) Iterable[bytearray | memoryview]
class pyhanko.pdf_utils.misc.ConsList(head: 'Optional[ListElem]', tail: 'Optional[ConsList[ListElem]]' = None)

Bases: Generic[ListElem]

head: ListElem | None
tail: ConsList[ListElem] | None = None
static empty() ConsList[ListElem]
static sing(value: ListElem) ConsList[ListElem]
property last: ListElem | None
cons(head: ListElem) ConsList[ListElem]
class pyhanko.pdf_utils.misc.Singleton(name, bases, dct)

Bases: type

pyhanko.pdf_utils.misc.rd(x)
pyhanko.pdf_utils.misc.isoparse(dt_str: str) datetime
pyhanko.pdf_utils.misc.lift_iterable_async(i: Iterable[X]) CancelableAsyncIterator[X]

pyhanko.pdf_utils.qr module

class pyhanko.pdf_utils.qr.PdfStreamQRImage(border, width, box_size, *args, **kwargs)

Bases: BaseImage

Quick-and-dirty implementation of the Image interface required by the qrcode package.

kind: str | None = 'PDF'
allowed_kinds: Tuple[str] | None = ('PDF',)
qr_color = (0, 0, 0)
new_image(**kwargs)

Build the image class. Subclasses should return the class created.

drawrect(row, col)

Draw a single rectangle of the QR code.

append_single_rect(command_stream, row, col)
format_qr_color()
setup_drawing_area()
render_command_stream()
save(stream, kind=None)

Save the image file.

process()

Processes QR code after completion

drawrect_context(row, col, active, context)

Draw a single rectangle of the QR code given the surrounding context

class pyhanko.pdf_utils.qr.PdfFancyQRImage(border, width, box_size, *_args, version, center_image: PdfContent | None = None, **kwargs)

Bases: PdfStreamQRImage

centerpiece_corner_radius = 0.2
save(stream, kind=None)

Save the image file.

process()

Processes QR code after completion

append_single_rect(command_stream, row, col)
is_major_position_pattern(row, col)
is_position_pattern(row, col)
draw_position_patterns()
draw_centerpiece()
setup_drawing_area()
render_command_stream()
pyhanko.pdf_utils.qr.rounded_square(x_pos: float, y_pos: float, sz: float, rad: float) List[bytes]

Add a subpath of a square with rounded corners at the given position. Doesn’t include any painting or clipping operations.

The path is drawn counterclockwise.

Parameters:
  • x_pos – The x-coordinate of the enveloping square’s lower left corner.

  • y_pos – The y-coordinate of the enveloping square’s lower left corner.

  • sz – The side length of the enveloping square.

  • rad – The corner radius.

Returns:

A list of graphics operators.
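A counterclockwise rounded-square subpath can be approximated with straight segments and one cubic Bézier curve per corner. The sketch below is an independent illustration: the helper name, the 0.5523 control-point factor, the choice of starting point and the closing `h` operator are assumptions of this example, and the operators actually emitted by rounded_square() may differ.

```python
KAPPA = 0.5523  # approximate control-point offset for a quarter circle


def sketch_rounded_square(x, y, sz, rad):
    """Return path operators for a rounded square, drawn counterclockwise."""
    k = KAPPA * rad
    x1, y1 = x + sz, y + sz  # upper right corner of the enveloping square

    def op(name, *coords):
        return (" ".join("%g" % c for c in coords) + " " + name).encode("ascii")

    return [
        op("m", x + rad, y),  # start on the bottom edge
        op("l", x1 - rad, y),  # bottom edge, left to right
        op("c", x1 - rad + k, y, x1, y + rad - k, x1, y + rad),  # lower right
        op("l", x1, y1 - rad),  # right edge, upwards
        op("c", x1, y1 - rad + k, x1 - rad + k, y1, x1 - rad, y1),  # upper right
        op("l", x + rad, y1),  # top edge, right to left
        op("c", x + rad - k, y1, x, y1 - rad + k, x, y1 - rad),  # upper left
        op("l", x, y + rad),  # left edge, downwards
        op("c", x, y + rad - k, x + rad - k, y, x + rad, y),  # lower left
        b"h",  # close the subpath; no painting operator is included
    ]


operators = sketch_rounded_square(0, 0, 10, 2)
```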

pyhanko.pdf_utils.reader module

Utility to read PDF files. Contains code from the PyPDF2 project; see here for the original license.

The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.

class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict: bool = True)

Bases: PdfHandler

Class implementing functionality to read a PDF file and cache certain data about it.

xrefs: XRefCache
last_startxref = None
has_xref_stream = False
property security_handler
property document_meta_view: DocumentMetadata
property input_version
property encrypt_dict: DictionaryObject | None
property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns:

A generic.DictionaryObject representing the current state of the document trailer.

property root_ref: Reference
Returns:

A reference to the document catalog of this PDF handler.

property document_id: Tuple[bytes, bytes]
get_historical_root(revision: int)

Get the document catalog for a specific revision.

Parameters:

revision – The revision to query, the oldest one being 0.

Returns:

The value of the document catalog dictionary for that revision.

property total_revisions: int
Returns:

The total number of revisions made to this file.

get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True, as_metadata_stream=False)

Read an object from the input stream.

Parameters:
  • ref – Reference to the object.

  • revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.

  • never_decrypt – Skip decryption step (only needed for parsing /Encrypt)

  • transparent_decrypt

    If True, all encrypted objects are transparently decrypted by default (in the sense that a user of the API in a PyPDF2 compatible way would only “see” decrypted objects). If False, this method may return a proxy object that still allows access to the “original”.

    Danger

    The encryption parameters are considered internal, undocumented API, and subject to change without notice.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PdfObject.

Raises:

PdfReadError – Raised if there is an issue reading the object from the file.

cache_get_indirect_object(generation, idnum)
cache_indirect_object(generation, idnum, obj)
decrypt(password: str | bytes) AuthResult

When using an encrypted PDF file with the standard PDF encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

Both legacy encryption schemes and PDF 2.0 encryption (based on AES-256) are supported.

Danger

Supplying either user or owner password will work. Cryptographically, both allow the decryption key to be computed, but processors are expected to adhere to the /P flags in the encryption dictionary when accessing a file with the user password. Currently, pyHanko does not enforce these restrictions, but it may in the future.

Danger

One should also be aware that the legacy encryption schemes used prior to PDF 2.0 are (very) weak, and we only support them for compatibility reasons. Under no circumstances should these still be used to encrypt new files.

Parameters:

password – The password to match.

decrypt_pubkey(credential: EnvelopeKeyDecrypter) AuthResult

Decrypt a PDF file encrypted using public-key encryption by providing a credential representing the private key of one of the recipients.

Danger

The same caveats as in decrypt() w.r.t. permission handling apply to this method.

Danger

The robustness of the public key cipher being used is not the only factor in the security of public-key encryption in PDF. The standard still permits weak schemes to encrypt the actual file data and file keys. PyHanko uses sane defaults everywhere, but other software may not.

Parameters:

credential – The EnvelopeKeyDecrypter handling the recipient’s private key.

property encrypted
Returns:

True if a document is encrypted, False otherwise.

get_historical_resolver(revision: int) HistoricalResolver

Return a PdfHandler instance that provides a view on the file at a specific revision.

Parameters:

revision – The revision number to use, with 0 being the oldest.

Returns:

An instance of HistoricalResolver.

property embedded_signatures
Returns:

The signature objects embedded in this document, in signing order; see EmbeddedPdfSignature.

property embedded_regular_signatures
Returns:

The signature objects of type /Sig embedded in this document, in signing order; see EmbeddedPdfSignature.

property embedded_timestamp_signatures
Returns:

The signature objects of type /DocTimeStamp embedded in this document, in signing order; see EmbeddedPdfSignature.

class pyhanko.pdf_utils.reader.HistoricalResolver(reader: PdfFileReader, revision)

Bases: PdfHandler

PdfHandler implementation that provides a view on a particular revision of a PDF file.

Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.

Instances of this class cache the result of get_object() calls.

Danger

This class is documented, but is nevertheless considered internal API, and easy to misuse.

In particular, the container_ref attribute must not be relied upon for objects retrieved from a HistoricalResolver. Internally, it is only used to make lazy decryption work in historical revisions.

Note

Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.

property document_meta_view: DocumentMetadata
property document_id: Tuple[bytes, bytes]
property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns:

A generic.DictionaryObject representing the current state of the document trailer.

get_object(ref: Reference, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters:
  • ref – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PDF object.

property root_ref: Reference
Returns:

A reference to the document catalog of this PDF handler.

explicit_refs_in_revision()
refs_freed_in_revision()
object_streams_used()
is_ref_available(ref: Reference) bool

Check if the reference in question was in scope for this revision. This call doesn’t care about the specific semantics of free vs. used objects; it conservatively answers ‘no’ in any situation where the object ID _could_ have been assigned by the revision in question.

Parameters:

ref – A reference object (usually one written to by a newer revision)

Returns:

True if the reference is unassignable, False otherwise.

collect_dependencies(obj: PdfObject, since_revision=None) Set[Reference]

Collect all indirect references used by an object and its descendants.

Parameters:
  • obj – The object to inspect.

  • since_revision

    Optionally specify a revision number that tells the scanner to only include object IDs that were added in that revision or later.

    Warning

    In particular, this means that the scanner will not recurse into older objects either.

Returns:

A set of Reference objects.
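
The traversal and the since_revision cut-off described above can be illustrated with a small stand-alone sketch. It does not use pyHanko: Ref, the objects table and the introduced_in table are hypothetical stand-ins for indirect references, the object store and the revision bookkeeping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (not pyHanko's Reference)."""
    idnum: int


def collect_refs(obj, objects, introduced_in, since_revision=None, seen=None):
    """Collect every Ref reachable from obj.

    objects maps Ref -> the object it points to; introduced_in maps
    Ref -> the revision that first added it (both are toy lookup tables).
    """
    seen = set() if seen is None else seen
    if isinstance(obj, Ref):
        if obj in seen:
            return seen  # already visited, also guards against cycles
        if since_revision is not None and introduced_in[obj] < since_revision:
            # Older object: neither reported nor recursed into.
            return seen
        seen.add(obj)
        collect_refs(objects[obj], objects, introduced_in, since_revision, seen)
    elif isinstance(obj, dict):
        for value in obj.values():
            collect_refs(value, objects, introduced_in, since_revision, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            collect_refs(item, objects, introduced_in, since_revision, seen)
    return seen


objects = {
    Ref(1): {"/Kids": [Ref(2), Ref(3)]},
    Ref(2): {"/Font": Ref(4)},
    Ref(3): {"/Parent": Ref(1)},
    Ref(4): "font data",
}
introduced_in = {Ref(1): 0, Ref(2): 0, Ref(3): 1, Ref(4): 1}
```

In this toy data, Ref(4) was added in revision 1 but is only reachable through the older Ref(2), so a scan with since_revision=1 does not report it. That is the consequence the warning above points at.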

pyhanko.pdf_utils.reader.parse_catalog_version(version_str) Tuple[int, int] | None
class pyhanko.pdf_utils.reader.RawPdfPath(*path: str | int)

Bases: object

Class to model raw paths in a file.

This class is internal API.

walk_nodes(from_obj, transparent_dereference=True) Generator[Tuple[int | str | None, PdfObject], None, None]
access_on(from_obj, dereference_last=True) PdfObject
access_reference_on(from_obj) Reference
pyhanko.pdf_utils.reader.process_data_at_eof(stream) int

Auxiliary function that reads backwards from the current position in a stream to find the EOF marker and startxref value.

This is internal API.

Parameters:

stream – A stream to read from

Returns:

The value of the startxref pointer, if found. Otherwise a PdfReadError is raised.
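
As a rough illustration of the idea, the sketch below locates the last startxref value in a seekable binary stream. It is a simplification and not pyHanko's implementation: find_startxref and tail_size are hypothetical names, it inspects a fixed-size tail of the file rather than reading backwards step by step, and it raises ValueError rather than PdfReadError.

```python
import io
import re


def find_startxref(stream, tail_size=1024):
    """Return the startxref value found near the end of a seekable binary stream."""
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    stream.seek(max(0, end - tail_size))
    tail = stream.read()
    # Use the last match so that an earlier revision's marker that happens
    # to fall inside the window is ignored.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near end of file")
    return int(matches[-1])
```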

pyhanko.pdf_utils.rw_common module

Utilities common to reading and writing PDF files.

class pyhanko.pdf_utils.rw_common.PdfHandler

Bases: object

Abstract class providing a general interface for querying objects in PDF readers and writers alike.

get_object(ref: Reference, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters:
  • ref – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PDF object.

property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns:

A generic.DictionaryObject representing the current state of the document trailer.

property document_meta_view: DocumentMetadata
property root_ref: Reference
Returns:

A reference to the document catalog of this PDF handler.

property root: DictionaryObject
Returns:

The document catalog of this PDF handler.

property document_id: Tuple[bytes, bytes]
find_page_container(page_ix)

Retrieve the node in the page tree containing the page with index page_ix, along with the necessary objects to modify it in an incremental update scenario.

Parameters:

page_ix – The (zero-indexed) number of the page for which we want to retrieve the parent. A negative number counts pages from the back of the document, with index -1 referring to the last page.

Returns:

A triple with the /Pages object (or a reference to it), the index of the target page in said /Pages object, and a (possibly inherited) resource dictionary.
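
The lookup can be pictured with plain dictionaries standing in for page tree nodes. This is a sketch under assumptions, not pyHanko's code: the helper below is a free function, it returns only the parent node and the index within its /Kids array, and it leaves out the inherited resource dictionary that the real method also returns.

```python
def find_page_container(root_pages, page_ix):
    """Return (parent /Pages node, index within its /Kids) for page page_ix.

    Nodes are plain dicts: intermediate nodes carry '/Kids' and '/Count',
    leaves are page dicts without '/Kids'.
    """
    total = root_pages["/Count"]
    if page_ix < 0:
        page_ix += total  # -1 becomes the last page
    if not 0 <= page_ix < total:
        raise IndexError("page index out of range")

    node = root_pages
    while True:
        for kid_ix, kid in enumerate(node["/Kids"]):
            if "/Kids" not in kid:
                # Leaf page.
                if page_ix == 0:
                    return node, kid_ix
                page_ix -= 1
            elif page_ix < kid["/Count"]:
                node = kid  # descend into this subtree
                break
            else:
                page_ix -= kid["/Count"]  # skip this subtree
        else:
            raise ValueError("page tree /Count values are inconsistent")
```

Relying on /Count lets the search skip whole subtrees instead of visiting every page.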

find_page_for_modification(page_ix)

Retrieve the page with index page_ix from the page tree, along with the necessary objects to modify it in an incremental update scenario.

Parameters:

page_ix – The (zero-indexed) number of the page to retrieve. A negative number counts pages from the back of the document, with index -1 referring to the last page.

Returns:

A tuple with a reference to the page object and a (possibly inherited) resource dictionary.

pyhanko.pdf_utils.text module

Utilities related to text rendering & layout.

class pyhanko.pdf_utils.text.TextStyle(font: ~pyhanko.pdf_utils.font.api.FontEngineFactory = <factory>, font_size: int = 10, leading: int | None = None)

Bases: ConfigurableMixin

Container for basic text styling settings.

font: FontEngineFactory

The FontEngineFactory to be used for this text style. Defaults to Courier (as a non-embedded standard font).

font_size: int = 10

Font size to be used.

leading: int | None = None

Text leading. If None, the font_size parameter is used instead.

classmethod process_entries(config_dict)

Hook method that can modify the configuration dictionary to overwrite or tweak some of its values (e.g. to convert string parameters into more complex Python objects).

Subclasses that override this method should call super().process_entries(), and leave keys that they do not recognise untouched.

Parameters:

config_dict – A dictionary containing configuration values.

Raises:

ConfigurationError – when there is a problem processing a relevant entry.

class pyhanko.pdf_utils.text.TextBoxStyle(font: ~pyhanko.pdf_utils.font.api.FontEngineFactory = <factory>, font_size: int = 10, leading: int | None = None, border_width: int = 0, box_layout_rule: ~pyhanko.pdf_utils.layout.SimpleBoxLayoutRule | None = None, vertical_text: bool = False)

Bases: TextStyle

Extension of TextStyle for use in text boxes.

border_width: int = 0

Border width, if applicable.

box_layout_rule: SimpleBoxLayoutRule | None = None

Layout rule to nest the text within its containing box.

Warning

This only affects the position of the text object, not the alignment of the text within.

vertical_text: bool = False

Switch layout code to vertical mode instead of horizontal mode.

class pyhanko.pdf_utils.text.TextBox(style: TextBoxStyle, writer, resources: PdfResources | None = None, box: BoxConstraints | None = None, font_name='F1')

Bases: PdfContent

Implementation of a text box that implements the PdfContent interface.

Note

Text boxes currently don’t offer automatic word wrapping.

put_string_line(txt)
property content_lines
Returns:

Text content of the text box, broken up into lines.

property content
Returns:

The actual text content of the text box. This is a modifiable property.

In text boxes that don’t have a fixed size, setting this property can cause the text box to be resized.

property leading
Returns:

The effective leading value, i.e. the leading attribute of the associated TextBoxStyle, or font_size if not specified.
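
The fallback rule can be written down in a few lines. The sketch below is stand-alone: ToyTextStyle, effective_leading and baseline_offsets are hypothetical names that only mirror the font_size and leading fields documented above, and the even spacing of baselines is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyTextStyle:
    """Hypothetical stand-in mirroring the font_size/leading fields above."""
    font_size: int = 10
    leading: Optional[int] = None


def effective_leading(style):
    # Fall back to the font size only when no leading was given at all.
    return style.font_size if style.leading is None else style.leading


def baseline_offsets(style, line_count):
    """Vertical offset of each line's baseline below the first line."""
    step = effective_leading(style)
    return [i * step for i in range(line_count)]
```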

render()

Compile the content to graphics operators.

pyhanko.pdf_utils.writer module

Utilities for writing PDF files. Contains code from the PyPDF2 project; see that project for the original license.

class pyhanko.pdf_utils.writer.BasePdfFileWriter(root: IndirectObject | DictionaryObject, info: IndirectObject | DictionaryObject | None, document_id: ArrayObject, obj_id_start: int = 0, stream_xrefs: bool = True)

Bases: PdfHandler

Base class for PDF writers.

output_version = (1, 7)

Output version to be declared in the output file.

stream_xrefs: bool

Boolean controlling whether or not the output file will contain its cross-references in stream format, or as a classical XRef table.

The default for new files is True. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).

get_subset_collection(base_postscript_name: str)
property document_meta: DocumentMetadata
property document_meta_view: DocumentMetadata
ensure_output_version(version)
set_info(info: IndirectObject | DictionaryObject | None) IndirectObject | None

Set the /Info entry of the document trailer.

Parameters:

info – The new /Info dictionary, as an indirect reference.

set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters:
  • key – Dictionary key to use in the trailer.

  • value – Value to set

property document_id: Tuple[bytes, bytes]
mark_update(obj_ref: Reference | IndirectObject)

Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.

Parameters:

obj_ref – An indirect object instance or a reference.

update_container(obj: PdfObject)

Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.

As with mark_update(), this only applies to incremental updates, but defaults to a no-op.

Parameters:

obj – The object whose top-level container needs to be rewritten.

property root_ref: Reference
Returns:

A reference to the document catalog.

update_root()

Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.

register_extension(ext: DeveloperExtension)
get_object(ido, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters:
  • ref – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns:

A PDF object.

allocate_placeholder() IndirectObject

Allocate an object reference to populate later. Calls to get_object() for this reference will return NullObject until it is populated using add_object().

This method is only relevant in certain advanced contexts where an object ID needs to be known before the object it refers to can be built; chances are you’ll never need it.

Returns:

An IndirectObject instance referring to the object just allocated.

add_object(obj, obj_stream: ObjectStream | None = None, idnum=None) IndirectObject

Add a new object to this writer.

Parameters:
  • obj – The object to add.

  • obj_stream – An object stream to add the object to.

  • idnum – Manually specify the object ID of the object to be added. This is only allowed for object IDs that have previously been allocated using allocate_placeholder().

Returns:

An IndirectObject instance referring to the object just added.
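
The interplay between allocate_placeholder() and the idnum argument of add_object() is easiest to see when two objects must refer to each other. The sketch below models this with a toy object table. ToyWriter is hypothetical and much simpler than the real writer: it hands out bare integers instead of IndirectObject instances and returns None where the text above mentions NullObject.

```python
class ToyWriter:
    """Hypothetical, heavily simplified object table (not pyHanko's writer)."""

    def __init__(self):
        self._objects = {}
        self._placeholders = set()
        self._next_id = 1

    def allocate_placeholder(self):
        idnum = self._next_id
        self._next_id += 1
        self._placeholders.add(idnum)
        return idnum

    def add_object(self, obj, idnum=None):
        if idnum is None:
            idnum = self._next_id
            self._next_id += 1
        elif idnum not in self._placeholders:
            raise ValueError("idnum was not allocated as a placeholder")
        else:
            self._placeholders.remove(idnum)
        self._objects[idnum] = obj
        return idnum

    def get_object(self, idnum):
        # Unpopulated placeholders resolve to None in this toy model.
        return self._objects.get(idnum)


writer = ToyWriter()
parent_id = writer.allocate_placeholder()            # ID is known up front
child_id = writer.add_object({"/Parent": parent_id})  # child can point at it
writer.add_object({"/Kids": [child_id]}, idnum=parent_id)
```

The parent's ID has to exist before the child is built, and the parent's content needs the child's ID, so the placeholder breaks the cycle.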

prepare_object_stream(compress=True)

Prepare and return a new ObjectStream object.

Parameters:

compress – Indicates whether the resulting object stream should be compressed.

Returns:

An ObjectStream object.

property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns:

A generic.DictionaryObject representing the current state of the document trailer.

write(stream)

Write the contents of this PDF writer to a stream.

Parameters:

stream – A writable output stream.

register_annotation(page_ref, annot_ref)

Register an annotation to be added to a page. This convenience function takes care of calling mark_update() where necessary.

Parameters:
  • page_ref – Reference to the page object involved.

  • annot_ref – Reference to the annotation object to be added.

insert_page(new_page, after=None)

Insert a page object into the tree.

Parameters:
  • new_page – Page object to insert.

  • after – Page number (zero-indexed) after which to insert the page.

Returns:

A reference to the newly inserted page.

import_object(obj: PdfObject, obj_stream: ObjectStream | None = None) PdfObject

Deep-copy an object into this writer, dealing with resolving indirect references in the process.

Danger

The table mapping indirect references in the input to indirect references in the writer is not preserved between calls. Concretely, this means that invoking import_object() twice on the same input reader may cause object duplication.

Parameters:
  • obj – The object to import.

  • obj_stream

    The object stream to import objects into.

    Note

    Stream objects and bare references will not be put into the object stream; the standard forbids this.

Returns:

The object as associated with this writer. If the input object was an indirect reference, a dictionary (incl. streams) or an array, the returned value will always be a new instance.

import_page_as_xobject(other: PdfHandler, page_ix=0, inherit_filters=True)

Import a page content stream from some other PdfHandler into the current one as a form XObject.

Parameters:
  • other – A PdfHandler

  • page_ix – Index of the page to copy (default: 0)

  • inherit_filters – Inherit the content stream’s filters, if present.

Returns:

An IndirectObject referring to the form XObject as added to the current writer.

add_stream_to_page(page_ix, stream_ref, resources=None, prepend=False)

Append an indirect stream object to a page in a PDF as a content stream.

Parameters:
  • page_ix – Index of the page to modify. The first page has index 0.

  • stream_ref – IndirectObject reference to the stream object to add.

  • resources – Resource dictionary containing resources to add to the page’s existing resource dictionary.

  • prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.

Returns:

An IndirectObject reference to the page object that was modified.

merge_resources(orig_dict: DictionaryObject, new_dict: DictionaryObject) bool

Update an existing resource dictionary object with data from another one. Returns True if the original dict object was modified directly.

The caller is responsible for avoiding name conflicts with existing resources.
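
One way for a caller to avoid name conflicts is to pick a resource name that is not yet taken before merging. The sketch below shows that idea on plain dictionaries. Both helpers are hypothetical and unrelated to pyHanko's internals, and raising on a clash is a choice made for this sketch, not a statement about what merge_resources() does.

```python
def fresh_resource_name(category_dict, prefix="/F"):
    """Return a name like '/F1' that is not yet a key of category_dict."""
    n = 1
    while f"{prefix}{n}" in category_dict:
        n += 1
    return f"{prefix}{n}"


def merge_category(orig, new):
    """Merge one resource category, refusing to overwrite existing names."""
    clashes = orig.keys() & new.keys()
    if clashes:
        raise ValueError(f"resource name conflict: {sorted(clashes)}")
    orig.update(new)


fonts = {"/F1": "Helvetica", "/F2": "Courier"}
name = fresh_resource_name(fonts)
merge_category(fonts, {name: "Times"})
```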

class pyhanko.pdf_utils.writer.PageObject(contents, media_box, resources=None)

Bases: DictionaryObject

Subclass of DictionaryObject that handles some of the initialisation boilerplate for page objects.

class pyhanko.pdf_utils.writer.PdfFileWriter(stream_xrefs=True, init_page_tree=True, info=None)

Bases: BasePdfFileWriter

Class to write new PDF files.

encrypt(owner_pass, user_pass=None, **kwargs)

Mark this document to be encrypted with PDF 2.0 encryption (AES-256).

Caution

While pyHanko supports legacy PDF encryption as well, the API to create new documents using outdated encryption is left largely undocumented on purpose to discourage its use.

This caveat does not apply to incremental updates added to existing documents.

Danger

The PDF 2.0 standard mandates AES-256 in CBC mode, and also includes 12 bytes of known plaintext by design. This implies that a sufficiently knowledgeable attacker can inject arbitrary content into your encrypted files without knowledge of the password.

Adding a digital signature to the encrypted document is not a foolproof way to deal with this either, since most viewers will still allow the document to be opened before signatures are validated, and therefore end users are still exposed to potentially malicious content.

Until the standard supports authenticated encryption schemes, you should never rely on its encryption provisions if tampering is a concern.

Parameters:
  • owner_pass – The desired owner password.

  • user_pass – The desired user password (defaults to the owner password if not specified).

  • kwargs – Other keyword arguments to be passed to StandardSecurityHandler.build_from_pw().

encrypt_pubkey(recipients: List[Certificate], **kwargs)

Mark this document to be encrypted with PDF 2.0 public key encryption. The certificates passed in should be RSA certificates.

PyHanko defaults to AES-256 to encrypt the actual file contents. The seed used to derive the file encryption key is also encrypted using AES-256 and bundled in a CMS EnvelopedData object. The envelope key is then encrypted separately for each recipient, using their respective public keys.

Caution

The caveats for encrypt() also apply here.

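Parameters:
  • recipients – Certificates of the recipients that should be able to decrypt the document.

  • kwargs – Other keyword arguments, passed on to the underlying security handler.
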
set_custom_trailer_entry(key: NameObject, value: PdfObject)

Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.

Warning

Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.

Parameters:
  • key – Dictionary key to use in the trailer.

  • value – Value to set

pyhanko.pdf_utils.writer.init_xobject_dictionary(command_stream: bytes, box_width, box_height, resources: DictionaryObject | None = None) StreamObject

Helper function to initialise form XObject dictionaries.

Note

For utilities to handle image XObjects, see images.

Parameters:
  • command_stream – The XObject’s raw appearance stream.

  • box_width – The width of the XObject’s bounding box.

  • box_height – The height of the XObject’s bounding box.

  • resources – A resource dictionary to include with the form object.

Returns:

A StreamObject representation of the form XObject.

pyhanko.pdf_utils.writer.copy_into_new_writer(input_handler: PdfHandler, writer_kwargs: dict | None = None) PdfFileWriter

Copy all objects in a given PDF handler into a new PdfFileWriter. This operation will attempt to preserve the document catalog of the original input_handler.

Very roughly, calling this function and then immediately invoking write() on the resulting writer should result in an equivalent document as far as presentation is concerned. As a general rule, behaviour that is controlled from outside the document catalog (e.g. encryption) or that requires byte-for-byte equivalence with the original (e.g. digital signatures) will not survive this translation.

Parameters:
  • input_handler – PdfHandler to source objects from.

  • writer_kwargs – Keyword arguments to pass to the writer.

Returns:

New PdfFileWriter containing all objects from the input handler.

pyhanko.pdf_utils.xref module

Internal utilities to handle the processing of cross-reference data and document trailer data.

This entire module is considered internal API.

class pyhanko.pdf_utils.xref.XRefCache(reader, xref_sections: List[XRefSection])

Bases: object

Internal class to parse & store information from the xref section(s) of a PDF document.

Stores both the most recent status of all xrefs and their historical values.

All members of this class are considered internal API and are subject to change without notice.

property total_revisions
get_last_change(ref: Reference)
object_streams_used_in(revision)
get_introducing_revision(ref: Reference)
get_xref_container_info(revision) XRefSectionMetaInfo
get_xref_data(revision) XRefSectionData
explicit_refs_in_revision(revision) Set[Reference]

Look up the object refs for all objects explicitly added or overwritten in a given revision.

Parameters:

revision – A revision number. The oldest revision is zero.

Returns:

A set of Reference objects.

refs_freed_in_revision(revision) Set[Reference]

Look up the object refs for all objects explicitly freed in a given revision.

Parameters:

revision – A revision number. The oldest revision is zero.

Returns:

A set of Reference objects.

get_startxref_for_revision(revision) int

Look up the location of the XRef table/stream associated with a specific revision, as indicated by startxref or /Prev.

Parameters:

revision – A revision number. The oldest revision is zero.

Returns:

An integer pointer

get_historical_ref(ref, revision) int | ObjStreamRef | None

Look up the location of the historical value of an object.

Note

This method is not suitable for determining whether or not a particular object ID is available in a given revision, since it treats unused objects and freed objects the same way.

Parameters:
  • ref – An object reference.

  • revision – A revision number. The oldest revision is zero.

Returns:

An integer offset, an object stream reference, or None if the reference does not resolve in the specified revision.

property hybrid_xrefs_present: bool

Determine if a file uses hybrid references anywhere.

Returns:

True if hybrid references were detected, False otherwise.

class pyhanko.pdf_utils.xref.XRefBuilder(handler: PdfHandler, stream, strict: bool, last_startxref: int)

Bases: object

err_limit = 10
read_xrefs()
class pyhanko.pdf_utils.xref.XRefType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Different types of cross-reference entries.

FREE = 1

A freeing instruction.

STANDARD = 2

A regular top-level object.

IN_OBJ_STREAM = 3

An object that is part of an object stream.

class pyhanko.pdf_utils.xref.XRefEntry(xref_type: XRefType, location: int | ObjStreamRef | None, idnum: int, generation: int = 0)

Bases: object

Value type representing a single cross-reference entry.

xref_type: XRefType

The type of cross-reference entry.

location: int | ObjStreamRef | None

Location the cross-reference points to.

idnum: int

The ID of the object being referenced.

generation: int = 0

The generation number of the object being referenced.

class pyhanko.pdf_utils.xref.ObjStreamRef(obj_stream_id: int, ix_in_stream: int)

Bases: object

Identifies an object that’s part of an object stream.

obj_stream_id: int

The ID number of the object stream (its generation number is presumed zero).

ix_in_stream: int

The index of the object in the stream.

exception pyhanko.pdf_utils.xref.ObjectHeaderReadError(msg: str, *args)

Bases: PdfReadError

class pyhanko.pdf_utils.xref.XRefSection(meta_info: XRefSectionMetaInfo, xref_data: XRefSectionData)

Bases: object

Describes a cross-reference section and how it is serialised into the PDF file.

meta_info: XRefSectionMetaInfo

Metadata about the cross-reference section.

xref_data: XRefSectionData

A description of the actual object pointer definitions.

class pyhanko.pdf_utils.xref.XRefSectionData

Bases: object

Internal class for bookkeeping on a single cross-reference section, independently of the others.

try_resolve(ref: Reference | IndirectObject) int | ObjStreamRef | None
process_entries(entries: Iterator[XRefEntry], strict: bool)
process_hybrid_entries(entries: Iterator[XRefEntry], xref_meta_info: XRefSectionMetaInfo, strict: bool)
higher_generation_refs()
class pyhanko.pdf_utils.xref.XRefSectionType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

STANDARD = 1
STREAM = 2
HYBRID_MAIN = 3
HYBRID_STREAM = 4
class pyhanko.pdf_utils.xref.XRefSectionMetaInfo(xref_section_type: pyhanko.pdf_utils.xref.XRefSectionType, size: int, declared_startxref: int, start_location: int, end_location: int, stream_ref: pyhanko.pdf_utils.generic.Reference | None)

Bases: object

xref_section_type: XRefSectionType

The type of cross-reference section.

size: int

The highest object ID in scope for this xref section.

declared_startxref: int

Location pointed to by the startxref pointer in that revision.

start_location: int

Actual start location of the xref data. This should be equal to declared_startxref, but in broken files that may not be the case.

end_location: int

Location where the xref data ended.

stream_ref: Reference | None

Reference to the relevant xref stream, if applicable.

class pyhanko.pdf_utils.xref.TrailerDictionary

Bases: PdfObject

The standard mandates that each trailer shall contain at least all keys used in the preceding trailer, even if unmodified. Of course, we cannot trust documents to actually follow this rule, so this class implements fallbacks.
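
The fallback behaviour can be pictured as a layered lookup in which the newest trailer wins and older trailers fill in whatever is missing. The sketch below uses collections.ChainMap on plain dictionaries. It is an illustration under assumptions and not the class's actual implementation: flatten_trailers and the sample data are hypothetical.

```python
from collections import ChainMap

# Trailer dictionaries in file order, oldest first (toy data).
trailer_revisions = [
    {"/Root": "1 0 R", "/Info": "2 0 R", "/Size": 10},
    {"/Root": "1 0 R", "/Size": 14},  # non-compliant: /Info was dropped
]


def flatten_trailers(revisions, revision=None):
    """Effective trailer at `revision` (default: latest), newest value wins."""
    if revision is None:
        revision = len(revisions) - 1
    # ChainMap searches its maps left to right, so put the newest first.
    return dict(ChainMap(*reversed(revisions[: revision + 1])))
```

With this data, the flattened view of the latest revision still exposes /Info from the first trailer even though the second one omits it.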

non_trailer_keys = {'/DecodeParms', '/Filter', '/Index', '/Length', '/Type', '/W', '/XRefStm'}
add_trailer_revision(trailer_dict: DictionaryObject)
raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT, revision=None)
flatten(revision=None) DictionaryObject
keys()
items()
write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

pyhanko.pdf_utils.xref.read_object_header(stream, strict)
pyhanko.pdf_utils.xref.parse_xref_stream(xref_stream: StreamObject, strict: bool = True) Iterator[XRefEntry]

Parse a single cross-reference stream and yield its entries one by one.

This is internal API.

Parameters:
  • xref_stream – A StreamObject.

  • strict – Boolean indicating whether we’re running in strict mode.

Returns:

A generator object yielding XRefEntry objects.
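
For orientation, the sketch below decodes the fixed-width binary records of an already decompressed cross-reference stream using its /W array. It is a simplified stand-in, not pyHanko's parser: decode_xref_stream_entries is a hypothetical name, /Index subsections are reduced to a single first_id argument, and the result is a plain tuple rather than an XRefEntry. In the PDF format, type 0 marks a free entry, type 1 an object at a byte offset, and type 2 an object inside an object stream.

```python
def decode_xref_stream_entries(data, widths, first_id=0):
    """Yield (idnum, entry_type, field2, field3) from decoded xref stream data.

    widths is the /W array, e.g. [1, 2, 1]; a width of 0 means the field is
    absent and takes its default (type 1 for the first field, 0 otherwise).
    """
    entry_len = sum(widths)
    if entry_len == 0 or len(data) % entry_len:
        raise ValueError("xref stream length does not match /W")
    for n, start in enumerate(range(0, len(data), entry_len)):
        fields = []
        pos = start
        for field_ix, width in enumerate(widths):
            if width == 0:
                fields.append(1 if field_ix == 0 else 0)
            else:
                # Fields are stored as big-endian unsigned integers.
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        yield (first_id + n, *fields)
```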

pyhanko.pdf_utils.xref.parse_xref_table(stream) Iterator[XRefEntry]

Parse a single cross-reference table and yield its entries one by one.

This is internal API.

Parameters:

stream – A file-like object positioned at the start of the cross-reference table.

Returns:

A generator object yielding XRefEntry objects.
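
A classical cross-reference table consists of subsection headers ("first_id count") followed by fixed-format entries. The sketch below parses that layout from a bytes object. It is a minimal illustration and not the function documented above: parse_classic_xref is a hypothetical name, it takes bytes rather than a stream, and it yields plain tuples instead of XRefEntry objects.

```python
import re

_SUBSECTION = re.compile(rb"^(\d+) (\d+)\s*$")
_ENTRY = re.compile(rb"^(\d{10}) (\d{5}) ([nf])\s*$")


def parse_classic_xref(table):
    """Yield (idnum, generation, offset_or_next_free, in_use) tuples.

    table is the bytes between the 'xref' keyword and the 'trailer' keyword.
    """
    next_id = None
    remaining = 0
    for line in table.splitlines():
        if not line.strip():
            continue
        if remaining == 0:
            # Start of a new subsection: "<first object ID> <entry count>".
            header = _SUBSECTION.match(line)
            if header is None:
                raise ValueError(f"expected subsection header, got {line!r}")
            next_id, remaining = int(header.group(1)), int(header.group(2))
            continue
        entry = _ENTRY.match(line)
        if entry is None:
            raise ValueError(f"malformed xref entry: {line!r}")
        value, generation, kind = entry.groups()
        yield next_id, int(generation), int(value), kind == b"n"
        next_id += 1
        remaining -= 1
```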

pyhanko.pdf_utils.xref.write_xref_table(stream, position_dict: Dict[Tuple[int, int], int])
class pyhanko.pdf_utils.xref.ObjectStream(compress=True)

Bases: object

Utility class to collect objects into a PDF object stream.

Object streams are mainly useful for space efficiency reasons. They allow related objects to be grouped & compressed together in a more flexible manner.

Warning

Object streams can only be used in files with a cross-reference stream, as opposed to a classical XRef table. In particular, this means that incremental updates to files with a legacy XRef table cannot contain object streams either. See § 7.5.7 in ISO 32000-1 for further details.

Danger

Use BasePdfFileWriter.prepare_object_stream() to create instances of object streams. The __init__ function is internal API.

add_object(idnum: int, obj: PdfObject)

Add an object to an object stream. Note that objects in object streams always have their generation number set to 0 by definition.

Parameters:
  • idnum – The object’s ID number.

  • obj – The object to embed into the object stream.

Raises:

TypeError – Raised if obj is an instance of StreamObject or IndirectObject.

as_pdf_object() StreamObject

Render the object stream to a PDF stream object.

Returns:

An instance of StreamObject.
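
To make the layout of an object stream more concrete, the sketch below assembles the uncompressed payload by hand: a header of "object number, offset" integer pairs followed by the serialised objects, with offsets counted from the first object. It is an illustration only. build_object_stream is a hypothetical helper, the objects are assumed to be serialised to bytes already, and compression and the surrounding stream dictionary are left out. The returned n and first correspond to the /N and /First entries of the stream dictionary.

```python
def build_object_stream(objects):
    """Return (payload, n, first) for a dict of idnum -> serialised bytes.

    The payload starts with 'idnum offset' integer pairs, followed by the
    object bodies; offsets are relative to the first body.
    """
    header_parts = []
    bodies = []
    offset = 0
    for idnum, body in objects.items():
        header_parts.append(b"%d %d" % (idnum, offset))
        bodies.append(body)
        offset += len(body) + 1  # +1 for the newline separator
    header = b" ".join(header_parts) + b"\n"
    payload = header + b"\n".join(bodies)
    return payload, len(objects), len(header)


payload, n, first = build_object_stream(
    {4: b"<< /Type /Font >>", 7: b"[1 2 3]"}
)
```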

class pyhanko.pdf_utils.xref.XRefStream(position_dict: Dict[Tuple[int, int], int | Tuple[int, int]])

Bases: StreamObject

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters:
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

Module contents