pyhanko.pdf_utils.generic module

Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.

See here for the original license of the PyPDF2 project.

class pyhanko.pdf_utils.generic.Dereferenceable

Bases: object

Represents an opaque reference to a PDF object associated with a PDF Handler (see PdfHandler).

This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: Optional[object] = None)

Bases: Dereferenceable

A reference to an object with a certain ID and generation number, with a PDF handler attached to it.

Warning

Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.

Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.
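The override behaviour described above can be pictured with a toy model. The sketch below does not use pyHanko: `revisions` and `resolve` are hypothetical names, and each revision is assumed to be a plain dict keyed by (object ID, generation number).

```python
from typing import Dict, List, Tuple

# Hypothetical model: each revision maps (idnum, generation) -> object value.
ObjKey = Tuple[int, int]

revisions: List[Dict[ObjKey, str]] = [
    {(4, 0): "<< /Foo (Bar) >>", (5, 0): "(unchanged)"},  # original revision
    {(4, 0): "<< /Foo (Baz) >>"},  # incremental update overriding object 4
]


def resolve(idnum: int, generation: int = 0) -> str:
    """Return the most recent definition of the given object."""
    # Walk from newest to oldest; the first hit wins.
    for revision in reversed(revisions):
        if (idnum, generation) in revision:
            return revision[(idnum, generation)]
    raise KeyError((idnum, generation))


latest = resolve(4)      # overridden in the update, generation is still 0
untouched = resolve(5)   # only defined in the original revision
```

Note that the generation number never changes in this model, which matches the common real-world pattern mentioned in the warning.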

idnum: int

The object’s ID.

generation: int = 0

The object’s generation number (usually 0)

pdf: object = None

The PDF handler associated with this reference, an instance of PdfHandler.

Warning

This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.
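To see why mixing references from unrelated handlers is risky, consider this minimal sketch. `ToyReference` is a hypothetical stand-in built with the standard library, not pyHanko's class; it only assumes that the handler field is excluded from comparison and hashing, as stated above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ToyReference:
    """Hypothetical stand-in for a reference; not pyHanko's class."""
    idnum: int
    generation: int = 0
    # Excluded from __eq__ and __hash__, like the pdf field described above.
    pdf: Optional[object] = field(
        default=None, compare=False, hash=False, repr=False
    )


handler_a, handler_b = object(), object()
ref_a = ToyReference(4, 0, handler_a)
ref_b = ToyReference(4, 0, handler_b)

same = ref_a == ref_b                  # the handlers are not compared
lookup = {ref_a: "object from document A"}
collision = lookup[ref_b]              # document B's reference finds A's entry
```

In this model, a dictionary or set keyed by references silently conflates objects from two different documents.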

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.TrailerReference(reader)

Bases: Dereferenceable

A reference to the trailer of a PDF document.

Warning

Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.

Parameters

reader – a PdfFileReader

get_object() PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.PdfObject

Bases: object

Superclass for all PDF objects.

container_ref: Dereferenceable = None

For objects read from a file, container_ref points to the unique addressable object containing this object.

Note

Consider the following object definition in a PDF file:

4 0 obj
<< /Foo (Bar) >>
endobj

This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.

If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.

get_container_ref() Dereferenceable

Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.

get_object()

Resolves indirect references.

Returns

self, unless an instance of IndirectObject.

write_to_stream(stream, handler=None, container_ref: Optional[Reference] = None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)

Bases: PdfObject, Dereferenceable

Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.

Warning

For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.

get_object()

Returns

The PDF object this reference points to.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

property idnum: int

Returns

the object ID of this reference.

property generation

Returns

the generation number of this reference.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: Dereferenceable)

class pyhanko.pdf_utils.generic.NullObject

Bases: PdfObject

PDF null object.

All instances are treated as equal and falsy.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.BooleanObject(value)

Bases: PdfObject

PDF boolean value.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.FloatObject(value='0', context=None)

Bases: Decimal, PdfObject

PDF Float object.

Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).
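The practical effect of a decimal representation can be shown with the standard library alone. This sketch uses plain `decimal.Decimal` values, not `FloatObject` itself, so it only illustrates the underlying arithmetic.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2

# Decimals keep the digits as written, so the sum is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")

float_exact = float_sum == 0.3
decimal_exact = decimal_sum == Decimal("0.3")

# Converting to a Python float may lose that exactness again.
as_float = float(decimal_sum)
```

Values read from a file therefore keep the digits as written, as long as they stay in decimal form.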

as_numeric()

Returns

a Python float value for this object.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NumberObject(value)

Bases: int, PdfObject

PDF number object. This is the PDF type for integer values.

NumberPattern = re.compile(b'[^+-.0-9]')

ByteDot = b'.'

as_numeric()

Returns

a Python int value for this object.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.ByteStringObject

Bases: bytes, PdfObject

PDF bytestring class.

property original_bytes

For compatibility with TextStringObject.original_bytes

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.TextStringObject

Bases: str, PdfObject

PDF text string object.

autodetected_encoding: Optional[TextStringEncoding] = None

Autodetected encoding when parsing the file.

force_output_encoding: Optional[TextStringEncoding] = None

Output encoding to use when serialising the string. The default is to try PDFDocEncoding first, and fall back to UTF-16BE.

property original_bytes

Retrieve the original bytes of the string as specified in the source file.

This may be necessary if this string was misidentified as a text string.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NameObject

Bases: str, PdfObject

PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.

DELIMITER_PATTERN = re.compile(b'\\s|[\\(\\)<>\\[\\]{}/%]|\x00')
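As an illustration of what this pattern matches, the sketch below copies it verbatim and uses it to find where a name token ends. The helper `name_token_end` is hypothetical and is not part of pyHanko; the library's own parsing logic may differ.

```python
import re

# Pattern copied from the attribute above.
DELIMITER_PATTERN = re.compile(b'\\s|[\\(\\)<>\\[\\]{}/%]|\x00')


def name_token_end(data: bytes, start: int = 1) -> int:
    """Index of the first delimiter at or after start, or len(data) if none."""
    match = DELIMITER_PATTERN.search(data, start)
    return match.start() if match else len(data)


raw = b"/Type /Catalog"
end = name_token_end(raw)   # start at index 1 to skip the leading '/'
token = raw[:end]
```

Whitespace, brackets, a slash, a percent sign or a NUL byte all terminate the token in this sketch.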
write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)

class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)

Bases: list, PdfObject

PDF array object. This class extends from Python’s list class, and supports its interface.

Warning

Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability & consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.

That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.

raw_get(index, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)

Changed in version 0.14.0: decrypt parameter is no longer boolean

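Get a value from an array without dereferencing. In other words, if the value at the given index is of type IndirectObject, the indirect reference will not be resolved.

Parameters
  • index – The index of the entry to retrieve.

  • decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.

Returns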

A PdfObject.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref)

class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)

Bases: dict, PdfObject

A PDF dictionary object.

Keys in a PDF dictionary are PDF names, and values are PDF objects.

When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.

raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)

Changed in version 0.14.0: decrypt parameter is no longer boolean

Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.

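Parameters
  • key – The key to look up in the dictionary.

  • decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.

Returns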

A PdfObject.
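The difference between item access and raw_get() can be pictured with a toy model. `ToyIndirect` and `ToyDictionary` below are hypothetical classes written for illustration only; they do not reproduce pyHanko's implementation and ignore encryption entirely.

```python
class ToyIndirect:
    """Hypothetical indirect reference that resolves through a lookup table."""

    def __init__(self, idnum, objects):
        self.idnum = idnum
        self._objects = objects

    def get_object(self):
        return self._objects[self.idnum]


class ToyDictionary(dict):
    """Hypothetical dict that dereferences on item access."""

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)
        # Resolve indirect references transparently.
        return value.get_object() if isinstance(value, ToyIndirect) else value

    def raw_get(self, key):
        # Return the stored value as-is, without resolving it.
        return dict.__getitem__(self, key)


objects = {7: "page tree"}
catalog = ToyDictionary({"/Pages": ToyIndirect(7, objects)})

resolved = catalog["/Pages"]       # the referenced object
raw = catalog.raw_get("/Pages")    # still the indirect reference
```

The raw form is the one to use when the reference itself matters, for example to learn which object ID an entry points to.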

setdefault(key, value=None)

Insert key with the given value if key is not in the dictionary.

Return the value for key if key is in the dictionary, else value.

get_and_apply(key, function: Callable[[PdfObject], Any], *, raw=False, default=None)

get_value_as_reference(key, optional=False) Reference

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False)

class pyhanko.pdf_utils.generic.StreamObject(dict_data=None, stream_data=None, encoded_data=None, handler=None)

Bases: DictionaryObject

PDF stream object.

Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).

A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.

Note

The StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.

Parameters
  • dict_data – The dictionary data for this stream object.

  • stream_data – The (unencoded) stream data.

  • encoded_data

    The encoded stream data.

    Warning

    Ordinarily, a stream can be initialised either from decoded or from encoded data.

    If both stream_data and encoded_data are provided, the caller is responsible for making sure that both are compatible given the currently relevant filter configuration.

  • handler – A reference to the currently active pyhanko.pdf_utils.crypt.SecurityHandler. This is only necessary if the stream requires crypt filters.

add_crypt_filter(name='/Identity', params=None, handler=None)

strip_filters()

Ensure the stream is decoded, and remove any filters.

property data: bytes

Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.

Raises

.misc.PdfStreamError – If the stream could not be decoded.

property encoded_data: bytes

Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.

Raises

.misc.PdfStreamError – If the stream could not be encoded.

apply_filter(filter_name, params=None, allow_duplicates: Optional[bool] = True)

Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that it is placed last in the encoding order, but first in the decoding order.

Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.

Parameters
  • filter_name – Name of the filter (see DECODERS)

  • params – Parameters to the filter (will be written to /DecodeParms if not None)

  • allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.
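The ordering rule for prepended filters can be sketched with standard-library codecs. `CODECS`, `encode` and `decode` are hypothetical names used only for this illustration; the codec table is not pyHanko's DECODERS, and the hex codec omits the end-of-data marker that a real /ASCIIHexDecode stream would carry.

```python
import binascii
import zlib

# Hypothetical codec table: name -> (encode, decode). Not pyHanko's DECODERS.
CODECS = {
    "/FlateDecode": (zlib.compress, zlib.decompress),
    "/ASCIIHexDecode": (binascii.hexlify, binascii.unhexlify),
}


def encode(data: bytes, filters: list) -> bytes:
    # Encoding walks the filter list back to front.
    for name in reversed(filters):
        data = CODECS[name][0](data)
    return data


def decode(data: bytes, filters: list) -> bytes:
    # Decoding walks the filter list front to back.
    for name in filters:
        data = CODECS[name][1](data)
    return data


filters = ["/FlateDecode"]
filters.insert(0, "/ASCIIHexDecode")   # a newly applied filter is prepended

payload = b"hello " * 10
encoded = encode(payload, filters)     # Flate first, hex last
roundtrip = decode(encoded, filters)   # hex first, Flate last
```

Because the hex filter was added last, the encoded bytes are plain hexadecimal text wrapping the compressed data.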

compress()

Convenience method to add a /FlateDecode filter with default settings, if one is not already present.

Note: compression is not actually applied until the stream is written.

property is_embedded_file_stream

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

pyhanko.pdf_utils.generic.read_object(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False) PdfObject

Read a PDF object from an input stream.

Note

The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here for further information.

Parameters
  • stream – An input stream.

  • container_ref

    A reference to an object containing this one.

    Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns

A PdfObject.

pyhanko.pdf_utils.generic.pdf_name

alias of NameObject

pyhanko.pdf_utils.generic.pdf_string(string: Union[str, bytes, bytearray]) Union[ByteStringObject, TextStringObject]

Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.

Parameters

string – A Python string.

pyhanko.pdf_utils.generic.pdf_date(dt: datetime) TextStringObject

Convert a datetime object into a PDF string. This function supports both timezone-aware and naive datetime objects.

Parameters

dt – The datetime object to convert.

Returns

A TextStringObject representing the datetime passed in.
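For orientation, the PDF date format (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset) can be produced with the standard library as sketched below. `format_pdf_date` is a hypothetical helper that returns a plain `str`; it is an assumption about the general format, not a copy of pdf_date(), whose exact output may differ.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt: datetime) -> str:
    """Hypothetical helper: render dt as a PDF date string (plain str)."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        return base                    # naive datetime: no offset suffix
    if not offset:
        return base + "Z"              # zero offset, i.e. UTC
    sign = "-" if offset < timedelta(0) else "+"
    hours, remainder = divmod(abs(offset), timedelta(hours=1))
    minutes = remainder // timedelta(minutes=1)
    return "%s%s%02d'%02d'" % (base, sign, hours, minutes)


moment = datetime(2020, 12, 26, 15, 5, 11)
naive = format_pdf_date(moment)
utc = format_pdf_date(moment.replace(tzinfo=timezone.utc))
with_offset = format_pdf_date(
    moment.replace(tzinfo=timezone(timedelta(hours=-5, minutes=-30)))
)
```

The naive case simply omits the offset suffix, which leaves the interpretation of the time zone to the reader of the file.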

class pyhanko.pdf_utils.generic.TextStringEncoding(value)

Bases: Enum

Encodings for PDF text strings.

PDF_DOC = None

PDFDocEncoding (one-byte character codes; PDF-specific).

UTF16BE = (b'\xfe\xff', 'utf-16be')

UTF-16BE encoding.

UTF8 = (b'\xef\xbb\xbf', 'utf-8')

UTF-8 encoding (PDF 2.0)

UTF16LE = (b'\xff\xfe', 'utf-16le')

UTF-16LE encoding.

Note

This is strictly speaking invalid in PDF 2.0, but some authoring tools output such strings anyway (presumably due to the fact that it’s the default wide character encoding on Windows).

encode(string: str) bytes

Encode a string with BOM.

Parameters

string – The string to encode.

Returns

The encoded string.

decode(string: Union[bytes, bytearray]) str

Decode a string with BOM.

Parameters

string – The string to decode.

Returns

The decoded string.

Raises

UnicodeDecodeError – Raised if decoding fails.
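The BOM-based scheme behind the enum members above can be sketched as follows. `ToyEncoding` and `detect` are hypothetical and only mirror the member values listed in this section; PDF_DOC is left out because the standard library has no PDFDocEncoding codec, and the real class may handle edge cases differently.

```python
from enum import Enum


class ToyEncoding(Enum):
    """Hypothetical mirror of the BOM-based members above (PDF_DOC omitted)."""
    UTF16BE = (b"\xfe\xff", "utf-16be")
    UTF8 = (b"\xef\xbb\xbf", "utf-8")
    UTF16LE = (b"\xff\xfe", "utf-16le")

    def encode(self, string: str) -> bytes:
        bom, codec = self.value
        return bom + string.encode(codec)

    def decode(self, data: bytes) -> str:
        bom, codec = self.value
        # Strip the BOM before decoding the payload.
        return data[len(bom):].decode(codec)


def detect(data: bytes):
    """Return the first encoding whose BOM prefixes data, else None."""
    for enc in ToyEncoding:
        if data.startswith(enc.value[0]):
            return enc
    return None


encoded = ToyEncoding.UTF16BE.encode("héllo")
detected = detect(encoded)
decoded = detected.decode(encoded)
```

A string without any of these byte order marks would fall through to `None` here, which corresponds to the PDFDocEncoding case in the real enum.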

class pyhanko.pdf_utils.generic.EncryptedObjAccess(value)

Bases: Enum

Defines what to do when an encrypted object is encountered when retrieving an object from a container.

PROXY = 0

Return the proxy object as-is, and leave further encryption/decryption handling to the caller.

TRANSPARENT = 1

Transparently decrypt the proxy’s content (similarly wrapping any sub-containers in DecryptedObjectProxy, so this applies recursively).

Note

This is the default in most situations, since it’s the least likely to get in the way of any APIs that are not explicitly aware of content encryption concerns.

RAW = 2

Return the underlying raw object as written, without attempting or deferring decryption.

class pyhanko.pdf_utils.generic.DecryptedObjectProxy(raw_object: PdfObject, handler)

Bases: PdfObject

Internal proxy class that allows transparent on-demand decryption of objects.

Warning

Most public-facing APIs won’t leave you to deal with these directly (that’s half the reason this class exists in the first place), and the API of this class is considered internal.

However, for reasons related to the historical PyPDF2 codebase from which pyHanko’s object handling code ultimately derives, there are some Python builtins that might cause these wrapper objects to inadvertently “leak”. Please tell us about such cases so we can make those types of access more convenient and robust.

Danger

The __eq__ implementation on this class is not safe for general use, due to the fact that certain structures in PDF are exempt from encryption. Only compare proxy objects with == in areas of the document where these exemptions don’t apply.

Parameters
  • raw_object – A raw object, typically as-parsed from a PDF file.

  • handler – The security handler governing this object.

raw_object: PdfObject

The underlying raw object, in its encrypted state.

property decrypted: PdfObject

The decrypted PDF object exposed as a property.

If this object is a container object, its constituent parts will be wrapped in DecryptedObjectProxy as well, in order to defer further decryption until the values are requested through a getter method on the container.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

get_object()

Resolves indirect references.

Returns

self, unless an instance of IndirectObject.

property container_ref