pyhanko.pdf_utils.generic module

Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.

See here for the original license of the PyPDF2 project.

class pyhanko.pdf_utils.generic.Dereferenceable

Bases: object

Represents an opaque reference to a PDF object associated with a PDF Handler (see PdfHandler).

This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).

get_object()pyhanko.pdf_utils.generic.PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: Optional[object] = None)

Bases: pyhanko.pdf_utils.generic.Dereferenceable

A reference to an object with a certain ID and generation number, with a PDF handler attached to it.

Warning

Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.

Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.

idnum: int

The object’s ID.

generation: int = 0

The object’s generation number (usually 0)

pdf: object = None

The PDF handler associated with this reference, an instance of PdfHandler.

Warning

This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.

get_object()pyhanko.pdf_utils.generic.PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.TrailerReference(reader)

Bases: pyhanko.pdf_utils.generic.Dereferenceable

A reference to the trailer of a PDF document.

Warning

Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.

get_object()pyhanko.pdf_utils.generic.PdfObject

Retrieve the PDF object backing this dereferenceable.

Returns

A PdfObject.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

class pyhanko.pdf_utils.generic.PdfObject

Bases: object

Superclass for all PDF objects.

container_ref: pyhanko.pdf_utils.generic.Dereferenceable = None

For objects read from a file, container_ref points to the unique addressable object containing this object.

Note

Consider the following object definition in a PDF file:

4 0 obj
<< /Foo (Bar) >>

This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.

If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.

get_container_ref()pyhanko.pdf_utils.generic.Dereferenceable

Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.

get_object()

Resolves indirect references.

Returns

self, unless an instance of IndirectObject.

write_to_stream(stream, handler=None, container_ref: Optional[pyhanko.pdf_utils.generic.Reference] = None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)

Bases: pyhanko.pdf_utils.generic.PdfObject, pyhanko.pdf_utils.generic.Dereferenceable

Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.

Warning

For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.

get_object()
Returns

The PDF object this reference points to.

get_pdf_handler()

Return the PDF handler associated with this dereferenceable.

Returns

a PdfHandler.

property idnum: int
Returns

the object ID of this reference.

property generation
Returns

the generation number of this reference.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable)
class pyhanko.pdf_utils.generic.NullObject

Bases: pyhanko.pdf_utils.generic.PdfObject

PDF null object.

All instances are treated as equal and falsy.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)
class pyhanko.pdf_utils.generic.BooleanObject(value)

Bases: pyhanko.pdf_utils.generic.PdfObject

PDF boolean value.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)
class pyhanko.pdf_utils.generic.FloatObject(value='0', context=None)

Bases: decimal.Decimal, pyhanko.pdf_utils.generic.PdfObject

PDF Float object.

Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).

as_numeric()
Returns

a Python float value for this object.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NumberObject(value)

Bases: int, pyhanko.pdf_utils.generic.PdfObject

PDF number object. This is the PDF type for integer values.

NumberPattern = re.compile(b'[^+-.0-9]')
ByteDot = b'.'
as_numeric()
Returns

a Python int value for this object.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream)
class pyhanko.pdf_utils.generic.ByteStringObject

Bases: bytes, pyhanko.pdf_utils.generic.PdfObject

PDF bytestring class.

property original_bytes

For compatibility with TextStringObject.original_bytes

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.TextStringObject

Bases: str, pyhanko.pdf_utils.generic.PdfObject

PDF text string object.

autodetect_pdfdocencoding = False

If True, this string was determined to be encoded in PDFDoc encoding.

autodetect_utf16 = False

If True, this string was determined to be encoded in UTF16-BE encoding.

property original_bytes

Retrieve the original bytes of the string as specified in the source file.

This may be necessary if this string was misidentified as a text string.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

class pyhanko.pdf_utils.generic.NameObject

Bases: str, pyhanko.pdf_utils.generic.PdfObject

PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.

DELIMITER_PATTERN = re.compile(b'\\s+|[\\(\\)<>\\[\\]{}/%]')
write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, strict=True)
class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)

Bases: list, pyhanko.pdf_utils.generic.PdfObject

PDF array object. This class extends from Python’s list class, and supports its interface.

Warning

Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability & consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.

That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.

raw_get(index, decrypt=True)

Get a value from an array without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.

Parameters
  • index – Key to look up in the dictionary.

  • decrypt – If False, instances of DecryptedObjectProxy will be returned as-is. If True, they will be decrypted. Default True.

Returns

A PdfObject.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref)
class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)

Bases: dict, pyhanko.pdf_utils.generic.PdfObject

A PDF dictionary object.

Keys in a PDF dictionary are PDF names, and values are PDF objects.

When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.

raw_get(key, decrypt=True)

Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.

Parameters
  • key – Key to look up in the dictionary.

  • decrypt – If False, instances of DecryptedObjectProxy will be returned as-is. If True, they will be decrypted. Default True.

Returns

A PdfObject.

setdefault(key, value=None)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

get_and_apply(key, function: Callable[[pyhanko.pdf_utils.generic.PdfObject], Any], *, raw=False, default=None)
get_value_as_reference(key, optional=False)pyhanko.pdf_utils.generic.Reference
write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

static read_from_stream(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable)
class pyhanko.pdf_utils.generic.StreamObject(dict_data=None, stream_data=None, encoded_data=None, handler=None)

Bases: pyhanko.pdf_utils.generic.DictionaryObject

PDF stream object.

Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).

A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.

Note that the StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.

add_crypt_filter(name='/Identity', params=None, handler=None)
strip_filters()

Ensure the stream is decoded, and remove any filters.

property data: bytes

Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.

Raises

misc.PdfStreamError – If the stream could not be decoded.

property encoded_data: bytes

Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.

Raises

misc.PdfStreamError – If the stream could not be encoded.

apply_filter(filter_name, params=None, allow_duplicates: Optional[bool] = True)

Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that is is placed last in the encoding order, but first in the decoding order.

Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.

Parameters
  • filter_name – Name of the filter (see DECODERS)

  • params – Parameters to the filter (will be written to /DecodeParms if not None)

  • allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.

compress()

Convenience method to add a /FlateDecode filter with default settings, if one is not already present.

Note: compression is not actually applied until the stream is written.

write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • container_ref – Local encryption key.

  • handler – Security handler

pyhanko.pdf_utils.generic.read_object(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable)pyhanko.pdf_utils.generic.PdfObject

Read a PDF object from an input stream.

Note

The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here here for further information.

Parameters
  • stream – An input stream.

  • container_ref

    A reference to an object containing this one.

    Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.

Returns

A PdfObject.

pyhanko.pdf_utils.generic.pdf_name

alias of pyhanko.pdf_utils.generic.NameObject

pyhanko.pdf_utils.generic.pdf_string(string)Union[pyhanko.pdf_utils.generic.ByteStringObject, pyhanko.pdf_utils.generic.TextStringObject]

Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.

Parameters

string – A Python string.

pyhanko.pdf_utils.generic.pdf_date(dt: datetime.datetime)pyhanko.pdf_utils.generic.TextStringObject

Convert a datetime object into a PDF string. This funciton supports both timezone-aware and naive datetimes.

Parameters

dt – The datetime object to convert.

Returns

A TextStringObject representing the datetime passed in.