pyhanko.pdf_utils.generic module
Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.
See here for the original license of the PyPDF2 project.
- class pyhanko.pdf_utils.generic.Dereferenceable
Bases: object
Represents an opaque reference to a PDF object associated with a PDF handler (see PdfHandler).
This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns
a PdfHandler.
- class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: Optional[object] = None)
Bases: Dereferenceable
A reference to an object with a certain ID and generation number, with a PDF handler attached to it.
Warning
Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.
Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.
- idnum: int
The object’s ID.
- generation: int = 0
The object’s generation number (usually 0)
- pdf: object = None
The PDF handler associated with this reference, an instance of PdfHandler.
Warning
This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns
a PdfHandler.
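The comparison rule described for the pdf field of Reference can be modelled with a standard-library dataclass. This is only a sketch: SketchReference and the two placeholder handlers are hypothetical names, not pyHanko's classes, and the snippet only illustrates why mixing references from unrelated handlers goes unnoticed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SketchReference:
    """Hypothetical stand-in for Reference; not pyHanko's class."""
    idnum: int
    generation: int = 0
    # Excluded from __eq__ and __hash__, mirroring the documented behaviour.
    pdf: Optional[object] = field(default=None, compare=False, hash=False)


handler_a = object()  # placeholder for one PDF handler
handler_b = object()  # placeholder for an unrelated PDF handler

ref_a = SketchReference(4, 0, handler_a)
ref_b = SketchReference(4, 0, handler_b)

same = ref_a == ref_b            # the handler takes no part in the comparison
collapsed = len({ref_a, ref_b})  # both references land in the same set entry
```

In this model the two references compare equal and collapse into a single set entry even though they belong to different handlers, so a dictionary keyed on such references cannot tell the documents apart.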
- class pyhanko.pdf_utils.generic.TrailerReference(reader)
Bases: Dereferenceable
A reference to the trailer of a PDF document.
Warning
Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.
- Parameters
reader – a PdfFileReader
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns
a PdfHandler.
- class pyhanko.pdf_utils.generic.PdfObject
Bases: object
Superclass for all PDF objects.
- container_ref: Dereferenceable = None
For objects read from a file, container_ref points to the unique addressable object containing this object.
Note
Consider the following object definition in a PDF file:
4 0 obj << /Foo (Bar) >>
This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.
If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.
- get_container_ref() → Dereferenceable
Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.
- get_object()
Resolves indirect references.
- Returns
self, unless an instance of IndirectObject.
- class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)
Bases: PdfObject, Dereferenceable
Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.
Warning
For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.
- get_object()
- Returns
The PDF object this reference points to.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns
a PdfHandler.
- property idnum: int
- Returns
the object ID of this reference.
- property generation
- Returns
the generation number of this reference.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: Dereferenceable)
- class pyhanko.pdf_utils.generic.NullObject
Bases: PdfObject
PDF null object.
All instances are treated as equal and falsy.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
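The "equal and falsy" behaviour stated for NullObject can be pictured with a small stand-alone class. SketchNull is a hypothetical name, and the code is an illustration of the semantics rather than pyHanko's implementation.

```python
class SketchNull:
    """Hypothetical model of a PDF null value; not pyHanko's NullObject."""

    def __eq__(self, other):
        # Any two instances compare equal, whatever their identity.
        return isinstance(other, SketchNull)

    def __hash__(self):
        # Equal objects must hash equally.
        return hash(None)

    def __bool__(self):
        # Instances are falsy, like Python's None.
        return False


first, second = SketchNull(), SketchNull()
are_equal = first == second
is_truthy = bool(first)
```

One consequence of falsiness is that a plain truth test on a value cannot distinguish a null from other falsy values such as an empty list, so an explicit isinstance check is the safer way to detect nulls.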
- class pyhanko.pdf_utils.generic.BooleanObject(value)
Bases: PdfObject
PDF boolean value.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.FloatObject(value='0', context=None)
Bases: Decimal, PdfObject
PDF Float object.
Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).
- as_numeric()
- Returns
a Python float value for this object.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
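To see why treating PDF real numbers as decimals matters, the following sketch compares Python's Decimal with binary floats. It uses only the standard library and does not touch FloatObject itself; the variable names are illustrative.

```python
from decimal import Decimal

# Decimal keeps the digits exactly as they were written in the source text.
exact_sum = Decimal("0.1") + Decimal("0.2")

# Binary floats cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2

decimal_matches = exact_sum == Decimal("0.3")
float_matches = float_sum == 0.3

# Converting to float, as as_numeric() is documented to do, gives up that exactness.
converted = float(exact_sum)
```

The decimal comparison holds while the float comparison does not, which is the usual reason to stay in decimal arithmetic until a float is actually required.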
- class pyhanko.pdf_utils.generic.NumberObject(value)
Bases: int, PdfObject
PDF number object. This is the PDF type for integer values.
- NumberPattern = re.compile(b'[^+-.0-9]')
- ByteDot = b'.'
- as_numeric()
- Returns
a Python int value for this object.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.ByteStringObject
Bases: bytes, PdfObject
PDF bytestring class.
- property original_bytes
For compatibility with TextStringObject.original_bytes.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.TextStringObject
Bases: str, PdfObject
PDF text string object.
- autodetected_encoding: Optional[TextStringEncoding] = None
Autodetected encoding when parsing the file.
- force_output_encoding: Optional[TextStringEncoding] = None
Output encoding to use when serialising the string. The default is to try PDFDocEncoding first, and fall back to UTF-16BE.
- property original_bytes
Retrieve the original bytes of the string as specified in the source file.
This may be necessary if this string was misidentified as a text string.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.NameObject
Bases: str, PdfObject
PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.
- DELIMITER_PATTERN = re.compile(b'\\s|[\\(\\)<>\\[\\]{}/%]|\x00')
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
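The DELIMITER_PATTERN attribute listed for NameObject can be tried out directly with the standard re module. The sketch below copies the pattern as shown and uses a hypothetical helper, split_name_token, to cut a name token off the front of a byte string. It is not how read_from_stream is implemented, and it ignores #xx escapes in names.

```python
import re

# Pattern copied from the DELIMITER_PATTERN attribute shown above.
DELIMITER_PATTERN = re.compile(b'\\s|[\\(\\)<>\\[\\]{}/%]|\x00')


def split_name_token(data: bytes):
    """Hypothetical helper: split a leading /Name token off at the next delimiter."""
    if not data.startswith(b'/'):
        raise ValueError("a PDF name starts with '/'")
    # Search from index 1 so the leading solidus is not taken as the end.
    match = DELIMITER_PATTERN.search(data, 1)
    end = match.start() if match else len(data)
    return data[:end], data[end:]


name, rest = split_name_token(b'/Type/Page')
name2, rest2 = split_name_token(b'/Kids [1 0 R]')
```

Here the first call stops at the solidus that opens the next name, and the second stops at the space before the array.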
- class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)
Bases: list, PdfObject
PDF array object. This class extends Python’s list class and supports its interface.
Warning
Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability and consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.
That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.
- raw_get(index, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)
Changed in version 0.14.0: decrypt parameter is no longer boolean
Get a value from an array without dereferencing. In other words, if the value at the given index is of type IndirectObject, the indirect reference will not be resolved.
- Parameters
index – Index to look up in the array.
decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.
- Returns
A PdfObject.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref)
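The difference between ordinary item access and raw_get on ArrayObject can be sketched with a plain list subclass. SketchIndirect and SketchArray are hypothetical names, the lookup table stands in for a PDF handler, and the decrypt parameter is left out entirely. This models the documented behaviour and is not pyHanko's code.

```python
class SketchIndirect:
    """Hypothetical indirect reference that resolves through a lookup table."""

    def __init__(self, idnum, objects):
        self.idnum = idnum
        self._objects = objects

    def get_object(self):
        return self._objects[self.idnum]


class SketchArray(list):
    """Hypothetical list subclass modelling the documented access rules."""

    def __getitem__(self, index):
        value = super().__getitem__(index)
        # Resolve indirect references on ordinary item access.
        if isinstance(value, SketchIndirect):
            return value.get_object()
        return value

    def raw_get(self, index):
        # Return the stored entry untouched, reference or not.
        return super().__getitem__(index)


objects = {7: "resolved value"}
array = SketchArray([SketchIndirect(7, objects), 42])

resolved = array[0]
raw = array.raw_get(0)
```

Indexing yields the target object, while raw_get hands back the reference itself, which is what a caller needs when the reference (not its target) has to be copied or compared.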
- class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)
Bases: dict, PdfObject
A PDF dictionary object.
Keys in a PDF dictionary are PDF names, and values are PDF objects.
When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.
- raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)
Changed in version 0.14.0: decrypt parameter is no longer boolean
Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.
- Parameters
key – Key to look up in the dictionary.
decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.
- Returns
A PdfObject.
- setdefault(key, value=None)
Insert key with the given value if key is not in the dictionary.
Return the value for key if key is in the dictionary, else the given value.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False)
- class pyhanko.pdf_utils.generic.StreamObject(dict_data=None, stream_data=None, encoded_data=None, handler=None)
Bases: DictionaryObject
PDF stream object.
Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).
A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.
Note
The StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.
- Parameters
dict_data – The dictionary data for this stream object.
stream_data – The (unencoded) stream data.
encoded_data – The encoded stream data.
Warning
Ordinarily, a stream can be initialised either from decoded or from encoded data.
If both stream_data and encoded_data are provided, the caller is responsible for making sure that both are compatible given the currently relevant filter configuration.
handler – A reference to the currently active pyhanko.pdf_utils.crypt.SecurityHandler. This is only necessary if the stream requires crypt filters.
- add_crypt_filter(name='/Identity', params=None, handler=None)
- strip_filters()
Ensure the stream is decoded, and remove any filters.
- property data: bytes
Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.
- Raises
.misc.PdfStreamError – If the stream could not be decoded.
- property encoded_data: bytes
Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.
- Raises
.misc.PdfStreamError – If the stream could not be encoded.
- apply_filter(filter_name, params=None, allow_duplicates: Optional[bool] = True)
Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that it is placed last in the encoding order, but first in the decoding order.
Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.
- Parameters
filter_name – Name of the filter (see DECODERS)
params – Parameters to the filter (will be written to /DecodeParms if not None)
allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.
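The ordering rule for apply_filter (a prepended filter is last when encoding and first when decoding) can be checked with a stdlib-only sketch. The tables SKETCH_ENCODERS and SKETCH_DECODERS and both helper functions are hypothetical. The hex codec here is a simplification of /ASCIIHexDecode (it omits the end-of-data marker), and none of this is pyHanko's filter machinery.

```python
import binascii
import zlib

# Hypothetical encoder/decoder pairs keyed by PDF filter name (stdlib only).
SKETCH_ENCODERS = {"/FlateDecode": zlib.compress, "/ASCIIHexDecode": binascii.hexlify}
SKETCH_DECODERS = {"/FlateDecode": zlib.decompress, "/ASCIIHexDecode": binascii.unhexlify}


def sketch_encode(data: bytes, filters: list) -> bytes:
    # Encoding walks the filter list from last to first.
    for name in reversed(filters):
        data = SKETCH_ENCODERS[name](data)
    return data


def sketch_decode(data: bytes, filters: list) -> bytes:
    # Decoding walks the filter list from first to last.
    for name in filters:
        data = SKETCH_DECODERS[name](data)
    return data


filters = ["/FlateDecode"]
# "Applying" a new filter prepends it to the existing ones.
filters.insert(0, "/ASCIIHexDecode")

payload = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = sketch_encode(payload, filters)
round_trip = sketch_decode(encoded, filters)
```

Because the hex filter was prepended, it runs last during encoding, so the encoded bytes consist of hexadecimal digits wrapping the compressed data, and decoding undoes the hex layer first.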
- compress()
Convenience method to add a /FlateDecode filter with default settings, if one is not already present.
Note: compression is not actually applied until the stream is written.
- property is_embedded_file_stream
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- pyhanko.pdf_utils.generic.read_object(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False) → PdfObject
Read a PDF object from an input stream.
Note
The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here for further information.
- Parameters
stream – An input stream.
container_ref – A reference to an object containing this one.
Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns
A PdfObject.
- pyhanko.pdf_utils.generic.pdf_name
alias of NameObject
- pyhanko.pdf_utils.generic.pdf_string(string: Union[str, bytes, bytearray]) → Union[ByteStringObject, TextStringObject]
Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.
- Parameters
string – A Python string.
- pyhanko.pdf_utils.generic.pdf_date(dt: datetime) → TextStringObject
Convert a datetime object into a PDF string. This function supports both timezone-aware and naive datetime objects.
- Parameters
dt – The datetime object to convert.
- Returns
A TextStringObject representing the datetime passed in.
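For orientation, the PDF date syntax has the general shape D:YYYYMMDDHHmmSS followed by an optional timezone suffix. The sketch below formats naive and timezone-aware datetimes into that shape with the standard library only. sketch_pdf_date is a hypothetical function, and its exact choices (Z for a zero offset, no suffix for naive values, a trailing apostrophe after the minutes) are assumptions that may differ from what pdf_date emits.

```python
from datetime import datetime, timedelta, timezone


def sketch_pdf_date(dt: datetime) -> str:
    """Hypothetical formatter for the PDF date syntax D:YYYYMMDDHHmmSS[offset]."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: no timezone suffix is emitted (assumption).
        return base
    total_seconds = int(offset.total_seconds())
    if total_seconds == 0:
        return base + "Z"
    sign = "-" if total_seconds < 0 else "+"
    hours, remainder = divmod(abs(total_seconds), 3600)
    minutes = remainder // 60
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"


moment = datetime(2020, 11, 1, 12, 30, 5)
naive = sketch_pdf_date(moment)
utc = sketch_pdf_date(moment.replace(tzinfo=timezone.utc))
east = sketch_pdf_date(moment.replace(tzinfo=timezone(timedelta(hours=5, minutes=30))))
west = sketch_pdf_date(moment.replace(tzinfo=timezone(timedelta(hours=-8))))
```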
- class pyhanko.pdf_utils.generic.TextStringEncoding(value)
Bases: Enum
Encodings for PDF text strings.
- PDF_DOC = None
PDFDocEncoding (one-byte character codes; PDF-specific).
- UTF16BE = (b'\xfe\xff', 'utf-16be')
UTF-16BE encoding.
- UTF8 = (b'\xef\xbb\xbf', 'utf-8')
UTF-8 encoding (PDF 2.0)
- UTF16LE = (b'\xff\xfe', 'utf-16le')
UTF-16LE encoding.
Note
This is strictly speaking invalid in PDF 2.0, but some authoring tools output such strings anyway (presumably due to the fact that it’s the default wide character encoding on Windows).
- encode(string: str) → bytes
Encode a string with BOM.
- Parameters
string – The string to encode.
- Returns
The encoded string.
- decode(string: Union[bytes, bytearray]) → str
Decode a string with BOM.
- Parameters
string – The string to decode.
- Returns
The decoded string.
- Raises
UnicodeDecodeError – Raised if decoding fails.
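The BOM-prefixed encodings listed for TextStringEncoding can be modelled with a small enum that reuses the (BOM, codec) pairs shown above. SketchTextEncoding and detect are hypothetical names. PDF_DOC is left out because the standard library has no PDFDocEncoding codec, and the ValueError raised for a missing BOM is a choice made for this sketch only.

```python
from enum import Enum


class SketchTextEncoding(Enum):
    """Hypothetical enum of (BOM, codec) pairs, using the values listed above."""
    UTF16BE = (b"\xfe\xff", "utf-16be")
    UTF8 = (b"\xef\xbb\xbf", "utf-8")
    UTF16LE = (b"\xff\xfe", "utf-16le")

    def encode(self, string: str) -> bytes:
        bom, codec = self.value
        # Prefix the encoded text with the byte-order mark.
        return bom + string.encode(codec)

    def decode(self, data: bytes) -> str:
        bom, codec = self.value
        if not data.startswith(bom):
            raise ValueError("missing byte-order mark")
        # Strip the BOM before decoding the remainder.
        return data[len(bom):].decode(codec)


def detect(data: bytes):
    """Return the first encoding whose BOM prefixes the data, or None."""
    for candidate in SketchTextEncoding:
        if data.startswith(candidate.value[0]):
            return candidate
    return None


text = "h\u00e9llo"
encoded = SketchTextEncoding.UTF16BE.encode(text)
detected = detect(encoded)
decoded = detected.decode(encoded)
```

Detection by BOM is what lets a reader record which encoding a string arrived in, in the spirit of the autodetected_encoding attribute on TextStringObject.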
- class pyhanko.pdf_utils.generic.EncryptedObjAccess(value)
Bases: Enum
Defines what to do when an encrypted object is encountered when retrieving an object from a container.
- PROXY = 0
Return the proxy object as-is, and leave further encryption/decryption handling to the caller.
- TRANSPARENT = 1
Transparently decrypt the proxy’s content (similarly wrapping any sub-containers in DecryptedObjectProxy, so this applies recursively).
Note
This is the default in most situations, since it’s the least likely to get in the way of any APIs that are not explicitly aware of content encryption concerns.
- RAW = 2
Return the underlying raw object as written, without attempting or deferring decryption.
- class pyhanko.pdf_utils.generic.DecryptedObjectProxy(raw_object: PdfObject, handler)
Bases: PdfObject
Internal proxy class that allows transparent on-demand encryption of objects.
Warning
Most public-facing APIs won’t leave you to deal with these directly (that’s half the reason this class exists in the first place), and the API of this class is considered internal.
However, for reasons related to the historical PyPDF2 codebase from which pyHanko’s object handling code ultimately derives, there are some Python builtins that might cause these wrapper objects to inadvertently “leak”. Please tell us about such cases so we can make those types of access more convenient and robust.
Danger
The __eq__ implementation on this class is not safe for general use, due to the fact that certain structures in PDF are exempt from encryption. Only compare proxy objects with == in areas of the document where these exemptions don’t apply.
- Parameters
raw_object – A raw object, typically as-parsed from a PDF file.
handler – The security handler governing this object.
- property decrypted: PdfObject
The decrypted PDF object exposed as a property.
If this object is a container object, its constituent parts will be wrapped in DecryptedObjectProxy as well, in order to defer further decryption until the values are requested through a getter method on the container.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- get_object()
Resolves indirect references.
- Returns
self, unless an instance of IndirectObject.
- property container_ref