pyhanko.pdf_utils.generic module¶
Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.
See here for the original license of the PyPDF2 project.
- class pyhanko.pdf_utils.generic.Dereferenceable¶
Bases: object
Represents an opaque reference to a PDF object associated with a PDF handler (see PdfHandler).
This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).
- get_object() → pyhanko.pdf_utils.generic.PdfObject ¶
Retrieve the PDF object backing this dereferenceable.
- Returns
A PdfObject.
- get_pdf_handler()¶
Return the PDF handler associated with this dereferenceable.
- Returns
A PdfHandler.
- class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: Optional[object] = None)¶
Bases: pyhanko.pdf_utils.generic.Dereferenceable
A reference to an object with a certain ID and generation number, with a PDF handler attached to it.
Warning
Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.
Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.
- idnum: int¶
The object’s ID.
- generation: int = 0¶
The object’s generation number (usually 0).
- pdf: object = None¶
The PDF handler associated with this reference, an instance of PdfHandler.
Warning
This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.
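To make the consequence of this warning concrete, here is a minimal sketch using only the standard library. MiniReference is a hypothetical stand-in that mimics the documented behaviour (the handler field is excluded from equality and hashing). It is not pyHanko's actual class, and the real implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class MiniReference:
    """Hypothetical stand-in for Reference; not pyHanko's implementation."""
    idnum: int
    generation: int = 0
    # Excluded from __eq__ and __hash__, mirroring the documented behaviour.
    pdf: Optional[object] = field(default=None, compare=False, hash=False)


# Two unrelated "handlers" (plain objects here, purely for illustration).
handler_a, handler_b = object(), object()
ref_a = MiniReference(4, 0, handler_a)
ref_b = MiniReference(4, 0, handler_b)

# Same ID and generation, different handlers: still equal, and they collide
# when used as dictionary keys or set members.
same = ref_a == ref_b
distinct_keys = len({ref_a, ref_b})
```

In this sketch, a dictionary keyed on such references silently merges entries that come from different documents, which is the kind of mix-up the warning asks API users to avoid.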
- get_object() → pyhanko.pdf_utils.generic.PdfObject ¶
Retrieve the PDF object backing this dereferenceable.
- Returns
A PdfObject.
- get_pdf_handler()¶
Return the PDF handler associated with this dereferenceable.
- Returns
A PdfHandler.
- class pyhanko.pdf_utils.generic.TrailerReference(reader)¶
Bases: pyhanko.pdf_utils.generic.Dereferenceable
A reference to the trailer of a PDF document.
Warning
Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.
- get_object() → pyhanko.pdf_utils.generic.PdfObject ¶
Retrieve the PDF object backing this dereferenceable.
- Returns
A PdfObject.
- get_pdf_handler()¶
Return the PDF handler associated with this dereferenceable.
- Returns
A PdfHandler.
- class pyhanko.pdf_utils.generic.PdfObject¶
Bases: object
Superclass for all PDF objects.
- container_ref: pyhanko.pdf_utils.generic.Dereferenceable = None¶
For objects read from a file, container_ref points to the unique addressable object containing this object.
Note
Consider the following object definition in a PDF file:
4 0 obj << /Foo (Bar) >>
This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.
If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.
- get_container_ref() → pyhanko.pdf_utils.generic.Dereferenceable ¶
Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.
- get_object()¶
Resolves indirect references.
- Returns
self, unless an instance of IndirectObject.
- write_to_stream(stream, handler=None, container_ref: Optional[pyhanko.pdf_utils.generic.Reference] = None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)¶
Bases: pyhanko.pdf_utils.generic.PdfObject, pyhanko.pdf_utils.generic.Dereferenceable
Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.
Warning
For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.
- get_object()¶
- Returns
The PDF object this reference points to.
- get_pdf_handler()¶
Return the PDF handler associated with this dereferenceable.
- Returns
A PdfHandler.
- property idnum: int¶
- Returns
the object ID of this reference.
- property generation¶
- Returns
the generation number of this reference.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable)¶
- class pyhanko.pdf_utils.generic.NullObject¶
Bases: pyhanko.pdf_utils.generic.PdfObject
PDF null object.
All instances are treated as equal and falsy.
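As a rough illustration of what "equal and falsy" means in practice, the toy class below reproduces those two properties with plain Python. ToyNull is a hypothetical name; this is a sketch of the described semantics, not NullObject's actual source.

```python
class ToyNull:
    """Hypothetical stand-in illustrating 'equal and falsy' semantics."""

    def __eq__(self, other):
        # Every instance compares equal to every other instance.
        return isinstance(other, ToyNull)

    def __hash__(self):
        # Defining __eq__ removes the default __hash__, so restore a
        # consistent one (an assumption of this sketch).
        return hash(ToyNull)

    def __bool__(self):
        # Instances are falsy, like None.
        return False


first, second = ToyNull(), ToyNull()
value_or_default = first or "fallback"
```

One practical consequence: a truthiness check such as `if value:` cannot distinguish such a null from other falsy values (an empty string or a zero), so an explicit type or equality check is the safer test.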
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)¶
- class pyhanko.pdf_utils.generic.BooleanObject(value)¶
Bases: pyhanko.pdf_utils.generic.PdfObject
PDF boolean value.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)¶
- class pyhanko.pdf_utils.generic.FloatObject(value='0', context=None)¶
Bases: decimal.Decimal, pyhanko.pdf_utils.generic.PdfObject
PDF Float object.
Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).
- as_numeric()¶
- Returns
a Python float value for this object.
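Since FloatObject derives from decimal.Decimal, the difference from binary floats can be shown with Decimal alone. The sketch below deliberately uses plain Decimal rather than FloatObject, and assumes that converting with float() is analogous to what as_numeric() is documented to return.

```python
from decimal import Decimal

# Decimal arithmetic keeps the decimal digits exact.
width = Decimal("0.1") + Decimal("0.2")

# The same sum with binary floats picks up a rounding error.
float_sum = 0.1 + 0.2

# Converting to a Python float (assumed analogous to as_numeric()).
as_float = float(width)
```

Here `width` equals `Decimal("0.3")` exactly, whereas `float_sum` does not equal `0.3`. Precision is only lost at the point of conversion to float.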
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.NumberObject(value)¶
Bases: int, pyhanko.pdf_utils.generic.PdfObject
PDF number object. This is the PDF type for integer values.
- NumberPattern = re.compile(b'[^+-.0-9]')¶
- ByteDot = b'.'¶
- as_numeric()¶
- Returns
a Python int value for this object.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)¶
- class pyhanko.pdf_utils.generic.ByteStringObject¶
Bases: bytes, pyhanko.pdf_utils.generic.PdfObject
PDF bytestring class.
- property original_bytes¶
For compatibility with TextStringObject.original_bytes.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.TextStringObject¶
Bases: str, pyhanko.pdf_utils.generic.PdfObject
PDF text string object.
- autodetect_pdfdocencoding = False¶
If True, this string was determined to be encoded in PDFDoc encoding.
- autodetect_utf16 = False¶
If True, this string was determined to be encoded in UTF-16BE encoding.
- property original_bytes¶
Retrieve the original bytes of the string as specified in the source file.
This may be necessary if this string was misidentified as a text string.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.NameObject¶
Bases: str, pyhanko.pdf_utils.generic.PdfObject
PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.
- DELIMITER_PATTERN = re.compile(b'\\s+|[\\(\\)<>\\[\\]{}/%]')¶
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)¶
- class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)¶
Bases: list, pyhanko.pdf_utils.generic.PdfObject
PDF array object. This class extends Python’s list class, and supports its interface.
Warning
Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability & consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.
That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.
- raw_get(index, decrypt=True)¶
Get a value from an array without dereferencing. In other words, if the value at the given index is of type IndirectObject, the indirect reference will not be resolved.
- Parameters
index – Index to look up in the array.
decrypt – If False, instances of DecryptedObjectProxy will be returned as-is. If True, they will be decrypted. Default True.
- Returns
A PdfObject.
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref)¶
- class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)¶
Bases: dict, pyhanko.pdf_utils.generic.PdfObject
A PDF dictionary object.
Keys in a PDF dictionary are PDF names, and values are PDF objects.
When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.
- raw_get(key, decrypt=True)¶
Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.
- Parameters
key – Key to look up in the dictionary.
decrypt – If False, instances of DecryptedObjectProxy will be returned as-is. If True, they will be decrypted. Default True.
- Returns
A PdfObject.
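The difference between __getitem__() and raw_get() can be modelled in a few lines of plain Python. ToyIndirect and ToyDictionary below are hypothetical stand-ins written for this illustration; they do not use pyHanko and leave out decryption entirely.

```python
class ToyIndirect:
    """Hypothetical stand-in for an indirect reference."""

    def __init__(self, target):
        self.target = target

    def get_object(self):
        return self.target


class ToyDictionary(dict):
    """Hypothetical stand-in that dereferences on normal item access."""

    def __getitem__(self, key):
        value = dict.__getitem__(self, key)
        # Resolve indirect references transparently.
        if isinstance(value, ToyIndirect):
            return value.get_object()
        return value

    def raw_get(self, key):
        # Return the stored value untouched, reference included.
        return dict.__getitem__(self, key)


pages = {"/Count": 3}
catalog = ToyDictionary({"/Pages": ToyIndirect(pages)})

resolved = catalog["/Pages"]            # the target dictionary
unresolved = catalog.raw_get("/Pages")  # the reference wrapper itself
```

The raw form is useful when the reference itself matters, for example when copying a pointer to an object without pulling in the object's contents.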
- setdefault(key, value=None)¶
Insert key with the given value if key is not in the dictionary.
Return the existing value for key if key is in the dictionary, else value.
- get_and_apply(key, function: Callable[[pyhanko.pdf_utils.generic.PdfObject], Any], *, raw=False, default=None)¶
- get_value_as_reference(key, optional=False) → pyhanko.pdf_utils.generic.Reference ¶
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable)¶
- class pyhanko.pdf_utils.generic.StreamObject(dict_data=None, stream_data=None, encoded_data=None, handler=None)¶
Bases: pyhanko.pdf_utils.generic.DictionaryObject
PDF stream object.
Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).
A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.
Note that the StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.
- add_crypt_filter(name='/Identity', params=None, handler=None)¶
- strip_filters()¶
Ensure the stream is decoded, and remove any filters.
- property data: bytes¶
Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.
- Raises
misc.PdfStreamError – If the stream could not be decoded.
- property encoded_data: bytes¶
Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.
- Raises
misc.PdfStreamError – If the stream could not be encoded.
- apply_filter(filter_name, params=None, allow_duplicates: Optional[bool] = True)¶
Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that it is placed last in the encoding order, but first in the decoding order.
Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.
- Parameters
filter_name – Name of the filter (see DECODERS)
params – Parameters to the filter (will be written to /DecodeParms if not None)
allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.
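The ordering rule (prepended filters are applied last when encoding and first when decoding) is easiest to see in a small standalone model. The CODECS table and the encode/decode helpers below are assumptions made for this sketch, built on the standard library's zlib and binascii. They are not part of pyHanko's API.

```python
import binascii
import zlib

# Hypothetical codec table: filter name -> (encoder, decoder).
CODECS = {
    "/FlateDecode": (zlib.compress, zlib.decompress),
    "/ASCIIHexDecode": (binascii.hexlify, binascii.unhexlify),
}


def encode(data: bytes, filters: list) -> bytes:
    # Encoding walks the filter list back to front.
    for name in reversed(filters):
        data = CODECS[name][0](data)
    return data


def decode(data: bytes, filters: list) -> bytes:
    # Decoding walks the filter list front to back.
    for name in filters:
        data = CODECS[name][1](data)
    return data


filters = ["/FlateDecode"]
# "Applying" a new filter prepends it to the list.
filters.insert(0, "/ASCIIHexDecode")

payload = b"hello world " * 10
encoded = encode(payload, filters)
roundtrip = decode(encoded, filters)
```

Because the hex filter sits first in the list, it is the outermost layer: the encoded bytes are plain hexadecimal text wrapping the compressed data, and a decoder has to undo the hex layer before it can inflate.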
- compress()¶
Convenience method to add a /FlateDecode filter with default settings, if one is not already present.
Note: compression is not actually applied until the stream is written.
- property is_embedded_file_stream¶
- write_to_stream(stream, handler=None, container_ref=None)¶
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- pyhanko.pdf_utils.generic.read_object(stream, container_ref: pyhanko.pdf_utils.generic.Dereferenceable) → pyhanko.pdf_utils.generic.PdfObject ¶
Read a PDF object from an input stream.
Note
The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here for further information.
- Parameters
stream – An input stream.
container_ref – A reference to an object containing this one.
Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.
- Returns
A PdfObject.
- pyhanko.pdf_utils.generic.pdf_name¶
alias of pyhanko.pdf_utils.generic.NameObject
- pyhanko.pdf_utils.generic.pdf_string(string) → Union[pyhanko.pdf_utils.generic.ByteStringObject, pyhanko.pdf_utils.generic.TextStringObject] ¶
Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.
- Parameters
string – A Python string.
- pyhanko.pdf_utils.generic.pdf_date(dt: datetime.datetime) → pyhanko.pdf_utils.generic.TextStringObject ¶
Convert a datetime object into a PDF string. This function supports both timezone-aware and naive datetimes.
- Parameters
dt – The datetime object to convert.
- Returns
A TextStringObject representing the datetime passed in.
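For orientation, the sketch below formats a datetime using the conventional PDF date layout (D:YYYYMMDDHHmmSS followed by an optional UTC offset). format_pdf_date is a hypothetical helper written for this illustration; the exact string produced by pdf_date, in particular how it renders the offset, is an assumption and may differ.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt: datetime) -> str:
    """Hypothetical helper: render dt in the conventional PDF date layout."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: no timezone suffix at all.
        return base
    total_minutes = int(offset.total_seconds() // 60)
    if total_minutes == 0:
        return base + "Z"
    sign = "+" if total_minutes > 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"


naive = format_pdf_date(datetime(2021, 3, 5, 14, 30, 0))
aware = format_pdf_date(
    datetime(2021, 3, 5, 14, 30, 0, tzinfo=timezone(timedelta(hours=1)))
)
utc = format_pdf_date(datetime(2021, 3, 5, 14, 30, 0, tzinfo=timezone.utc))
```

A naive datetime carries no offset information, so this sketch omits the suffix rather than guessing a timezone.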