pyhanko.pdf_utils package
Subpackages
- pyhanko.pdf_utils.crypt package
- Submodules
- pyhanko.pdf_utils.crypt.api module
PdfKeyNotAvailableError
AuthStatus
PdfMacStatus
AuthResult
SecurityHandlerVersion
SecurityHandlerVersion.RC4_40
SecurityHandlerVersion.RC4_LONGER_KEYS
SecurityHandlerVersion.RC4_OR_AES128
SecurityHandlerVersion.AES256
SecurityHandlerVersion.AES_GCM
SecurityHandlerVersion.OTHER
SecurityHandlerVersion.as_pdf_object()
SecurityHandlerVersion.from_number()
SecurityHandlerVersion.check_key_length()
SecurityHandler
SecurityHandler.register()
SecurityHandler.build()
SecurityHandler.get_name()
SecurityHandler.extract_credential()
SecurityHandler.support_generic_subfilters()
SecurityHandler.instantiate_from_pdf_object()
SecurityHandler.is_authenticated()
SecurityHandler.as_pdf_object()
SecurityHandler.authenticate()
SecurityHandler.get_string_filter()
SecurityHandler.get_stream_filter()
SecurityHandler.get_embedded_file_filter()
SecurityHandler.get_file_encryption_key()
SecurityHandler.get_kdf_salt()
SecurityHandler.pdf_mac_enabled
SecurityHandler.read_cf_dictionary()
SecurityHandler.process_crypt_filters()
SecurityHandler.register_crypt_filter()
SecurityHandler.get_min_pdf_version()
SecurityHandler.get_extensions()
CryptFilter
IdentityCryptFilter
CryptFilterConfiguration
CryptFilterConfiguration.filters()
CryptFilterConfiguration.set_security_handler()
CryptFilterConfiguration.get_for_stream()
CryptFilterConfiguration.get_for_string()
CryptFilterConfiguration.get_for_embedded_file()
CryptFilterConfiguration.stream_filter_name
CryptFilterConfiguration.string_filter_name
CryptFilterConfiguration.embedded_file_filter_name
CryptFilterConfiguration.as_pdf_object()
CryptFilterConfiguration.standard_filters()
build_crypt_filter()
- pyhanko.pdf_utils.crypt.cred_ser module
- pyhanko.pdf_utils.crypt.filter_mixins module
- pyhanko.pdf_utils.crypt.permissions module
PdfPermissions
StandardPermissions
StandardPermissions.ALLOW_PRINTING
StandardPermissions.ALLOW_MODIFICATION_GENERIC
StandardPermissions.ALLOW_CONTENT_EXTRACTION
StandardPermissions.ALLOW_ANNOTS_FORM_FILLING
StandardPermissions.ALLOW_FORM_FILLING
StandardPermissions.ALLOW_ASSISTIVE_TECHNOLOGY
StandardPermissions.ALLOW_REASSEMBLY
StandardPermissions.ALLOW_HIGH_QUALITY_PRINTING
StandardPermissions.TOLERATE_MISSING_PDF_MAC
StandardPermissions.as_uint32()
StandardPermissions.mac_required()
PubKeyPermissions
PubKeyPermissions.ALLOW_ENCRYPTION_CHANGE
PubKeyPermissions.ALLOW_PRINTING
PubKeyPermissions.ALLOW_MODIFICATION_GENERIC
PubKeyPermissions.ALLOW_CONTENT_EXTRACTION
PubKeyPermissions.ALLOW_ANNOTS_FORM_FILLING
PubKeyPermissions.ALLOW_FORM_FILLING
PubKeyPermissions.ALLOW_ASSISTIVE_TECHNOLOGY
PubKeyPermissions.ALLOW_REASSEMBLY
PubKeyPermissions.ALLOW_HIGH_QUALITY_PRINTING
PubKeyPermissions.TOLERATE_MISSING_PDF_MAC
PubKeyPermissions.as_uint32()
PubKeyPermissions.mac_required()
- pyhanko.pdf_utils.crypt.pubkey module
RecipientEncryptionPolicy
PubKeyCryptFilter
PubKeyAESCryptFilter
PubKeyAESGCMCryptFilter
PubKeyRC4CryptFilter
DEFAULT_CRYPT_FILTER
DEF_EMBEDDED_FILE
PubKeyAdbeSubFilter
construct_envelope_content()
construct_recipient_cms()
InappropriateCredentialError
EnvelopeKeyDecrypter
ECCCMSSharedInfo
SimpleEnvelopeKeyDecrypter
read_envelope_key()
read_seed_from_recipient_cms()
PubKeySecurityHandler
PubKeySecurityHandler.build_from_certs()
PubKeySecurityHandler.get_name()
PubKeySecurityHandler.support_generic_subfilters()
PubKeySecurityHandler.read_cf_dictionary()
PubKeySecurityHandler.process_crypt_filters()
PubKeySecurityHandler.gather_pub_key_metadata()
PubKeySecurityHandler.instantiate_from_pdf_object()
PubKeySecurityHandler.as_pdf_object()
PubKeySecurityHandler.add_recipients()
PubKeySecurityHandler.authenticate()
PubKeySecurityHandler.get_file_encryption_key()
- pyhanko.pdf_utils.crypt.standard module
StandardSecuritySettingsRevision
StandardSecuritySettingsRevision.RC4_BASIC
StandardSecuritySettingsRevision.RC4_EXTENDED
StandardSecuritySettingsRevision.RC4_OR_AES128
StandardSecuritySettingsRevision.AES256
StandardSecuritySettingsRevision.AES_GCM
StandardSecuritySettingsRevision.OTHER
StandardSecuritySettingsRevision.as_pdf_object()
StandardSecuritySettingsRevision.from_number()
StandardCryptFilter
StandardAESCryptFilter
StandardAESGCMCryptFilter
StandardRC4CryptFilter
StandardSecurityHandler
StandardSecurityHandler.get_name()
StandardSecurityHandler.build_from_pw_legacy()
StandardSecurityHandler.build_from_pw()
StandardSecurityHandler.gather_encryption_metadata()
StandardSecurityHandler.instantiate_from_pdf_object()
StandardSecurityHandler.pdf_mac_enabled
StandardSecurityHandler.as_pdf_object()
StandardSecurityHandler.authenticate()
StandardSecurityHandler.get_file_encryption_key()
- Module contents
- pyhanko.pdf_utils.font package
- Submodules
- pyhanko.pdf_utils.font.api module
- pyhanko.pdf_utils.font.basic module
- pyhanko.pdf_utils.font.opentype module
GlyphAccumulator
GlyphAccumulatorFactory
GlyphAccumulatorFactory.font_file
GlyphAccumulatorFactory.font_size
GlyphAccumulatorFactory.ot_script_tag
GlyphAccumulatorFactory.ot_language_tag
GlyphAccumulatorFactory.writing_direction
GlyphAccumulatorFactory.bcp47_lang_code
GlyphAccumulatorFactory.create_objstream_if_needed
GlyphAccumulatorFactory.create_font_engine()
- pyhanko.pdf_utils.metadata package
- Submodules
- pyhanko.pdf_utils.metadata.info module
- pyhanko.pdf_utils.metadata.model module
DocumentMetadata
VENDOR
MetaString
ExpandedName
Qualifiers
XmpValue
XmpStructure
XmpArrayType
XmpArray
NS
XML_LANG
RDF_RDF
RDF_SEQ
RDF_BAG
RDF_ALT
RDF_LI
RDF_VALUE
RDF_RESOURCE
RDF_PARSE_TYPE
RDF_ABOUT
RDF_DESCRIPTION
DC_TITLE
DC_CREATOR
DC_DESCRIPTION
PDF_PRODUCER
PDF_KEYWORDS
X_XMPMETA
X_XMPTK
XMP_CREATORTOOL
XMP_CREATEDATE
XMP_MODDATE
- pyhanko.pdf_utils.metadata.xmp_xml module
- Module contents
Submodules
pyhanko.pdf_utils.barcodes module
- class pyhanko.pdf_utils.barcodes.BarcodeBox(barcode_type, code)
Bases:
PdfContent
Thin wrapper around python-barcode functionality.
This will render a barcode of the specified type as PDF graphics operators.
- render() bytes
Compile the content to graphics operators.
- class pyhanko.pdf_utils.barcodes.PdfStreamBarcodeWriter
Bases:
BaseWriter
Implementation of writer class for the python-barcode library to output PDF graphics operators. Note: _paint_text is intentionally dummied out. Please use the functionality implemented in pyhanko.pdf_utils.text instead.
- property command_stream: bytes
- save(filename, output)
Saves the rendered output to filename.
- Parameters:
filename (String) – Filename without extension.
output (String) – The rendered output.
- Returns:
The full filename with extension.
- Return type:
String
pyhanko.pdf_utils.content module
- class pyhanko.pdf_utils.content.ResourceType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enum listing resources that can be used as keys in a resource dictionary.
See ISO 32000-1, § 7.8.3 Table 34.
- EXT_G_STATE = '/ExtGState'
External graphics state specifications. See ISO 32000-1, § 8.4.5.
- COLOR_SPACE = '/ColorSpace'
Colour space definitions. See ISO 32000-1, § 8.6.
- PATTERN = '/Pattern'
Pattern definitions. See ISO 32000-1, § 8.7.
- SHADING = '/Shading'
Shading definitions. See ISO 32000-1, § 8.7.4.3.
- XOBJECT = '/XObject'
External object definitions (images and form XObjects). See ISO 32000-1, § 8.8.
- FONT = '/Font'
Font specifications. See ISO 32000-1, § 9.
- PROPERTIES = '/Properties'
Marked content properties. See ISO 32000-1, § 14.6.2.
- exception pyhanko.pdf_utils.content.ResourceManagementError
Bases:
ValueError
Used to signal problems with resource dictionaries.
- class pyhanko.pdf_utils.content.PdfResources
Bases:
object
Representation of a PDF resource dictionary.
This class implements __getitem__() with ResourceType keys for dynamic access to its attributes. To merge two instances of PdfResources into one another, the class overrides __iadd__(), so you can write res1 += res2.
Note: Merging two resource dictionaries with conflicting resource names will produce a ResourceManagementError.
Note: This class is currently only used for new resource dictionaries.
- as_pdf_object() DictionaryObject
Render this instance of
PdfResources
to an actual resource dictionary.
- class pyhanko.pdf_utils.content.PdfContent(resources: PdfResources | None = None, box: BoxConstraints | None = None, writer: BasePdfFileWriter | None = None)
Bases:
object
Abstract representation of part of a PDF content stream.
Warning
Whether PdfContent instances can be reused or not is left up to the subclasses.
- writer = None
The __init__() method comes with an optional writer parameter that can be used to let subclasses register external resources with the writer by themselves.
It can also be set after the fact by calling set_writer().
- set_resource(category: ResourceType, name: NameObject, value: PdfObject)
Set a value in the resource dictionary associated with this content fragment.
- Parameters:
category – The resource category to which the resource belongs.
name – The resource’s (internal) name.
value – The resource’s value.
- import_resources(resources: PdfResources)
Import resources from another resource dictionary.
- Parameters:
resources – An instance of PdfResources.
- Raises:
ResourceManagementError – Raised when there is a resource name conflict.
- property resources: PdfResources
- Returns:
The
PdfResources
instance associated with this content fragment.
- render() bytes
Compile the content to graphics operators.
- as_form_xobject() StreamObject
Render the object to a form XObject to be referenced by another content stream. See ISO 32000-1, § 8.8.
Note: Even if writer is set, the resulting form XObject will not be registered. This is left up to the caller.
- Returns:
A StreamObject instance representing the resulting form XObject.
- set_writer(writer)
Override the currently registered writer object.
- Parameters:
writer – An instance of
BasePdfFileWriter
.
- add_to_page(writer: BasePdfFileWriter, page_ix: int, prepend: bool = False)
Convenience wrapper around BasePdfFileWriter.add_stream_to_page() to turn a PdfContent instance into a page content stream.
- Parameters:
writer – A PDF file writer.
page_ix – Index of the page to modify. The first page has index 0.
prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.
- Returns:
An
IndirectObject
reference to the page object that was modified.
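The effect of prepend follows from the order in which a page's content streams are painted: earlier streams are drawn first, so later ones end up on top. A minimal sketch of that ordering, using a hypothetical add_stream helper on a plain list rather than a real page object:

```python
def add_stream(contents: list, stream: bytes, prepend: bool = False) -> list:
    """Return a new content-stream list with `stream` added."""
    # Prepending puts the stream first, so it is painted before (and
    # therefore underneath) everything that was already on the page.
    return [stream] + contents if prepend else contents + [stream]


existing = [b'page content']
stamped = add_stream(existing, b'stamp')                       # drawn on top
with_background = add_stream(existing, b'background', prepend=True)
```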
- class pyhanko.pdf_utils.content.RawContent(data: bytes, resources: PdfResources | None = None, box: BoxConstraints | None = None)
Bases:
PdfContent
Raw byte sequence to be used as PDF content.
- render() bytes
Compile the content to graphics operators.
- class pyhanko.pdf_utils.content.ImportedPdfPage(file_name, page_ix=0)
Bases:
PdfContent
Import a page from another PDF file (lazily)
- render() bytes
Compile the content to graphics operators.
pyhanko.pdf_utils.embed module
Utility classes for handling embedded files in PDFs.
Added in version 0.7.0.
- pyhanko.pdf_utils.embed.embed_file(pdf_writer: BasePdfFileWriter, spec: FileSpec)
Embed a file in the document-wide embedded file registry of a PDF writer.
- Parameters:
pdf_writer – PDF writer to house the embedded file.
spec – File spec describing the embedded file.
- Returns:
- class pyhanko.pdf_utils.embed.EmbeddedFileObject(pdf_writer: BasePdfFileWriter, dict_data=None, stream_data=None, encoded_data=None, params: EmbeddedFileParams | None = None, mime_type: str | None = None)
Bases:
StreamObject
- classmethod from_file_data(pdf_writer: BasePdfFileWriter, data: bytes, compress=True, params: EmbeddedFileParams | None = None, mime_type: str | None = None) EmbeddedFileObject
Construct an embedded file object from file data.
This is a very thin wrapper around the constructor, with a slightly less intimidating API.
Note
This method will not register the embedded file into the document’s embedded file namespace, see embed_file().
- Parameters:
pdf_writer – PDF writer to use.
data – File contents, as a bytes object.
compress – Whether to compress the embedded file’s contents.
params – Optional embedded file parameters.
mime_type – Optional MIME type string.
- Returns:
An embedded file object.
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.embed.EmbeddedFileParams(embed_size: bool = True, embed_checksum: bool = True, creation_date: datetime.datetime | None = None, modification_date: datetime.datetime | None = None)
Bases:
object
- embed_size: bool = True
If true, record the file size of the embedded file.
Note
This value is computed over the file content before PDF filters are applied. This may have performance implications in cases where the file stream contents are presented in pre-encoded form.
- embed_checksum: bool = True
If true, add an MD5 checksum of the file contents.
Note
This value is computed over the file content before PDF filters are applied. This may have performance implications in cases where the file stream contents are presented in pre-encoded form.
- creation_date: datetime | None = None
Record the creation date of the embedded file.
- modification_date: datetime | None = None
Record the modification date of the embedded file.
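The notes on embed_size and embed_checksum say that both values are computed over the content before PDF filters are applied. A stdlib-only sketch of that ordering follows; describe_payload is a hypothetical helper, and zlib stands in for a Flate stream filter.

```python
import hashlib
import zlib


def describe_payload(data: bytes) -> dict:
    """Compute size and MD5 over the raw bytes, then compress them."""
    size = len(data)                        # size of the unfiltered content
    checksum = hashlib.md5(data).digest()   # MD5 of the unfiltered content
    encoded = zlib.compress(data)           # filter applied afterwards
    return {'size': size, 'checksum': checksum, 'encoded': encoded}


info = describe_payload(b'hello world' * 100)
```

If the data is supplied already encoded, the raw bytes have to be recovered by decoding before either value can be computed, which is the performance cost the notes refer to.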
- class pyhanko.pdf_utils.embed.FileSpec(file_spec_string: str, file_name: str | None = None, embedded_data: EmbeddedFileObject | None = None, description: str | None = None, af_relationship: NameObject | None = None, f_related_files: List[RelatedFileSpec] | None = None, uf_related_files: List[RelatedFileSpec] | None = None)
Bases:
object
Dataclass modelling an embedded file description in a PDF.
- file_spec_string: str
A path-like file specification string, or URL.
Note
For backwards compatibility, this string should be encodable in PDFDocEncoding. For names that require general Unicode support, refer to file_name.
- file_name: str | None = None
A path-like Unicode file name.
- embedded_data: EmbeddedFileObject | None = None
Reference to a stream object containing the file’s data, as embedded in the PDF file.
- description: str | None = None
Textual description of the file.
- af_relationship: NameObject | None = None
Associated file relationship specifier.
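- f_related_files: List[RelatedFileSpec] | None = None
Related files with PDFDocEncoded names.
- uf_related_files: List[RelatedFileSpec] | None = None
Related files with Unicode-encoded names.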
- as_pdf_object() DictionaryObject
Represent the file spec as a PDF dictionary.
- class pyhanko.pdf_utils.embed.RelatedFileSpec(name: str, embedded_data: EmbeddedFileObject)
Bases:
object
Dataclass modelling a RelatedFile construct in PDF.
- name: str
Name of the related file.
Note
The encoding requirements of this field depend on whether the related file is included via the /F or /UF key.
- embedded_data: EmbeddedFileObject
Reference to a stream object containing the file’s data, as embedded in the PDF file.
- pyhanko.pdf_utils.embed.wrap_encrypted_payload(plaintext_payload: bytes, *, password: str | None = None, certs: List[Certificate] | None = None, security_handler: SecurityHandler | None = None, file_spec_string: str = 'attachment.pdf', params: EmbeddedFileParams | None = None, file_name: str | None = None, description='Wrapped document', include_explanation_page=True) PdfFileWriter
Include a PDF document as an encrypted attachment in a wrapper document.
This function sets certain flags in the wrapper document’s collection dictionary to instruct compliant PDF viewers to display the attachment instead of the wrapping document. Viewers that do not fully support PDF collections will display a landing page instead, explaining how to open the attachment manually.
Using this method mitigates some weaknesses in the PDF standard’s encryption provisions, and makes it harder to manipulate the encrypted attachment without knowing the encryption key.
Danger
Until PDF supports authenticated encryption mechanisms, this is a mitigation strategy, not a foolproof defence mechanism.
Warning
While users of viewers that do not support PDF collections can still open the attached file manually, the viewer still has to support PDF files where only the attachments are encrypted.
Note
This is not quite the same as the “unencrypted wrapper document” pattern discussed in the PDF 2.0 specification. The latter is intended to support nonstandard security handlers. This function uses a standard security handler on the wrapping document to encrypt the attachment as a binary blob. Moreover, the functionality in this function is available in PDF 1.7 viewers as well.
- Parameters:
plaintext_payload – The plaintext payload (a binary representation of a PDF document).
security_handler – The security handler to use on the wrapper document. If None, a security handler will be constructed based on the password or certs parameter.
password – Password to encrypt the attachment with. Will be ignored if security_handler is provided.
certs – Encrypt the file using PDF public-key encryption, targeting the keys in the provided certificates. Will be ignored if security_handler is provided.
file_spec_string – PDFDocEncoded file spec string for the attachment.
params – Embedded file parameters to use.
file_name – Unicode file name for the attachment.
description – Description for the attachment.
include_explanation_page – If False, do not generate an explanation page in the wrapper document. This setting could be useful if you want to customise the wrapper document’s behaviour yourself.
- Returns:
A
PdfFileWriter
representing the wrapper document.
pyhanko.pdf_utils.extensions module
- class pyhanko.pdf_utils.extensions.DevExtensionMultivalued(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Setting indicating how an extension is expected to behave w.r.t. the new mechanism for multivalued extensions in ISO 32000-2:2020.
- ALWAYS = 1
Always serialise this extension as a multivalued extension.
- NEVER = 2
Never serialise this extension as a multivalued extension.
- MAYBE = 3
Make this extension single-valued whenever possible, but allow multiple values as well, e.g. when a different but non-comparable extension with the same prefix is already present in the file.
- class pyhanko.pdf_utils.extensions.DeveloperExtension(prefix_name: NameObject, base_version: NameObject, extension_level: int, url: str | None = None, extension_revision: str | None = None, compare_by_level: bool = False, subsumed_by: Iterable[int] = (), subsumes: Iterable[int] = (), multivalued: DevExtensionMultivalued = DevExtensionMultivalued.MAYBE)
Bases:
object
PDF developer extension designation.
- prefix_name: NameObject
Registered developer prefix.
- base_version: NameObject
Base version to which the extension applies.
- extension_level: int
Extension level.
- url: str | None = None
Optional URL linking to the extension’s documentation.
- extension_revision: str | None = None
Optional extra revision information. Not comparable.
- compare_by_level: bool = False
Compare developer extensions by level number. If this value is True and a copy of this extension already exists in the target file with a higher level number, do not override it. If one exists with a lower level number, override it.
If this value is False, the decision is based on subsumed_by and subsumes.
Warning
It is generally not safe to assume that extension levels are used as a versioning system (i.e. that higher extension levels supersede lower ones), which is why the default is False.
- subsumed_by: Iterable[int] = ()
List of extension levels that would subsume this one. If one of these is present in the extensions dictionary, attempting to register this extension will not override it.
Default value: empty.
Warning
This parameter is ignored if compare_by_level is True.
- subsumes: Iterable[int] = ()
List of extensions explicitly subsumed by this one. If one of these is present in the extensions dictionary, attempting to register this extension will override it.
Default value: empty.
Warning
This parameter is ignored if compare_by_level is True.
- multivalued: DevExtensionMultivalued = 3
Setting indicating whether this extension is expected to behave well w.r.t. the new mechanism for multivalued extensions in ISO 32000-2:2020.
- as_pdf_object() DictionaryObject
Format the data in this object into a PDF dictionary for registration into the /Extensions dictionary.
- Returns:
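The interplay between compare_by_level, subsumed_by and subsumes can be summarised in a small decision function. The sketch below is stdlib-only and hypothetical: MiniExtension and decide are illustrative names, not pyHanko API, and the handling of equal levels is an assumption, since the text above does not specify it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MiniExtension:
    """Hypothetical stand-in carrying only the comparison-related fields."""
    extension_level: int
    compare_by_level: bool = False
    subsumed_by: Iterable[int] = ()
    subsumes: Iterable[int] = ()


def decide(new: MiniExtension, existing_level: int) -> str:
    """Return 'override', 'keep' or 'incomparable' for an existing level."""
    if new.compare_by_level:
        # subsumed_by / subsumes are ignored in this mode.
        # Assumption: equal levels leave the existing entry alone.
        if existing_level < new.extension_level:
            return 'override'
        return 'keep'
    if existing_level in new.subsumed_by:
        return 'keep'
    if existing_level in new.subsumes:
        return 'override'
    return 'incomparable'
```

The 'incomparable' outcome corresponds to the situation in which a DevExtensionMultivalued.MAYBE extension would end up alongside the existing one rather than replacing it.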
pyhanko.pdf_utils.filters module
Implementation of stream filters for PDF.
Taken from PyPDF2 with modifications. See here for the original license of the PyPDF2 project.
Note that not all decoders specified in the standard are supported.
In particular, /LZWDecode and the various JPEG-based decoders are missing.
- class pyhanko.pdf_utils.filters.Decoder
Bases:
object
General filter/decoder interface.
- decode(data: bytes, decode_params: dict) bytes
Decode a stream.
- Parameters:
data – Data to decode.
decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Decoded data.
- encode(data: bytes, decode_params: dict) bytes
Encode a stream.
- Parameters:
data – Data to encode.
decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Encoded data.
- class pyhanko.pdf_utils.filters.ASCII85Decode
Bases:
Decoder
Implementation of the base 85 encoding scheme specified in ISO 32000-1.
- encode(data: bytes, decode_params=None) bytes
Encode a stream.
- Parameters:
data – Data to encode.
decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Encoded data.
- decode(data, decode_params=None)
Decode a stream.
- Parameters:
data – Data to decode.
decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Decoded data.
- class pyhanko.pdf_utils.filters.ASCIIHexDecode
Bases:
Decoder
Wrapper around binascii.hexlify() that implements the Decoder interface.
- encode(data: bytes, decode_params=None) bytes
Encode a stream.
- Parameters:
data – Data to encode.
decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Encoded data.
- decode(data, decode_params=None)
Decode a stream.
- Parameters:
data – Data to decode.
decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Decoded data.
- class pyhanko.pdf_utils.filters.FlateDecode
Bases:
Decoder
Implementation of the /FlateDecode filter.
Warning
Currently not all predictor values are supported. This may cause problems when extracting image data from PDF files.
- decode(data: bytes, decode_params)
Decode a stream.
- Parameters:
data – Data to decode.
decode_params – Decoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Decoded data.
- encode(data, decode_params=None)
Encode a stream.
- Parameters:
data – Data to encode.
decode_params – Encoder parameters, sourced from the /DecodeParms entry associated with this filter.
- Returns:
Encoded data.
- pyhanko.pdf_utils.filters.get_generic_decoder(name: str) Decoder
Instantiate a specific stream filter decoder type by (PDF) name.
The following names are recognised:
- /FlateDecode or /Fl for the decoder implementing Flate compression.
- /ASCIIHexDecode or /AHx for the decoder that converts bytes to their hexadecimal representations.
- /ASCII85Decode or /A85 for the decoder that converts byte strings to a base-85 textual representation.
Warning
/Crypt is a special case because it requires access to the document’s security handler.
Warning
LZW compression is currently unsupported, as are most compression methods that are used specifically for image data.
- Parameters:
name – Name of the decoder to instantiate.
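The name-based lookup, including the abbreviated filter names, can be sketched with stdlib codecs alone. This is not pyHanko's implementation: CODECS, ABBREVIATIONS and lookup_codec are hypothetical names, and the sketch ignores details such as embedded whitespace and odd-length hexadecimal input.

```python
import base64
import binascii
import zlib

# Hypothetical registry: PDF filter name -> (encode, decode) callables.
# ASCII85 and ASCIIHex data in PDF ends with an end-of-data marker
# ("~>" and ">" respectively), which the stdlib codecs do not add.
CODECS = {
    '/FlateDecode': (zlib.compress, zlib.decompress),
    '/ASCIIHexDecode': (
        lambda d: binascii.hexlify(d) + b'>',
        lambda d: binascii.unhexlify(d.rstrip(b'>')),
    ),
    '/ASCII85Decode': (
        lambda d: base64.a85encode(d) + b'~>',
        lambda d: base64.a85decode(d[:-2] if d.endswith(b'~>') else d),
    ),
}
ABBREVIATIONS = {'/Fl': '/FlateDecode', '/AHx': '/ASCIIHexDecode',
                 '/A85': '/ASCII85Decode'}


def lookup_codec(name: str):
    """Resolve a full or abbreviated filter name to its codec pair."""
    full_name = ABBREVIATIONS.get(name, name)
    try:
        return CODECS[full_name]
    except KeyError:
        raise ValueError(f"unsupported filter: {name}") from None
```

Unsupported names such as /LZWDecode raise ValueError in this sketch; how the real function reports them is not stated above.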
pyhanko.pdf_utils.generic module
Implementation of PDF object types and other generic functionality. The internals were imported from PyPDF2, with modifications.
See here for the original license of the PyPDF2 project.
- class pyhanko.pdf_utils.generic.Dereferenceable
Bases:
object
Represents an opaque reference to a PDF object associated with a PDF Handler (see PdfHandler).
This can either be a reference to an object with an object ID (see Reference) or a reference to the trailer of a PDF document (see TrailerReference).
- get_object() PdfObject
Retrieve the PDF object backing this dereferenceable.
- Returns:
A
PdfObject
.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns:
a
PdfHandler
.
- class pyhanko.pdf_utils.generic.Reference(idnum: int, generation: int = 0, pdf: object = None)
Bases:
Dereferenceable
A reference to an object with a certain ID and generation number, with a PDF handler attached to it.
Warning
Contrary to what one might expect, the generation number does not indicate the document revision in which the object was modified. In fact, nonzero generation numbers are exceedingly rare these days; in most real-world PDF files, objects are simply overridden without ever increasing the generation number.
Except in very specific circumstances, dereferencing a Reference will return the most recent version of the object with the stated object ID and generation number.
- idnum: int
The object’s ID.
- generation: int = 0
The object’s generation number (usually 0)
- pdf: object = None
The PDF handler associated with this reference, an instance of PdfHandler.
Warning
This field is ignored when hashing or comparing Reference objects, so it is the API user’s responsibility to not mix up references originating from unrelated PDF handlers.
- get_object() PdfObject
Retrieve the PDF object backing this dereferenceable.
- Returns:
A
PdfObject
.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns:
a
PdfHandler
.
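The warning about the pdf field being ignored in comparisons can be illustrated with a plain dataclass. MiniReference below is a hypothetical stand-in that reproduces only the documented fields and their defaults; it makes no claim about how Reference itself is implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MiniReference:
    """Hypothetical stand-in: `pdf` is excluded from equality and hashing."""
    idnum: int
    generation: int = 0
    pdf: object = field(default=None, compare=False, hash=False, repr=False)


handler_a, handler_b = object(), object()
ref_a = MiniReference(4, 0, handler_a)
ref_b = MiniReference(4, 0, handler_b)

# Same ID and generation, different handlers: the two compare equal and
# collapse into a single entry when used as set members or dict keys.
same = ref_a == ref_b
unique = {ref_a, ref_b}
```

This is why references taken from unrelated documents can silently collide when used as dictionary keys.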
- class pyhanko.pdf_utils.generic.TrailerReference(reader)
Bases:
Dereferenceable
A reference to the trailer of a PDF document.
Warning
Since the trailer does not have a well-defined object ID in files with “classical” cross-reference tables (as opposed to cross-reference streams), this is not a subclass of Reference.
- Parameters:
reader – a PdfFileReader
- get_object() PdfObject
Retrieve the PDF object backing this dereferenceable.
- Returns:
A
PdfObject
.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns:
a
PdfHandler
.
- class pyhanko.pdf_utils.generic.PdfObject
Bases:
object
Superclass for all PDF objects.
- container_ref: Dereferenceable | None = None
For objects read from a file, container_ref points to the unique addressable object containing this object.
Note
Consider the following object definition in a PDF file:
4 0 obj << /Foo (Bar) >>
This declares a dictionary with ID 4, but the values /Foo and (Bar) are also PDF objects (a name and a string, respectively). All of these will have container_ref given by a Reference with object ID 4 and generation number 0.
If an object is part of the trailer of a PDF file, container_ref will be a TrailerReference. For newly created objects (i.e. those not read from a file), container_ref is always None.
- get_container_ref() Dereferenceable
Return a reference to the closest parent object containing this object. Raises an error if no such reference can be found.
- get_object()
Resolves indirect references.
- Returns:
self, unless an instance of
IndirectObject
.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref: Reference | None = None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.IndirectObject(idnum, generation, pdf)
Bases:
PdfObject, Dereferenceable
Thin wrapper around a Reference, implementing both the Dereferenceable and PdfObject interfaces.
Warning
For many purposes, this class is functionally interchangeable with Reference, with one important exception: IndirectObject instances pointing to the same reference but occurring at different locations in the file may have distinct container_ref values.
- get_object()
- Returns:
The PDF object this reference points to.
- get_pdf_handler()
Return the PDF handler associated with this dereferenceable.
- Returns:
a
PdfHandler
.
- property idnum: int
- Returns:
the object ID of this reference.
- property generation
- Returns:
the generation number of this reference.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: Dereferenceable)
- class pyhanko.pdf_utils.generic.NullObject
Bases:
PdfObject
PDF null object.
All instances are treated as equal and falsy.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.BooleanObject(value)
Bases:
PdfObject
PDF boolean value.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.FloatObject(value='0')
Bases:
Decimal, PdfObject
PDF Float object.
Internally, these are treated as decimals (and therefore actually fixed-point objects, to be precise).
- as_numeric()
- Returns:
a Python
float
value for this object.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
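A short stdlib-only sketch of why a decimal-backed representation is convenient for PDF real numbers: the digits are kept as written and arithmetic avoids binary rounding artefacts. MiniFloat is a hypothetical stand-in, not the FloatObject class itself.

```python
from decimal import Decimal


class MiniFloat(Decimal):
    """Hypothetical stand-in for a Decimal-backed PDF real number."""

    def as_numeric(self) -> float:
        # Convert to a native float only when explicitly requested.
        return float(self)


value = MiniFloat('0.10')
total = MiniFloat('0.1') + MiniFloat('0.2')  # exact decimal arithmetic
```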
- class pyhanko.pdf_utils.generic.NumberObject(value)
Bases:
int, PdfObject
PDF number object. This is the PDF type for integer values.
- NumberPattern = re.compile(b'[^+-.0-9]')
- ByteDot = b'.'
- as_numeric()
- Returns:
a Python
int
value for this object.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.ByteStringObject
Bases:
bytes, PdfObject
PDF bytestring class.
- property original_bytes
For compatibility with
TextStringObject.original_bytes
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.TextStringObject
Bases:
str, PdfObject
PDF text string object.
- autodetected_encoding: TextStringEncoding | None = None
Autodetected encoding when parsing the file.
- force_output_encoding: TextStringEncoding | None = None
Output encoding to use when serialising the string. The default is to try PDFDocEncoding first, and fall back to UTF-16BE.
- property original_bytes
Retrieve the original bytes of the string as specified in the source file.
This may be necessary if this string was misidentified as a text string.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- class pyhanko.pdf_utils.generic.NameObject
Bases:
str, PdfObject
PDF name object. These are valid Python strings, but names and strings are treated differently in the PDF specification, so proper care is required.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream)
- class pyhanko.pdf_utils.generic.ArrayObject(iterable=(), /)
Bases:
list, PdfObject
PDF array object. This class extends from Python’s list class, and supports its interface.
Warning
Contrary to the case of dictionary objects, PyPDF2 does not transparently dereference array entries when accessed using __getitem__(). For usability & consistency reasons, I decided to depart from that and dereference automatically. This makes the behaviour of ArrayObject consistent with DictionaryObject.
That said, some vestiges of the old PyPDF2 behaviour may linger in the codebase. I’ll fix those as I get to them.
- raw_get(index, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)
Changed in version 0.14.0: decrypt parameter is no longer boolean.
Get a value from an array without dereferencing. In other words, if the value at the given index is of type IndirectObject, the indirect reference will not be resolved.
- Parameters:
index – Index to look up in the array.
decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.
- Returns:
A PdfObject.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref)
- class pyhanko.pdf_utils.generic.DictionaryObject(dict_data=None)
Bases:
dict
,PdfObject
A PDF dictionary object.
Keys in a PDF dictionary are PDF names, and values are PDF objects.
When accessing a key using the standard __getitem__() syntax, IndirectObject references will be resolved.
- raw_get(key: NameObject | str, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT)
Changed in version 0.14.0: decrypt parameter is no longer boolean.
Get a value from a dictionary without dereferencing. In other words, if the value corresponding to the given key is of type IndirectObject, the indirect reference will not be resolved.
- Parameters:
key – Key to look up in the dictionary.
decrypt – What to do when retrieving encrypted objects; see EncryptedObjAccess. The default is EncryptedObjAccess.TRANSPARENT.
- Returns:
A PdfObject.
- setdefault(key, value=None)
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- static read_from_stream(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False)
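The difference between __getitem__() access and raw_get() on DictionaryObject can be illustrated with a small stand-alone sketch. The classes below (ToyReference, ToyDictionary) are hypothetical stand-ins written for this illustration only; they are not pyHanko's classes and ignore encryption entirely.

```python
class ToyReference:
    """Hypothetical stand-in for an indirect reference (not pyHanko's class)."""

    def __init__(self, store, idnum):
        self.store = store
        self.idnum = idnum

    def get_object(self):
        # Resolve the reference by looking up the target in the object store.
        return self.store[self.idnum]


class ToyDictionary(dict):
    """Toy dictionary that resolves references on item access."""

    def __getitem__(self, key):
        value = super().__getitem__(key)
        if isinstance(value, ToyReference):
            return value.get_object()
        return value

    def raw_get(self, key):
        # Return the stored value as-is, without resolving references.
        return super().__getitem__(key)


objects = {1: "page tree"}
catalog = ToyDictionary({"/Pages": ToyReference(objects, 1)})

resolved = catalog["/Pages"]      # the referenced object
raw = catalog.raw_get("/Pages")   # the reference itself
```

In this toy model, raw access is what one would use to copy or compare references without touching the objects they point to.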
- class pyhanko.pdf_utils.generic.StreamObject(dict_data: dict | None = None, stream_data: bytes | None = None, encoded_data: bytes | None = None, handler: SecurityHandler | None = None)
Bases:
DictionaryObject
PDF stream object.
Essentially, a PDF stream is a dictionary object with a binary blob of data attached. This data can be encoded by various filters (not all of which are currently supported, see filters).
A stream object can be initialised with encoded or decoded data. The former is used by reader.PdfFileReader to provide on-demand decoding, with writer.BasePdfFileWriter and its subclasses working the other way around.
Note
The StreamObject class manages some of its dictionary keys by itself. This is partly the case for the various /Filter and /DecodeParms entries, but also for the /Length entry. The latter will be overwritten as necessary.
- Parameters:
dict_data – The dictionary data for this stream object.
stream_data – The (unencoded) stream data.
encoded_data –
The encoded stream data.
Warning
Ordinarily, a stream can be initialised either from decoded or from encoded data.
If both stream_data and encoded_data are provided, the caller is responsible for making sure that both are compatible given the currently relevant filter configuration.
handler – A reference to the currently active pyhanko.pdf_utils.crypt.SecurityHandler. This is only necessary if the stream requires crypt filters.
- add_crypt_filter(name='/Identity', params=None, handler: SecurityHandler | None = None)
- strip_filters()
Ensure the stream is decoded, and remove any filters.
- property data: bytes
Return the decoded stream data as bytes. If the stream hasn’t been decoded yet, it will be decoded on-the-fly.
- Raises:
.misc.PdfStreamError – If the stream could not be decoded.
- property encoded_data: bytes
Return the encoded stream data as bytes. If the stream hasn’t been encoded yet, it will be encoded on-the-fly.
- Raises:
.misc.PdfStreamError – If the stream could not be encoded.
- apply_filter(filter_name, params=None, allow_duplicates: bool | None = True)
Apply a new filter to this stream. This filter will be prepended to any existing filters. This means that it is placed last in the encoding order, but first in the decoding order.
Note: Calling this method on an encoded stream will first cause the stream to be decoded using the filters already present. The cached value for the encoded stream data will be cleared.
- Parameters:
filter_name – Name of the filter (see DECODERS)
params – Parameters to the filter (will be written to /DecodeParms if not None)
allow_duplicates – If None, silently ignore duplicate filters. If False, raise ValueError when attempting to add a duplicate filter. If True (default), duplicate filters are allowed.
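The ordering rule for apply_filter() (a prepended filter is applied last when encoding and first when decoding) can be sketched without pyHanko. The tables TOY_ENCODERS and TOY_DECODERS and the helper functions below are assumptions made for this illustration; they only mimic two standard PDF filters using the Python standard library.

```python
import binascii
import zlib

# Toy encoder/decoder tables keyed by PDF filter name (illustrative subset).
TOY_ENCODERS = {
    "/FlateDecode": zlib.compress,
    "/ASCIIHexDecode": lambda data: binascii.hexlify(data) + b">",
}
TOY_DECODERS = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": lambda data: binascii.unhexlify(data.rstrip(b">")),
}


def encode(data: bytes, filters: list) -> bytes:
    # Encoding walks the filter list back to front.
    for name in reversed(filters):
        data = TOY_ENCODERS[name](data)
    return data


def decode(data: bytes, filters: list) -> bytes:
    # Decoding walks the filter list front to back.
    for name in filters:
        data = TOY_DECODERS[name](data)
    return data


filters = ["/FlateDecode"]
# Prepend a filter: it is applied last when encoding, first when decoding.
filters.insert(0, "/ASCIIHexDecode")

payload = b"hello " * 10
encoded = encode(payload, filters)
decoded = decode(encoded, filters)
```

Here the payload is compressed first and hex-encoded second, so the bytes on disk are plain ASCII hex digits terminated by the end-of-data marker.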
- compress()
Convenience method to add a /FlateDecode filter with default settings, if one is not already present.
Note: compression is not actually applied until the stream is written.
- property is_embedded_file_stream
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- pyhanko.pdf_utils.generic.read_object(stream, container_ref: Dereferenceable, as_metadata_stream: bool = False) PdfObject
Read a PDF object from an input stream.
Note
The container_ref parameter tells the API which reference to register when the returned object is modified in an incremental update. See also here for further information.
- Parameters:
stream – An input stream.
container_ref –
A reference to an object containing this one.
Note: It is perfectly possible (and common) for container_ref to resolve to the return value of this function.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PdfObject.
- pyhanko.pdf_utils.generic.pdf_name
alias of
NameObject
- pyhanko.pdf_utils.generic.pdf_string(string: str | bytes | bytearray) ByteStringObject | TextStringObject
Encode a string as a TextStringObject if possible, or a ByteStringObject otherwise.
- Parameters:
string – A Python string.
- pyhanko.pdf_utils.generic.pdf_date(dt: datetime) TextStringObject
Convert a datetime object into a PDF string. This function supports both timezone-aware and naive datetime objects.
- Parameters:
dt – The datetime object to convert.
- Returns:
A TextStringObject representing the datetime passed in.
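For orientation, the following sketch formats a datetime in the common PDF date layout D:YYYYMMDDHHmmSS followed by a UTC offset. The function format_pdf_date is a hypothetical helper written for this illustration; it makes no claim about the exact string pdf_date() produces (for instance, whether the trailing apostrophe is emitted differs between versions of the PDF specification).

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt: datetime) -> str:
    """Format a datetime as a PDF-style date string (illustrative only)."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: no timezone information is appended.
        return base
    total_minutes = int(offset.total_seconds()) // 60
    sign = "+" if total_minutes >= 0 else "-"
    hours, minutes = divmod(abs(total_minutes), 60)
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"


naive = format_pdf_date(datetime(2024, 1, 2, 3, 4, 5))
aware = format_pdf_date(
    datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone(timedelta(hours=-5)))
)
```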
- class pyhanko.pdf_utils.generic.TextStringEncoding(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Encodings for PDF text strings.
- PDF_DOC = None
PDFDocEncoding (one-byte character codes; PDF-specific).
- UTF16BE = (b'\xfe\xff', 'utf-16be')
UTF-16BE encoding.
- UTF8 = (b'\xef\xbb\xbf', 'utf-8')
UTF-8 encoding (PDF 2.0)
- UTF16LE = (b'\xff\xfe', 'utf-16le')
UTF-16LE encoding.
Note
This is strictly speaking invalid in PDF 2.0, but some authoring tools output such strings anyway (presumably due to the fact that it’s the default wide character encoding on Windows).
- encode(string: str) bytes
Encode a string with BOM.
- Parameters:
string – The string to encode.
- Returns:
The encoded string.
- decode(string: bytes | bytearray) str
Decode a string with BOM.
- Parameters:
string – The string to decode.
- Returns:
The decoded string.
- Raises:
UnicodeDecodeError – Raised if decoding fails.
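The byte order marks listed for TextStringEncoding suggest how an encoding can be recognised from the first bytes of a string. The sketch below shows that idea in isolation; decode_text_string and encode_utf16be are hypothetical helpers, and Latin-1 is used as a rough stand-in for PDFDocEncoding (the two differ for some code points), so this is not how pyHanko itself decodes strings.

```python
# (BOM, codec) pairs matching the enum values documented above.
BOMS = (
    (b"\xfe\xff", "utf-16be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
)


def decode_text_string(raw: bytes) -> str:
    """Pick a codec from the BOM, then decode the remainder."""
    for bom, codec in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec)
    # No BOM found. Assumption: Latin-1 approximates PDFDocEncoding here.
    return raw.decode("latin-1")


def encode_utf16be(text: str) -> bytes:
    """Encode with a leading UTF-16BE byte order mark."""
    return b"\xfe\xff" + text.encode("utf-16be")


round_trip = decode_text_string(encode_utf16be("héllo"))
```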
- class pyhanko.pdf_utils.generic.EncryptedObjAccess(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Defines what to do when an encrypted object is encountered when retrieving an object from a container.
- PROXY = 0
Return the proxy object as-is, and leave further encryption/decryption handling to the caller.
- TRANSPARENT = 1
Transparently decrypt the proxy’s content (similarly wrapping any sub-containers in DecryptedObjectProxy, so this applies recursively).
Note
This is the default in most situations, since it’s the least likely to get in the way of any APIs that are not explicitly aware of content encryption concerns.
- RAW = 2
Return the underlying raw object as written, without attempting or deferring decryption.
- class pyhanko.pdf_utils.generic.DecryptedObjectProxy(raw_object: PdfObject, handler)
Bases:
PdfObject
Internal proxy class that allows transparent on-demand decryption of objects.
Warning
Most public-facing APIs won’t leave you to deal with these directly (that’s half the reason this class exists in the first place), and the API of this class is considered internal.
However, for reasons related to the historical PyPDF2 codebase from which pyHanko’s object handling code ultimately derives, there are some Python builtins that might cause these wrapper objects to inadvertently “leak”. Please tell us about such cases so we can make those types of access more convenient and robust.
Danger
The __eq__ implementation on this class is not safe for general use, due to the fact that certain structures in PDF are exempt from encryption. Only compare proxy objects with == in areas of the document where these exemptions don’t apply.
- Parameters:
raw_object – A raw object, typically as-parsed from a PDF file.
handler – The security handler governing this object.
- property decrypted: PdfObject
The decrypted PDF object exposed as a property.
If this object is a container object, its constituent parts will be wrapped in DecryptedObjectProxy as well, in order to defer further decryption until the values are requested through a getter method on the container.
- write_to_stream(stream, handler: SecurityHandler | None = None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
- get_object()
Resolves indirect references.
- Returns:
self, unless an instance of IndirectObject.
- property container_ref
pyhanko.pdf_utils.images module
Utilities for embedding bitmap image data into PDF files.
The image data handling is done by Pillow.
Note
As elsewhere, we only support a subset of what the PDF standard provides for. Most RGB and grayscale images (with or without transparency) that can be read by PIL/Pillow can be used without issue. PNG images with an indexed palette backed by one of these colour spaces can also be used.
Currently there is no support for CMYK images or (direct) support for embedding JPEG-encoded image data as such, but these features may be added later.
- pyhanko.pdf_utils.images.pil_image(img: Image, writer: BasePdfFileWriter)
This function writes a PIL/Pillow Image object to a PDF file writer, as an image XObject.
- Parameters:
img – A Pillow Image object
writer – A PDF file writer
- Returns:
A reference to the image XObject written.
- class pyhanko.pdf_utils.images.PdfImage(image: Image | str, writer: BasePdfFileWriter | None = None, resources: PdfResources | None = None, name: str | None = None, opacity=None, box: BoxConstraints | None = None)
Bases:
PdfContent
Wrapper class that implements the PdfContent interface for image objects.
Note
Instances of this class are reusable, in the sense that the implementation is aware of changes to the associated writer object. This allows the same image to be embedded into multiple files without instantiating a new PdfImage every time.
- property image_ref: IndirectObject
Return a reference to the image XObject associated with this PdfImage instance. If no such reference is available, it will be created using pil_image(), and the result will be cached until the writer attribute changes (see set_writer()).
- Returns:
An indirect reference to an image XObject.
- render() bytes
Compile the content to graphics operators.
pyhanko.pdf_utils.incremental_writer module
Utility for writing incremental updates to existing PDF files.
- class pyhanko.pdf_utils.incremental_writer.IncrementalPdfFileWriter(input_stream, prev: PdfFileReader | None = None, strict=True)
Bases:
BasePdfFileWriter
Class to incrementally update existing files.
This BasePdfFileWriter subclass encapsulates a PdfFileReader instance in addition to exposing an interface to add and modify PDF objects.
Incremental updates to a PDF file append modifications to the end of the file. This is critical when the original file contents are not to be modified directly (e.g. when it contains digital signatures). It has the additional advantage of providing an automatic audit trail of sorts.
- Parameters:
input_stream – Input stream to read current revision from.
strict – Ingest the source file in strict mode. The default is True.
prev – Explicitly pass in a PDF reader. This parameter is internal API.
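The reason incremental updates are safe for signed files is that the bytes of earlier revisions are never touched. The sketch below demonstrates only that append-only property, using placeholder byte strings rather than real PDF content; it does not use IncrementalPdfFileWriter and the variable names are illustrative.

```python
import hashlib
import io

# Placeholder content standing in for an existing revision and an update.
original = b"%PDF-1.7\n...original revision...\n%%EOF\n"
update = b"...new objects, xref section and trailer...\n%%EOF\n"

buf = io.BytesIO(original)
buf.seek(0, io.SEEK_END)   # position the cursor at the end of the old data
buf.write(update)          # append the new revision
updated = buf.getvalue()

# A digest over the original byte range is unaffected by the update.
digest_before = hashlib.sha256(original).hexdigest()
digest_prefix = hashlib.sha256(updated[: len(original)]).hexdigest()
```

Because the digest over the original range is unchanged, anything that was computed over those bytes earlier (such as a signature digest) still matches after the update.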
- IO_CHUNK_SIZE = 4096
- classmethod from_reader(reader: PdfFileReader) IncrementalPdfFileWriter
Instantiate an incremental writer from a PDF file reader.
- Parameters:
reader – A
PdfFileReader
object with a PDF to extend.
- ensure_output_version(version)
- get_object(ido, as_metadata_stream: bool = False)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters:
ref – An instance of generic.Reference.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PDF object.
- mark_update(obj_ref: Reference | IndirectObject)
Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.
- Parameters:
obj_ref – An indirect object instance or a reference.
- update_container(obj: PdfObject)
Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.
As with mark_update(), this only applies to incremental updates, but defaults to a no-op.
- Parameters:
obj – The object whose top-level container needs to be rewritten.
- update_root()
Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.
- set_info(info: IndirectObject | DictionaryObject | None)
Set the /Info entry of the document trailer.
- Parameters:
info – The new /Info dictionary, as an indirect reference.
- set_custom_trailer_entry(key: NameObject, value: PdfObject)
Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.
Warning
Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.
- Parameters:
key – Dictionary key to use in the trailer.
value – Value to set
- write(stream)
Write the contents of this PDF writer to a stream.
- Parameters:
stream – A writable output stream.
- property document_meta_view: DocumentMetadata
- write_in_place()
Write the updated file contents in-place to the same stream as the input stream. This obviously requires a stream supporting both reading and writing operations.
- encrypt(user_pwd)
Method to handle updates to encrypted files.
This method handles decrypting of the original file, and makes sure the resulting updated file is encrypted in a compatible way. The standard mandates that updates to encrypted files be effected using the same encryption settings. In particular, incremental updates cannot remove file encryption.
- Parameters:
user_pwd – The original file’s user password.
- Raises:
PdfReadError – Raised when there is a problem decrypting the file.
- encrypt_pubkey(credential: EnvelopeKeyDecrypter)
Method to handle updates to files encrypted using public-key encryption.
The same caveats as encrypt() apply here.
- Parameters:
credential – The EnvelopeKeyDecrypter handling the recipient’s private key.
- Raises:
PdfReadError – Raised when there is a problem decrypting the file.
pyhanko.pdf_utils.layout module
Layout utilities (to be expanded)
- exception pyhanko.pdf_utils.layout.LayoutError(msg: str, *args)
Bases:
ValueError
Indicates an error in a layout computation.
- exception pyhanko.pdf_utils.layout.BoxSpecificationError(msg: str | None = None)
Bases:
LayoutError
Raised when a box constraint is over/underspecified.
- class pyhanko.pdf_utils.layout.BoxConstraints(width: int | float | None = None, height: int | float | None = None, aspect_ratio: Fraction | None = None)
Bases:
object
Represents a box of potentially variable width and height. Among other uses, this can be leveraged to produce a variably sized box with a fixed aspect ratio.
If width/height are not defined yet, they can be set by assigning to the width and height attributes.
- property width: int
- Returns:
The width of the box.
- Raises:
BoxSpecificationError – if the box’s width could not be determined.
- property width_defined: bool
- Returns:
True if the box currently has a well-defined width, False otherwise.
- property height: int
- Returns:
The height of the box.
- Raises:
BoxSpecificationError – if the box’s height could not be determined.
- property height_defined: bool
- Returns:
True if the box currently has a well-defined height, False otherwise.
- property aspect_ratio: Fraction
- Returns:
The aspect ratio of the box.
- Raises:
BoxSpecificationError – if the box’s aspect ratio could not be determined.
- property aspect_ratio_defined: bool
- Returns:
True if the box currently has a well-defined aspect ratio, False otherwise.
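The idea of a box whose missing dimension follows from a fixed aspect ratio can be sketched as follows. ToyBox is a hypothetical class written for this illustration; it assumes the aspect ratio means width divided by height, raises a plain ValueError where BoxConstraints would raise BoxSpecificationError, and does not reproduce the real class's validation or setters.

```python
from fractions import Fraction
from typing import Optional


class ToyBox:
    """Toy box whose missing dimension is derived from an aspect ratio."""

    def __init__(self, width=None, height=None,
                 aspect_ratio: Optional[Fraction] = None):
        self._width = width
        self._height = height
        self._ar = aspect_ratio  # assumption: width / height

    @property
    def width(self):
        if self._width is not None:
            return self._width
        if self._height is not None and self._ar is not None:
            return int(self._height * self._ar)
        raise ValueError("width is underspecified")

    @property
    def height(self):
        if self._height is not None:
            return self._height
        if self._width is not None and self._ar is not None:
            return int(self._width / self._ar)
        raise ValueError("height is underspecified")


# Only the width is fixed; the height follows from the 16:9 ratio.
box = ToyBox(width=160, aspect_ratio=Fraction(16, 9))
derived_height = box.height
```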
- class pyhanko.pdf_utils.layout.AxisAlignment(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Class representing one-dimensional alignment along an axis.
- ALIGN_MIN = 1
Align maximally towards the negative end of the axis.
- ALIGN_MID = 2
Center content along the axis.
- ALIGN_MAX = 3
Align maximally towards the positive end of the axis.
- classmethod from_x_align(align_str: str) AxisAlignment
Convert from a horizontal alignment config string.
- Parameters:
align_str – A string: ‘left’, ‘mid’ or ‘right’.
- Returns:
An AxisAlignment value.
- Raises:
ConfigurationError – on unexpected string inputs.
- classmethod from_y_align(align_str: str) AxisAlignment
Convert from a vertical alignment config string.
- Parameters:
align_str – A string: ‘bottom’, ‘mid’ or ‘top’.
- Returns:
An AxisAlignment value.
- Raises:
ConfigurationError – on unexpected string inputs.
- property flipped
- align(container_len: int, inner_len: int, pre_margin, post_margin) int
- class pyhanko.pdf_utils.layout.Margins(left: int = 0, right: int = 0, top: int = 0, bottom: int = 0)
Bases:
ConfigurableMixin
Class describing a set of margins.
- left: int = 0
- right: int = 0
- top: int = 0
- bottom: int = 0
- classmethod uniform(num)
Return a set of uniform margins.
- Parameters:
num – The uniform margin to apply to all four sides.
- Returns:
Margins(num, num, num, num)
- static effective(dim_name, container_len, pre, post)
Internal helper method to compute effective margins.
- effective_width(width)
Compute width without margins.
- Parameters:
width – The container width.
- Returns:
The width after subtracting the left and right margins.
- Raises:
LayoutError – if the container width is too short to accommodate the margins.
- effective_height(height)
Compute height without margins.
- Parameters:
height – The container height.
- Returns:
The height after subtracting the top and bottom margins.
- Raises:
LayoutError – if the container height is too short to accommodate the margins.
- classmethod from_config(config_dict)
Attempt to instantiate an object of the class on which it is called, by means of the configuration settings passed in.
First, we check that the keys supplied in the dictionary correspond to data fields on the current class. Then, the dictionary is processed using the process_entries() method. The resulting dictionary is passed to the initialiser of the current class as a kwargs dict.
- Parameters:
config_dict – A dictionary containing configuration values.
- Returns:
An instance of the class on which it is called.
- Raises:
ConfigurationError – when an unexpected configuration key is encountered or left unfilled, or when there is a problem processing one of the config values.
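The margin arithmetic described for uniform(), effective_width() and effective_height() amounts to subtracting the two opposing margins from the container length. The sketch below restates that with a hypothetical ToyMargins dataclass; the error type and the exact condition under which the real class raises LayoutError are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMargins:
    """Toy margins record (not pyHanko's Margins class)."""

    left: int = 0
    right: int = 0
    top: int = 0
    bottom: int = 0

    @classmethod
    def uniform(cls, num):
        # Same margin on all four sides.
        return cls(num, num, num, num)

    @staticmethod
    def _effective(container_len, pre, post):
        result = container_len - pre - post
        if result < 0:
            # Assumption: a negative remainder is treated as an error.
            raise ValueError("container too short for the margins")
        return result

    def effective_width(self, width):
        return self._effective(width, self.left, self.right)

    def effective_height(self, height):
        return self._effective(height, self.bottom, self.top)


inner_width = ToyMargins.uniform(10).effective_width(100)
inner_height = ToyMargins(top=5, bottom=15).effective_height(50)
```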
- class pyhanko.pdf_utils.layout.InnerScaling(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Class representing a scaling convention.
- NO_SCALING = 1
Never scale content.
- STRETCH_FILL = 2
Scale content to fill the entire container.
- STRETCH_TO_FIT = 3
Scale content while preserving aspect ratio until either the maximal width or maximal height is reached.
- SHRINK_TO_FIT = 4
Scale content down to fit in the container, while preserving the original aspect ratio.
- classmethod from_config(config_str: str) InnerScaling
Convert from a configuration string.
- Parameters:
config_str – A string: ‘none’, ‘stretch-fill’, ‘stretch-to-fit’, ‘shrink-to-fit’
- Returns:
An InnerScaling value.
- Raises:
ConfigurationError – on unexpected string inputs.
- class pyhanko.pdf_utils.layout.SimpleBoxLayoutRule(x_align: AxisAlignment, y_align: AxisAlignment, margins: Margins = Margins(left=0, right=0, top=0, bottom=0), inner_content_scaling: InnerScaling = InnerScaling.SHRINK_TO_FIT)
Bases:
ConfigurableMixin
Class describing alignment, scaling and margin rules for a box positioned inside another box.
- x_align: AxisAlignment
Horizontal alignment settings.
- y_align: AxisAlignment
Vertical alignment settings.
- margins: Margins = Margins(left=0, right=0, top=0, bottom=0)
Container (inner) margins. Defaults to all zeroes.
- inner_content_scaling: InnerScaling = 4
Inner content scaling rule.
- classmethod process_entries(config_dict)
Hook method that can modify the configuration dictionary to overwrite or tweak some of its values (e.g. to convert string parameters into more complex Python objects).
Subclasses that override this method should call super().process_entries(), and leave keys that they do not recognise untouched.
- Parameters:
config_dict – A dictionary containing configuration values.
- Raises:
ConfigurationError – when there is a problem processing a relevant entry.
- substitute_margins(new_margins: Margins) SimpleBoxLayoutRule
- fit(container_box: BoxConstraints, inner_nat_width: int, inner_nat_height: int) Positioning
Position and possibly scale a box within a container, according to this layout rule.
- Parameters:
container_box – BoxConstraints describing the container.
inner_nat_width – The inner box’s natural width.
inner_nat_height – The inner box’s natural height.
- Returns:
A Positioning describing the scaling & position of the lower left corner of the inner box.
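To make the scaling conventions concrete, the sketch below computes a uniform scale factor and a centred position for an inner box, covering the "shrink to fit" and "stretch to fit" cases. fit_centered is a hypothetical function for illustration; it ignores margins and the other alignment options and is not the algorithm used by SimpleBoxLayoutRule.fit().

```python
def fit_centered(container_w, container_h, inner_w, inner_h,
                 allow_upscale=False):
    """Return (x_pos, y_pos, scale) for a centred, aspect-preserving fit."""
    # Largest uniform scale at which the inner box still fits.
    scale = min(container_w / inner_w, container_h / inner_h)
    if not allow_upscale:
        # Shrink-to-fit: never enlarge content beyond its natural size.
        scale = min(scale, 1.0)
    # Centre the scaled box along both axes.
    x_pos = (container_w - inner_w * scale) / 2
    y_pos = (container_h - inner_h * scale) / 2
    return x_pos, y_pos, scale


shrunk = fit_centered(200, 100, 400, 100)
untouched = fit_centered(200, 100, 50, 50)
stretched = fit_centered(200, 100, 50, 50, allow_upscale=True)
```

The only difference between the two modes in this sketch is the cap at 1.0: content that already fits is left at its natural size when shrinking, and is enlarged when stretching.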
- class pyhanko.pdf_utils.layout.Positioning(x_pos: int, y_pos: int, x_scale: float, y_scale: float)
Bases:
ConfigurableMixin
Class describing the position and scaling of an object in a container.
- x_pos: int
Horizontal coordinate
- y_pos: int
Vertical coordinate
- x_scale: float
Horizontal scaling
- y_scale: float
Vertical scaling
- as_cm()
Convenience method to convert this Positioning into a PDF cm operator.
- Returns:
A byte string representing the cm operator corresponding to this Positioning.
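In PDF content streams, the cm operator takes six numbers a b c d e f; a pure scale-and-translate transformation uses a and d for the scale factors and e and f for the offset. The sketch below builds such an operator from position and scale values. Both helpers are hypothetical, and the number formatting is an assumption, so the output need not match as_cm() byte for byte.

```python
def _num(value: float) -> bytes:
    # Fixed-point notation, since PDF numbers do not allow exponents.
    text = f"{value:.4f}".rstrip("0").rstrip(".")
    return text.encode("ascii")


def positioning_to_cm(x_pos, y_pos, x_scale, y_scale) -> bytes:
    """Build a scale-and-translate cm operator (illustrative only)."""
    parts = [_num(x_scale), b"0", b"0", _num(y_scale),
             _num(x_pos), _num(y_pos), b"cm"]
    return b" ".join(parts)


operator = positioning_to_cm(0, 25, 0.5, 0.5)
```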
pyhanko.pdf_utils.misc module
Utility functions for PDF library. Taken from PyPDF2 with modifications and additions, see here for the original license of the PyPDF2 project.
Generally, all of these constitute internal API, except for the exception classes.
- exception pyhanko.pdf_utils.misc.PdfError(msg: str, *args)
Bases:
Exception
- exception pyhanko.pdf_utils.misc.PdfStrictReadError(msg: str, *args)
Bases:
PdfReadError
- exception pyhanko.pdf_utils.misc.PdfStreamError(msg: str, *args)
Bases:
PdfReadError
- exception pyhanko.pdf_utils.misc.IndirectObjectExpected(msg: str | None = None)
Bases:
PdfReadError
- pyhanko.pdf_utils.misc.get_and_apply(dictionary: dict, key, function: Callable, *, default=None)
- class pyhanko.pdf_utils.misc.OrderedEnum(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Ordered enum (from the Python documentation)
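For reference, the ordered-enum recipe from the Python enum documentation looks roughly like this. It is reproduced here as a sketch of the pattern; the Level enum is a made-up example, and no claim is made that pyHanko's class is identical.

```python
from enum import Enum


class OrderedEnum(Enum):
    """Enum whose members compare by value within the same class."""

    def __ge__(self, other):
        if self.__class__ is other.__class__:
            return self.value >= other.value
        return NotImplemented

    def __gt__(self, other):
        if self.__class__ is other.__class__:
            return self.value > other.value
        return NotImplemented

    def __le__(self, other):
        if self.__class__ is other.__class__:
            return self.value <= other.value
        return NotImplemented

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented


class Level(OrderedEnum):
    # Hypothetical members, for demonstration only.
    LOW = 1
    HIGH = 2


ordered = sorted([Level.HIGH, Level.LOW])
```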
- class pyhanko.pdf_utils.misc.StringWithLanguage(value: str, lang_code: str | None = None, country_code: str | None = None)
Bases:
object
A string with a language attached to it.
- value: str
- lang_code: str | None = None
- country_code: str | None = None
- pyhanko.pdf_utils.misc.is_regular_character(byte_value: int)
- pyhanko.pdf_utils.misc.read_non_whitespace(stream, seek_back=False, allow_eof=False)
Finds and reads the next non-whitespace character (ignores whitespace).
- pyhanko.pdf_utils.misc.read_until_whitespace(stream, maxchars: int | None = None) bytes
Reads non-whitespace characters and returns them. Stops upon encountering whitespace, or, if maxchars is not None, when maxchars is reached.
- Parameters:
stream – stream to read
maxchars – maximal number of bytes to read before returning
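The behaviour described for read_until_whitespace() can be approximated with a short byte-by-byte loop. read_token below is a hypothetical re-implementation for illustration; in particular, the set of whitespace bytes and the fact that the terminating whitespace byte is consumed are assumptions of this sketch.

```python
import io
from typing import Optional

# Whitespace bytes in PDF syntax: space, LF, CR, tab, form feed, NUL.
PDF_WHITESPACE = b" \n\r\t\x0c\x00"


def read_token(stream, maxchars: Optional[int] = None) -> bytes:
    """Read bytes until whitespace, EOF, or maxchars bytes were collected."""
    out = bytearray()
    while maxchars is None or len(out) < maxchars:
        ch = stream.read(1)
        if not ch or ch in PDF_WHITESPACE:
            break
        out += ch
    return bytes(out)


first = read_token(io.BytesIO(b"obj endobj"))
limited = read_token(io.BytesIO(b"abcdef"), maxchars=3)
```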
- pyhanko.pdf_utils.misc.read_until_delimiter(stream) bytes
Read until a token delimiter (i.e. a delimiter character or a PDF whitespace character) is encountered, and rewind the stream to the previous character.
- Parameters:
stream – A stream.
- Returns:
The bytes read.
- pyhanko.pdf_utils.misc.read_until_regex(stream, regex, ignore_eof: bool = False)
Reads until the regular expression pattern is matched (the match itself is ignored). Raises PdfStreamError on premature end-of-file.
- Parameters:
stream – stream to search
regex – regex to match
ignore_eof – if true, ignore end-of-file and return immediately
- Raises:
PdfStreamError – on premature EOF
- pyhanko.pdf_utils.misc.skip_over_whitespace(stream, stop_after_eol=False) bool
Similar to read_non_whitespace(), but returns a bool indicating whether more than one whitespace character was read.
Will return the cursor to before the first non-whitespace character encountered, or after the first end-of-line sequence if one is encountered.
- pyhanko.pdf_utils.misc.skip_over_comment(stream) bool
Skip over a comment and position the cursor at the first byte after the EOL sequence following the comment. If there is no comment under the cursor, do nothing.
- Parameters:
stream – stream to read
- Returns:
True if a comment was read.
- pyhanko.pdf_utils.misc.instance_test(cls)
- pyhanko.pdf_utils.misc.peek(itr)
- pyhanko.pdf_utils.misc.assert_writable_and_random_access(output)
Raise an error if the buffer in question is not writable, and return a boolean to indicate whether it supports random-access reading.
- Parameters:
output
- Returns:
- pyhanko.pdf_utils.misc.prepare_rw_output_stream(output)
Prepare an output stream that supports both reading and writing. Intended to be used for writing & updating signed files: when producing a signature, we render the PDF to a byte buffer with placeholder values for the signature data, or straight to the provided output stream if possible.
More precisely: this function will return the original output stream if it is writable, readable and seekable. If the output parameter is None, not readable or not seekable, this function will return a BytesIO instance instead. If the output parameter is not None and not writable, IOError will be raised.
- Parameters:
output – A writable file-like object, or None.
- Returns:
A file-like object that supports reading, writing and seeking.
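The decision rule spelled out above translates almost directly into code. The sketch below is a hypothetical re-statement (pick_rw_stream and WriteOnlySink are names invented here), relying only on the standard readable(), writable() and seekable() methods of Python file objects.

```python
import io


def pick_rw_stream(output):
    """Return output if it can be read, written and seeked; else a BytesIO."""
    if output is None:
        return io.BytesIO()
    if not output.writable():
        raise IOError("output stream is not writable")
    if output.readable() and output.seekable():
        return output
    # Writable, but not readable or not seekable: fall back to a buffer.
    return io.BytesIO()


class WriteOnlySink(io.RawIOBase):
    """Minimal write-only stream used to exercise the fallback branch."""

    def writable(self):
        return True

    def write(self, b):
        return len(b)


buf = io.BytesIO()
same = pick_rw_stream(buf)
sink = WriteOnlySink()
replacement = pick_rw_stream(sink)
```

When a replacement buffer is handed out, its contents still have to be copied to the original stream afterwards, which is the unwrapping step that finalise_output() is described as handling.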
- pyhanko.pdf_utils.misc.finalise_output(orig_output, returned_output)
Several internal APIs transparently replace non-readable/seekable buffers with BytesIO for signing operations, but we don’t want to expose that to the public API user. This internal API function handles the unwrapping.
- pyhanko.pdf_utils.misc.DEFAULT_CHUNK_SIZE = 4096
Default chunk size for stream I/O.
- pyhanko.pdf_utils.misc.chunked_write(temp_buffer: bytearray, stream, output, max_read=None)
- pyhanko.pdf_utils.misc.chunked_digest(temp_buffer: bytearray, stream, md, max_read=None)
- pyhanko.pdf_utils.misc.chunk_stream(temp_buffer: bytearray | memoryview, stream, max_read=None) Iterable[bytearray | memoryview]
- class pyhanko.pdf_utils.misc.ConsList(head: 'Optional[ListElem]', tail: 'Optional[ConsList[ListElem]]' = None)
Bases:
Generic[ListElem]
- head: ListElem | None
- property last: ListElem | None
- class pyhanko.pdf_utils.misc.Singleton(name, bases, dct)
Bases:
type
- pyhanko.pdf_utils.misc.rd(x)
- pyhanko.pdf_utils.misc.isoparse(dt_str: str) datetime
- pyhanko.pdf_utils.misc.lift_iterable_async(i: Iterable[X]) CancelableAsyncIterator[X]
pyhanko.pdf_utils.qr module
- class pyhanko.pdf_utils.qr.PdfStreamQRImage(border, width, box_size, *args, **kwargs)
Bases:
BaseImage
Quick-and-dirty implementation of the Image interface required by the qrcode package.
- kind: str | None = 'PDF'
- allowed_kinds: Tuple[str] | None = ('PDF',)
- qr_color = (0, 0, 0)
- new_image(**kwargs)
Build the image class. Subclasses should return the class created.
- drawrect(row, col)
Draw a single rectangle of the QR code.
- append_single_rect(command_stream, row, col)
- format_qr_color()
- setup_drawing_area()
- render_command_stream()
- save(stream, kind=None)
Save the image file.
- process()
Processes QR code after completion
- drawrect_context(row, col, active, context)
Draw a single rectangle of the QR code given the surrounding context
- class pyhanko.pdf_utils.qr.PdfFancyQRImage(border, width, box_size, *_args, version, center_image: PdfContent | None = None, **kwargs)
Bases:
PdfStreamQRImage
- centerpiece_corner_radius = 0.2
- save(stream, kind=None)
Save the image file.
- process()
Processes QR code after completion
- append_single_rect(command_stream, row, col)
- is_major_position_pattern(row, col)
- is_position_pattern(row, col)
- draw_position_patterns()
- draw_centerpiece()
- setup_drawing_area()
- render_command_stream()
- pyhanko.pdf_utils.qr.rounded_square(x_pos: float, y_pos: float, sz: float, rad: float) List[bytes]
Add a subpath of a square with rounded corners at the given position. Doesn’t include any painting or clipping operations.
The path is drawn counterclockwise.
- Parameters:
x_pos – The x-coordinate of the enveloping square’s lower left corner.
y_pos – The y-coordinate of the enveloping square’s lower left corner.
sz – The side length of the enveloping square.
rad – The corner radius.
- Returns:
A list of graphics operators.
pyhanko.pdf_utils.reader module
Utility to read PDF files. Contains code from the PyPDF2 project; see here for the original license.
The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.
- class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict: bool = True)
Bases:
PdfHandler
Class implementing functionality to read a PDF file and cache certain data about it.
- last_startxref = None
- has_xref_stream = False
- property security_handler
- property document_meta_view: DocumentMetadata
- property input_version
- property encrypt_dict: DictionaryObject | None
- property trailer_view: DictionaryObject
Returns a view of the document trailer of the document represented by this PdfHandler instance.
The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).
- Returns:
A
generic.DictionaryObject
representing the current state of the document trailer.
- property document_id: Tuple[bytes, bytes]
- get_historical_root(revision: int)
Get the document catalog for a specific revision.
- Parameters:
revision – The revision to query, the oldest one being 0.
- Returns:
The value of the document catalog dictionary for that revision.
- property total_revisions: int
- Returns:
The total number of revisions made to this file.
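As a rough illustration of how total_revisions and get_historical_root() fit together, the helper below walks all revisions from oldest to newest. The helper name is hypothetical, and `reader` is only assumed to behave like a PdfFileReader.

```python
def collect_historical_roots(reader):
    """Return the document catalog of every revision, oldest first.

    `reader` is assumed to expose `total_revisions` and
    `get_historical_root()` as documented for PdfFileReader.
    """
    return [
        reader.get_historical_root(revision)
        for revision in range(reader.total_revisions)
    ]


# Possible use with pyHanko installed (the file name is a placeholder):
#   from pyhanko.pdf_utils.reader import PdfFileReader
#   with open("document.pdf", "rb") as inf:
#       roots = collect_historical_roots(PdfFileReader(inf))
```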
- get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True, as_metadata_stream=False)
Read an object from the input stream.
- Parameters:
ref – Reference to the object.
revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.
never_decrypt – Skip decryption step (only needed for parsing /Encrypt).
transparent_decrypt – If True, all encrypted objects are transparently decrypted by default (in the sense that a user of the API in a PyPDF2 compatible way would only “see” decrypted objects). If False, this method may return a proxy object that still allows access to the “original”.
Danger
The encryption parameters are considered internal, undocumented API, and subject to change without notice.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PdfObject.
- Raises:
PdfReadError – Raised if there is an issue reading the object from the file.
- cache_get_indirect_object(generation, idnum)
- cache_indirect_object(generation, idnum, obj)
- decrypt(password: str | bytes) AuthResult
When using an encrypted PDF file with the standard PDF encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.
Both legacy encryption schemes and PDF 2.0 encryption (based on AES-256) are supported.
Danger
Supplying either user or owner password will work. Cryptographically, both allow the decryption key to be computed, but processors are expected to adhere to the /P flags in the encryption dictionary when accessing a file with the user password. Currently, pyHanko does not enforce these restrictions, but it may in the future.
Danger
One should also be aware that the legacy encryption schemes used prior to PDF 2.0 are (very) weak, and we only support them for compatibility reasons. Under no circumstances should these still be used to encrypt new files.
- Parameters:
password – The password to match.
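A small sketch of how decrypt() might be guarded by the encrypted property. The helper name is hypothetical and `reader` is assumed to behave like a PdfFileReader. Interpreting the returned AuthResult (for instance comparing its status against AuthStatus members) is left to the caller.

```python
def unlock_if_needed(reader, password):
    """Call decrypt() only when the document reports itself as encrypted.

    Returns whatever decrypt() returns (an AuthResult in pyHanko),
    or None if the document is not encrypted.
    """
    if not reader.encrypted:
        return None
    return reader.decrypt(password)


# Possible use with pyHanko installed (path and password are placeholders):
#   from pyhanko.pdf_utils.reader import PdfFileReader
#   with open("encrypted.pdf", "rb") as inf:
#       result = unlock_if_needed(PdfFileReader(inf), "secret")
```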
- decrypt_pubkey(credential: EnvelopeKeyDecrypter) AuthResult
Decrypt a PDF file encrypted using public-key encryption by providing a credential representing the private key of one of the recipients.
Danger
The same caveats as in decrypt() w.r.t. permission handling apply to this method.
Danger
The robustness of the public key cipher being used is not the only factor in the security of public-key encryption in PDF. The standard still permits weak schemes to encrypt the actual file data and file keys. PyHanko uses sane defaults everywhere, but other software may not.
- Parameters:
credential – The
EnvelopeKeyDecrypter
handling the recipient’s private key.
- property encrypted
- Returns:
True if a document is encrypted, False otherwise.
- get_historical_resolver(revision: int) HistoricalResolver
Return a PdfHandler instance that provides a view on the file at a specific revision.
- Parameters:
revision – The revision number to use, with 0 being the oldest.
- Returns:
An instance of
HistoricalResolver
.
- property embedded_signatures
- Returns:
The signature objects embedded in this document, in signing order; see
EmbeddedPdfSignature
.
- property embedded_regular_signatures
- Returns:
The signature objects of type /Sig embedded in this document, in signing order; see EmbeddedPdfSignature.
- property embedded_timestamp_signatures
- Returns:
The signature objects of type /DocTimeStamp embedded in this document, in signing order; see EmbeddedPdfSignature.
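The three signature properties can be compared side by side with a helper along these lines. This is a sketch only: the helper name is hypothetical, and it assumes each returned signature object exposes a `field_name` attribute.

```python
def signature_overview(reader):
    """Group signature field names by kind, preserving signing order."""

    def names(sigs):
        # Assumption: every embedded signature object has `field_name`.
        return [sig.field_name for sig in sigs]

    return {
        "all": names(reader.embedded_signatures),
        "sig": names(reader.embedded_regular_signatures),
        "doc_timestamp": names(reader.embedded_timestamp_signatures),
    }
```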
- class pyhanko.pdf_utils.reader.HistoricalResolver(reader: PdfFileReader, revision)
Bases:
PdfHandler
PdfHandler implementation that provides a view on a particular revision of a PDF file.
Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.
Instances of this class cache the result of get_object() calls.
Danger
This class is documented, but is nevertheless considered internal API, and easy to misuse.
In particular, the container_ref attribute must not be relied upon for objects retrieved from a HistoricalResolver. Internally, it is only used to make lazy decryption work in historical revisions.
Note
Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.
- property document_meta_view: DocumentMetadata
- property document_id: Tuple[bytes, bytes]
- property trailer_view: DictionaryObject
Returns a view of the document trailer of the document represented by this PdfHandler instance.
The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).
- Returns:
A
generic.DictionaryObject
representing the current state of the document trailer.
- get_object(ref: Reference, as_metadata_stream: bool = False)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters:
ref – An instance of generic.Reference.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PDF object.
- explicit_refs_in_revision()
- refs_freed_in_revision()
- object_streams_used()
- is_ref_available(ref: Reference) bool
Check if the reference in question was in scope for this revision. This call doesn’t care about the specific semantics of free vs. used objects; it conservatively answers ‘no’ in any situation where the object ID _could_ have been assigned by the revision in question.
- Parameters:
ref – A reference object (usually one written to by a newer revision)
- Returns:
True if the reference is unassignable, False otherwise.
- collect_dependencies(obj: PdfObject, since_revision=None) Set[Reference]
Collect all indirect references used by an object and its descendants.
- Parameters:
obj – The object to inspect.
since_revision –
Optionally specify a revision number that tells the scanner to only include objects IDs that were added in that revision or later.
Warning
In particular, this means that the scanner will not recurse into older objects either.
- Returns:
A set of Reference objects.
- pyhanko.pdf_utils.reader.parse_catalog_version(version_str) Tuple[int, int] | None
- class pyhanko.pdf_utils.reader.RawPdfPath(*path: str | int)
Bases:
object
Class to model raw paths in a file.
This class is internal API.
- pyhanko.pdf_utils.reader.process_data_at_eof(stream) int
Auxiliary function that reads backwards from the current position in a stream to find the EOF marker and startxref value.
This is internal API.
- Parameters:
stream – A stream to read from
- Returns:
The value of the startxref pointer, if found. Otherwise a PdfReadError is raised.
pyhanko.pdf_utils.rw_common module
Utilities common to reading and writing PDF files.
- class pyhanko.pdf_utils.rw_common.PdfHandler
Bases:
object
Abstract class providing a general interface for querying objects in PDF readers and writers alike.
- get_object(ref: Reference, as_metadata_stream: bool = False)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters:
ref – An instance of generic.Reference.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PDF object.
- property trailer_view: DictionaryObject
Returns a view of the document trailer of the document represented by this PdfHandler instance.
The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).
- Returns:
A
generic.DictionaryObject
representing the current state of the document trailer.
- property document_meta_view: DocumentMetadata
- property root: DictionaryObject
- Returns:
The document catalog of this PDF handler.
- property document_id: Tuple[bytes, bytes]
- find_page_container(page_ix)
Retrieve the node in the page tree containing the page with index page_ix, along with the necessary objects to modify it in an incremental update scenario.
- Parameters:
page_ix – The (zero-indexed) number of the page for which we want to retrieve the parent. A negative number counts pages from the back of the document, with index -1 referring to the last page.
- Returns:
A triple with the /Pages object (or a reference to it), the index of the target page in said /Pages object, and a (possibly inherited) resource dictionary.
- find_page_for_modification(page_ix)
Retrieve the page with index page_ix from the page tree, along with the necessary objects to modify it in an incremental update scenario.
- Parameters:
page_ix – The (zero-indexed) number of the page to retrieve. A negative number counts pages from the back of the document, with index -1 referring to the last page.
- Returns:
A tuple with a reference to the page object and a (possibly inherited) resource dictionary.
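The page_ix convention used by find_page_container() and find_page_for_modification() (zero-indexed, with negative values counting from the back) can be illustrated with a standalone helper. The function below is hypothetical and only mirrors the documented convention; it says nothing about how pyHanko resolves the index internally.

```python
def normalise_page_index(page_ix: int, page_count: int) -> int:
    """Map a possibly negative page index to a zero-based position."""
    if page_ix < 0:
        page_ix += page_count  # -1 becomes the last page
    if not 0 <= page_ix < page_count:
        raise IndexError("page index out of range")
    return page_ix
```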
pyhanko.pdf_utils.text module
Utilities related to text rendering & layout.
- class pyhanko.pdf_utils.text.TextStyle(font: ~pyhanko.pdf_utils.font.api.FontEngineFactory = <factory>, font_size: int = 10, leading: int | None = None)
Bases:
ConfigurableMixin
Container for basic text styling settings.
- font: FontEngineFactory
The
FontEngineFactory
to be used for this text style. Defaults to Courier (as a non-embedded standard font).
- font_size: int = 10
Font size to be used.
- classmethod process_entries(config_dict)
Hook method that can modify the configuration dictionary to overwrite or tweak some of their values (e.g. to convert string parameters into more complex Python objects)
Subclasses that override this method should call super().process_entries(), and leave keys that they do not recognise untouched.
- Parameters:
config_dict – A dictionary containing configuration values.
- Raises:
ConfigurationError – when there is a problem processing a relevant entry.
- class pyhanko.pdf_utils.text.TextBoxStyle(font: ~pyhanko.pdf_utils.font.api.FontEngineFactory = <factory>, font_size: int = 10, leading: int | None = None, border_width: int = 0, box_layout_rule: ~pyhanko.pdf_utils.layout.SimpleBoxLayoutRule | None = None, vertical_text: bool = False)
Bases:
TextStyle
Extension of TextStyle for use in text boxes.
- border_width: int = 0
Border width, if applicable.
- box_layout_rule: SimpleBoxLayoutRule | None = None
Layout rule to nest the text within its containing box.
Warning
This only affects the position of the text object, not the alignment of the text within.
- vertical_text: bool = False
Switch layout code to vertical mode instead of horizontal mode.
- class pyhanko.pdf_utils.text.TextBox(style: TextBoxStyle, writer, resources: PdfResources | None = None, box: BoxConstraints | None = None, font_name='F1')
Bases:
PdfContent
Implementation of a text box that implements the PdfContent interface.
Note
Text boxes currently don’t offer automatic word wrapping.
- put_string_line(txt)
- property content_lines
- Returns:
Text content of the text box, broken up into lines.
- property content
- Returns:
The actual text content of the text box. This is a modifiable property.
In textboxes that don’t have a fixed size, setting this property can cause the text box to be resized.
- property leading
- Returns:
The effective leading value, i.e. the leading attribute of the associated TextBoxStyle, or font_size if not specified.
- render()
Compile the content to graphics operators.
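The fallback described for the leading property of TextBox (use the style's leading, or font_size when none is set) can be sketched with a small stand-in class. `StyleSketch` and `effective_leading` are hypothetical names. They mirror only the two documented fields and do not reproduce pyHanko's classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StyleSketch:
    """Stand-in carrying the two relevant TextStyle fields."""

    font_size: int = 10
    leading: Optional[int] = None


def effective_leading(style: StyleSketch) -> int:
    """Return the explicit leading, falling back to the font size."""
    return style.font_size if style.leading is None else style.leading
```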
pyhanko.pdf_utils.writer module
Utilities for writing PDF files. Contains code from the PyPDF2 project; see here for the original license.
- class pyhanko.pdf_utils.writer.BasePdfFileWriter(root: IndirectObject | DictionaryObject, info: IndirectObject | DictionaryObject | None, document_id: ArrayObject, obj_id_start: int = 0, stream_xrefs: bool = True)
Bases:
PdfHandler
Base class for PDF writers.
- output_version = (1, 7)
Output version to be declared in the output file.
- stream_xrefs: bool
Boolean controlling whether or not the output file will contain its cross-references in stream format, or as a classical XRef table.
The default for new files is
True
. For incremental updates, the writer adapts to the system used in the previous iteration of the document (as mandated by the standard).
- get_subset_collection(base_postscript_name: str)
- property document_meta: DocumentMetadata
- property document_meta_view: DocumentMetadata
- ensure_output_version(version)
- set_info(info: IndirectObject | DictionaryObject | None) IndirectObject | None
Set the /Info entry of the document trailer.
- Parameters:
info – The new
/Info
dictionary, as an indirect reference.
- set_custom_trailer_entry(key: NameObject, value: PdfObject)
Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.
Warning
Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.
- Parameters:
key – Dictionary key to use in the trailer.
value – Value to set
- property document_id: Tuple[bytes, bytes]
- mark_update(obj_ref: Reference | IndirectObject)
Mark an object reference to be updated. This is only relevant for incremental updates, but is included as a no-op by default for interoperability reasons.
- Parameters:
obj_ref – An indirect object instance or a reference.
- update_container(obj: PdfObject)
Mark the container of an object (as indicated by the container_ref attribute on PdfObject) for an update.
As with mark_update(), this only applies to incremental updates, but defaults to a no-op.
- Parameters:
obj – The object whose top-level container needs to be rewritten.
- update_root()
Signal that the document catalog should be written to the output. Equivalent to calling mark_update() with root_ref.
- register_extension(ext: DeveloperExtension)
- get_object(ido, as_metadata_stream: bool = False)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters:
ref – An instance of generic.Reference.
as_metadata_stream – Whether to dereference the object as an XMP metadata stream.
- Returns:
A PDF object.
- allocate_placeholder() IndirectObject
Allocate an object reference to populate later. Calls to get_object() for this reference will return NullObject until it is populated using add_object().
This method is only relevant in certain advanced contexts where an object ID needs to be known before the object it refers to can be built; chances are you’ll never need it.
- Returns:
A
IndirectObject
instance referring to the object just allocated.
- add_object(obj, obj_stream: ObjectStream | None = None, idnum=None) IndirectObject
Add a new object to this writer.
- Parameters:
obj – The object to add.
obj_stream – An object stream to add the object to.
idnum – Manually specify the object ID of the object to be added. This is only allowed for object IDs that have previously been allocated using
allocate_placeholder()
.
- Returns:
A
IndirectObject
instance referring to the object just added.
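The interplay between allocate_placeholder() and the idnum parameter of add_object() might look roughly like the sketch below. The helper and its `build_object` callback are hypothetical. `writer` is assumed to behave like a BasePdfFileWriter, and the placeholder is assumed to expose its object ID as `idnum`.

```python
def add_with_reserved_id(writer, build_object):
    """Reserve an object ID first, then build and register the object.

    `build_object` is a hypothetical callback that receives the reserved
    reference (useful when the object must refer to itself or be referred
    to by objects created while it is being built).
    """
    placeholder = writer.allocate_placeholder()
    obj = build_object(placeholder)
    # Populate the previously allocated ID instead of taking a fresh one.
    return writer.add_object(obj, idnum=placeholder.idnum)
```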
- prepare_object_stream(compress=True)
Prepare and return a new ObjectStream object.
- Parameters:
compress – Indicates whether the resulting object stream should be compressed.
- Returns:
An
ObjectStream
object.
- property trailer_view: DictionaryObject
Returns a view of the document trailer of the document represented by this PdfHandler instance.
The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).
- Returns:
A
generic.DictionaryObject
representing the current state of the document trailer.
- write(stream)
Write the contents of this PDF writer to a stream.
- Parameters:
stream – A writable output stream.
- register_annotation(page_ref, annot_ref)
Register an annotation to be added to a page. This convenience function takes care of calling mark_update() where necessary.
- Parameters:
page_ref – Reference to the page object involved.
annot_ref – Reference to the annotation object to be added.
- insert_page(new_page, after=None)
Insert a page object into the tree.
- Parameters:
new_page – Page object to insert.
after – Page number (zero-indexed) after which to insert the page.
- Returns:
A reference to the newly inserted page.
- import_object(obj: PdfObject, obj_stream: ObjectStream | None = None) PdfObject
Deep-copy an object into this writer, dealing with resolving indirect references in the process.
Danger
The table mapping indirect references in the input to indirect references in the writer is not preserved between calls. Concretely, this means that invoking import_object() twice on the same input reader may cause object duplication.
- Parameters:
obj – The object to import.
obj_stream –
The object stream to import objects into.
Note
Stream objects and bare references will not be put into the object stream; the standard forbids this.
- Returns:
The object as associated with this writer. If the input object was an indirect reference, a dictionary (incl. streams) or an array, the returned value will always be a new instance.
- import_page_as_xobject(other: PdfHandler, page_ix=0, inherit_filters=True)
Import a page content stream from some other PdfHandler into the current one as a form XObject.
- Parameters:
other – A PdfHandler
page_ix – Index of the page to copy (default: 0)
inherit_filters – Inherit the content stream’s filters, if present.
- Returns:
An
IndirectObject
referring to the page object as added to the current reader.
- add_stream_to_page(page_ix, stream_ref, resources=None, prepend=False)
Append an indirect stream object to a page in a PDF as a content stream.
- Parameters:
page_ix – Index of the page to modify. The first page has index 0.
stream_ref – IndirectObject reference to the stream object to add.
resources – Resource dictionary containing resources to add to the page’s existing resource dictionary.
prepend – Prepend the content stream to the list of content streams, as opposed to appending it to the end. This has the effect of causing the stream to be rendered underneath the already existing content on the page.
- Returns:
An
IndirectObject
reference to the page object that was modified.
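As a sketch of the prepend option, the helper below registers a stream object and places it underneath the existing page content. The helper name is hypothetical, and `writer` is assumed to behave like a BasePdfFileWriter.

```python
def add_background(writer, stream_obj, page_ix=0, resources=None):
    """Add `stream_obj` as the first content stream of a page.

    With prepend=True the new stream is rendered before (and therefore
    underneath) the content that is already on the page.
    """
    stream_ref = writer.add_object(stream_obj)
    return writer.add_stream_to_page(
        page_ix, stream_ref, resources=resources, prepend=True
    )
```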
- merge_resources(orig_dict: DictionaryObject, new_dict: DictionaryObject) bool
Update an existing resource dictionary object with data from another one. Returns True if the original dict object was modified directly.
The caller is responsible for avoiding name conflicts with existing resources.
- class pyhanko.pdf_utils.writer.PageObject(contents, media_box, resources=None)
Bases:
DictionaryObject
Subclass of
DictionaryObject
that handles some of the initialisation boilerplate for page objects.
- class pyhanko.pdf_utils.writer.PdfFileWriter(stream_xrefs=True, init_page_tree=True, info=None)
Bases:
BasePdfFileWriter
Class to write new PDF files.
- encrypt(owner_pass, user_pass=None, **kwargs)
Mark this document to be encrypted with PDF 2.0 encryption (AES-256).
Caution
While pyHanko supports legacy PDF encryption as well, the API to create new documents using outdated encryption is left largely undocumented on purpose to discourage its use.
This caveat does not apply to incremental updates added to existing documents.
Danger
The PDF 2.0 standard mandates AES-256 in CBC mode, and also includes 12 bytes of known plaintext by design. This implies that a sufficiently knowledgeable attacker can inject arbitrary content into your encrypted files without knowledge of the password.
Adding a digital signature to the encrypted document is not a foolproof way to deal with this either, since most viewers will still allow the document to be opened before signatures are validated, and therefore end users are still exposed to potentially malicious content.
Until the standard supports authenticated encryption schemes, you should never rely on its encryption provisions if tampering is a concern.
- Parameters:
owner_pass – The desired owner password.
user_pass – The desired user password (defaults to the owner password if not specified)
kwargs – Other keyword arguments to be passed to
StandardSecurityHandler.build_from_pw()
.
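A minimal sketch of the call order for password-based encryption: mark the document for encryption, then write it out. The helper name is hypothetical, and `writer` is assumed to behave like a PdfFileWriter. The caveats listed above about unauthenticated encryption still apply.

```python
def write_encrypted(writer, out_stream, owner_pass, user_pass=None):
    """Mark the document for encryption, then serialise it.

    If `user_pass` is None, the documented default (reusing the owner
    password) applies.
    """
    writer.encrypt(owner_pass, user_pass)
    writer.write(out_stream)


# Possible use with pyHanko installed (passwords and path are placeholders):
#   from pyhanko.pdf_utils.writer import PdfFileWriter
#   with open("out.pdf", "wb") as outf:
#       write_encrypted(PdfFileWriter(), outf, "ownersecret", "usersecret")
```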
- encrypt_pubkey(recipients: List[Certificate], **kwargs)
Mark this document to be encrypted with PDF 2.0 public key encryption. The certificates passed in should be RSA certificates.
PyHanko defaults to AES-256 to encrypt the actual file contents. The seed used to derive the file encryption key is also encrypted using AES-256 and bundled in a CMS EnvelopedData object. The envelope key is then encrypted separately for each recipient, using their respective public keys.
Caution
The caveats for encrypt() also apply here.
- Parameters:
recipients – Certificates of the recipients that should be able to decrypt the document.
kwargs – Other keyword arguments to be passed to
PubKeySecurityHandler.build_from_certs()
.
- set_custom_trailer_entry(key: NameObject, value: PdfObject)
Set a custom, unmanaged entry in the document trailer or cross-reference stream dictionary.
Warning
Calling this method to set an entry that is managed by pyHanko internally (info dictionary, document catalog, etc.) has undefined results.
- Parameters:
key – Dictionary key to use in the trailer.
value – Value to set
- pyhanko.pdf_utils.writer.init_xobject_dictionary(command_stream: bytes, box_width, box_height, resources: DictionaryObject | None = None) StreamObject
Helper function to initialise form XObject dictionaries.
Note
For utilities to handle image XObjects, see images.
- Parameters:
command_stream – The XObject’s raw appearance stream.
box_width – The width of the XObject’s bounding box.
box_height – The height of the XObject’s bounding box.
resources – A resource dictionary to include with the form object.
- Returns:
A
StreamObject
representation of the form XObject.
- pyhanko.pdf_utils.writer.copy_into_new_writer(input_handler: PdfHandler, writer_kwargs: dict | None = None) PdfFileWriter
Copy all objects in a given PDF handler into a new PdfFileWriter. This operation will attempt to preserve the document catalog of the original input_handler.
Very roughly, calling this function and then immediately invoking write() on the resulting writer should result in an equivalent document as far as presentation is concerned. As a general rule, behaviour that is controlled from outside the document catalog (e.g. encryption) or that requires byte-for-byte equivalence with the original (e.g. digital signatures) will not survive this translation.
- Parameters:
input_handler – PdfHandler to source objects from.
writer_kwargs – Keyword arguments to pass to the writer.
- Returns:
New
PdfFileWriter
containing all objects from the input handler.
pyhanko.pdf_utils.xref module
Internal utilities to handle the processing of cross-reference data and document trailer data.
This entire module is considered internal API.
- class pyhanko.pdf_utils.xref.XRefCache(reader, xref_sections: List[XRefSection])
Bases:
object
Internal class to parse & store information from the xref section(s) of a PDF document.
Stores both the most recent status of all xrefs in addition to their historical values.
All members of this class are considered internal API and are subject to change without notice.
- property total_revisions
- object_streams_used_in(revision)
- get_xref_container_info(revision) XRefSectionMetaInfo
- get_xref_data(revision) XRefSectionData
- explicit_refs_in_revision(revision) Set[Reference]
Look up the object refs for all objects explicitly added or overwritten in a given revision.
- Parameters:
revision – A revision number. The oldest revision is zero.
- Returns:
A set of Reference objects.
- refs_freed_in_revision(revision) Set[Reference]
Look up the object refs for all objects explicitly freed in a given revision.
- Parameters:
revision – A revision number. The oldest revision is zero.
- Returns:
A set of Reference objects.
- get_startxref_for_revision(revision) int
Look up the location of the XRef table/stream associated with a specific revision, as indicated by startxref or /Prev.
- Parameters:
revision – A revision number. The oldest revision is zero.
- Returns:
An integer pointer
- get_historical_ref(ref, revision) int | ObjStreamRef | None
Look up the location of the historical value of an object.
Note
This method is not suitable for determining whether or not a particular object ID is available in a given revision, since it treats unused objects and freed objects the same way.
- Parameters:
ref – An object reference.
revision – A revision number. The oldest revision is zero.
- Returns:
An integer offset, an object stream reference, or
None
if the reference does not resolve in the specified revision.
- property hybrid_xrefs_present: bool
Determine if a file uses hybrid references anywhere.
- Returns:
True
if hybrid references were detected,False
otherwise.
- class pyhanko.pdf_utils.xref.XRefBuilder(handler: PdfHandler, stream, strict: bool, last_startxref: int)
Bases:
object
- err_limit = 10
- read_xrefs()
- class pyhanko.pdf_utils.xref.XRefType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Different types of cross-reference entries.
- FREE = 1
A freeing instruction.
- STANDARD = 2
A regular top-level object.
- IN_OBJ_STREAM = 3
An object that is part of an object stream.
- class pyhanko.pdf_utils.xref.XRefEntry(xref_type: XRefType, location: int | ObjStreamRef | None, idnum: int, generation: int = 0)
Bases:
object
Value type representing a single cross-reference entry.
- location: int | ObjStreamRef | None
Location the cross-reference points to.
- idnum: int
The ID of the object being referenced.
- generation: int = 0
The generation number of the object being referenced.
- class pyhanko.pdf_utils.xref.ObjStreamRef(obj_stream_id: int, ix_in_stream: int)
Bases:
object
Identifies an object that’s part of an object stream.
- obj_stream_id: int
The ID number of the object stream (its generation number is presumed zero).
- ix_in_stream: int
The index of the object in the stream.
- exception pyhanko.pdf_utils.xref.ObjectHeaderReadError(msg: str, *args)
Bases:
PdfReadError
- class pyhanko.pdf_utils.xref.XRefSection(meta_info: XRefSectionMetaInfo, xref_data: XRefSectionData)
Bases:
object
Describes a cross-reference section and how it is serialised into the PDF file.
- meta_info: XRefSectionMetaInfo
Metadata about the cross-reference section.
- xref_data: XRefSectionData
A description of the actual object pointer definitions.
- class pyhanko.pdf_utils.xref.XRefSectionData
Bases:
object
Internal class for bookkeeping on a single cross-reference section, independently of the others.
- try_resolve(ref: Reference | IndirectObject) int | ObjStreamRef | None
- process_hybrid_entries(entries: Iterator[XRefEntry], xref_meta_info: XRefSectionMetaInfo, strict: bool)
- higher_generation_refs()
- class pyhanko.pdf_utils.xref.XRefSectionType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
- STANDARD = 1
- STREAM = 2
- HYBRID_MAIN = 3
- HYBRID_STREAM = 4
- class pyhanko.pdf_utils.xref.XRefSectionMetaInfo(xref_section_type: pyhanko.pdf_utils.xref.XRefSectionType, size: int, declared_startxref: int, start_location: int, end_location: int, stream_ref: pyhanko.pdf_utils.generic.Reference | None)
Bases:
object
- xref_section_type: XRefSectionType
The type of cross-reference section.
- size: int
The highest object ID in scope for this xref section.
- declared_startxref: int
Location pointed to by the startxref pointer in that revision.
- start_location: int
Actual start location of the xref data. This should be equal to declared_startxref, but in broken files that may not be the case.
- end_location: int
Location where the xref data ended.
- class pyhanko.pdf_utils.xref.TrailerDictionary
Bases:
PdfObject
The standard mandates that each trailer shall contain at least all keys used in the preceding trailer, even if unmodified. Of course, we cannot trust documents to actually follow this rule, so this class implements fallbacks.
- non_trailer_keys = {'/DecodeParms', '/Filter', '/Index', '/Length', '/Type', '/W', '/XRefStm'}
- add_trailer_revision(trailer_dict: DictionaryObject)
- raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT, revision=None)
- flatten(revision=None) DictionaryObject
- keys()
- items()
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
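The fallback idea behind TrailerDictionary (a key missing from a newer trailer is looked up in older ones) can be sketched in isolation as follows. The class below is a hypothetical stand-in working on plain dictionaries. Storing revisions oldest first is an assumption of the sketch, not a statement about pyHanko's internals.

```python
from typing import Any, Dict, List, Optional


class TrailerFallbackSketch:
    """Looks keys up in the newest trailer first, then in older ones."""

    def __init__(self) -> None:
        self._revisions: List[Dict[str, Any]] = []  # oldest first

    def add_trailer_revision(self, trailer_dict: Dict[str, Any]) -> None:
        self._revisions.append(dict(trailer_dict))

    def get(self, key: str, revision: Optional[int] = None) -> Any:
        # Start at the requested revision (default: newest) and walk back.
        last = len(self._revisions) - 1 if revision is None else revision
        for rev in range(last, -1, -1):
            if key in self._revisions[rev]:
                return self._revisions[rev][key]
        raise KeyError(key)
```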
- pyhanko.pdf_utils.xref.read_object_header(stream, strict)
- pyhanko.pdf_utils.xref.parse_xref_stream(xref_stream: StreamObject, strict: bool = True) Iterator[XRefEntry]
Parse a single cross-reference stream and yield its entries one by one.
This is internal API.
- Parameters:
xref_stream – A StreamObject.
strict – Boolean indicating whether we’re running in strict mode.
- Returns:
A generator object yielding
XRefEntry
objects.
- pyhanko.pdf_utils.xref.parse_xref_table(stream) Iterator[XRefEntry]
Parse a single cross-reference table and yield its entries one by one.
This is internal API.
- Parameters:
stream – A file-like object pointed to the start of the cross-reference table.
- Returns:
A generator object yielding
XRefEntry
objects.
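For readers unfamiliar with the classical table layout, the following standalone sketch parses a cross-reference table given as decoded text lines. It is a simplified illustration rather than pyHanko's parser: the function name and the tuple layout are hypothetical, and it performs none of the validation a real reader needs.

```python
from typing import Iterator, List, Tuple


def parse_xref_lines(lines: List[str]) -> Iterator[Tuple[int, int, int, bool]]:
    """Yield (idnum, generation, value, in_use) for each table entry.

    For in-use entries `value` is the byte offset of the object; for free
    entries it is the object number of the next free object.
    """
    it = iter(lines)
    if next(it, "").strip() != "xref":
        raise ValueError("expected the 'xref' keyword")
    for header in it:
        header = header.strip()
        if not header or header == "trailer":
            break  # end of the cross-reference table
        # Each subsection starts with "<first object number> <entry count>".
        start, count = (int(part) for part in header.split())
        for idnum in range(start, start + count):
            # A truncated table makes this unpacking fail with ValueError.
            value, generation, marker = next(it, "").split()
            yield idnum, int(generation), int(value), marker == "n"
```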
- pyhanko.pdf_utils.xref.write_xref_table(stream, position_dict: Dict[Tuple[int, int], int])
- class pyhanko.pdf_utils.xref.ObjectStream(compress=True)
Bases:
object
Utility class to collect objects into a PDF object stream.
Object streams are mainly useful for space efficiency reasons. They allow related objects to be grouped & compressed together in a more flexible manner.
Warning
Object streams can only be used in files with a cross-reference stream, as opposed to a classical XRef table. In particular, this means that incremental updates to files with a legacy XRef table cannot contain object streams either. See § 7.5.7 in ISO 32000-1 for further details.
Danger
Use BasePdfFileWriter.prepare_object_stream() to create instances of object streams. The __init__ function is internal API.
- add_object(idnum: int, obj: PdfObject)
Add an object to an object stream. Note that objects in object streams always have their generation number set to 0 by definition.
- Parameters:
idnum – The object’s ID number.
obj – The object to embed into the object stream.
- Raises:
TypeError – Raised if obj is an instance of StreamObject or IndirectObject.
- as_pdf_object() StreamObject
Render the object stream to a PDF stream object.
- Returns:
An instance of
StreamObject
.
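In line with the advice to obtain object streams from the writer, a typical flow might look like the sketch below. The helper name is hypothetical, and `writer` is assumed to behave like a BasePdfFileWriter that emits cross-reference streams. The objects passed in must not be streams or bare references.

```python
def add_objects_compressed(writer, objects):
    """Group several non-stream objects into one compressed object stream."""
    obj_stream = writer.prepare_object_stream(compress=True)
    # Passing obj_stream lets the writer handle ID assignment and embedding.
    return [writer.add_object(obj, obj_stream=obj_stream) for obj in objects]
```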
- class pyhanko.pdf_utils.xref.XRefStream(position_dict: Dict[Tuple[int, int], int | Tuple[int, int]])
Bases:
StreamObject
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters:
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler