pyhanko.pdf_utils.reader module

Utility to read PDF files. Contains code from the PyPDF2 project; see that project's license for the original terms.

The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.

class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict=True)

Bases: PdfHandler

Class implementing functionality to read a PDF file and cache certain data about it.

last_startxref = None
has_xref_stream = False
xrefs: XRefCache
property document_meta_view: DocumentMetadata
property input_version
property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns

A generic.DictionaryObject representing the current state of the document trailer.

property root_ref: Reference
Returns

A reference to the document catalog of this PDF handler.

property document_id: Tuple[bytes, bytes]
get_historical_root(revision: int)

Get the document catalog for a specific revision.

Parameters

revision – The revision to query, the oldest one being 0.

Returns

The value of the document catalog dictionary for that revision.

property total_revisions: int
Returns

The total number of revisions made to this file.
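
The revision numbering used by get_historical_root() and total_revisions can be illustrated with a short sketch. It assumes reader is an already constructed PdfFileReader. The helper name catalog_keys_per_revision is hypothetical and not part of pyHanko.

```python
def catalog_keys_per_revision(reader):
    """Map each revision number to the sorted keys of its document catalog.

    `reader` is expected to be a pyhanko.pdf_utils.reader.PdfFileReader.
    Only `total_revisions` and `get_historical_root()` are used.
    """
    history = {}
    # Revisions are numbered from 0 (oldest) to total_revisions - 1 (newest).
    for revision in range(reader.total_revisions):
        catalog = reader.get_historical_root(revision)
        # The catalog is a dictionary object, so its keys can be listed.
        history[revision] = sorted(str(key) for key in catalog.keys())
    return history
```

With a file on disk, this could be called as catalog_keys_per_revision(PdfFileReader(open(path, 'rb'))), keeping the file open for as long as the reader is in use. A key that appears only in later revisions suggests that the corresponding feature was added through an incremental update.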

get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True, as_metadata_stream=False)

Read an object from the input stream.

Parameters
  • ref – Reference to the object.

  • revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.

  • never_decrypt – Skip decryption step (only needed for parsing /Encrypt)

  • transparent_decrypt

    If True, encrypted objects are transparently decrypted by default, so that code using the API in a PyPDF2-compatible way only “sees” decrypted objects. If False, this method may return a proxy object that still allows access to the “original”.

    Danger

    The encryption parameters are considered internal, undocumented API, and subject to change without notice.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns

A PdfObject.

Raises

PdfReadError – Raised if there is an issue reading the object from the file.
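
As a minimal sketch of the revision parameter, the helper below traces a single reference through every revision of a file. The name object_history is hypothetical; only get_object() and total_revisions from this class are used.

```python
def object_history(reader, ref):
    """Return (revision, value) pairs for one reference, oldest first.

    `reader` is expected to be a PdfFileReader and `ref` a reference to an
    indirect object, for example `reader.root_ref`.
    """
    values = []
    for revision in range(reader.total_revisions):
        # Passing `revision` returns the historical value and bypasses the cache.
        values.append((revision, reader.get_object(ref, revision=revision)))
    return values
```

How a reference that did not yet exist in an older revision is reported is not covered here, so callers may want to be prepared for a PdfReadError.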

cache_get_indirect_object(generation, idnum)
cache_indirect_object(generation, idnum, obj)
read()
decrypt(password: Union[str, bytes]) → AuthResult

When using an encrypted PDF file with the standard PDF encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

Both legacy encryption schemes and PDF 2.0 encryption (based on AES-256) are supported.

Danger

Supplying either user or owner password will work. Cryptographically, both allow the decryption key to be computed, but processors are expected to adhere to the /P flags in the encryption dictionary when accessing a file with the user password. Currently, pyHanko does not enforce these restrictions, but it may in the future.

Danger

One should also be aware that the legacy encryption schemes used prior to PDF 2.0 are (very) weak, and we only support them for compatibility reasons. Under no circumstances should these still be used to encrypt new files.

Parameters

password – The password to match.
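
A typical flow is to check the encrypted property before attempting decryption. The sketch below assumes reader is a PdfFileReader. The helper name unlock is hypothetical, and because the fields of AuthResult are not described in this section, the result is returned unchanged for the caller to inspect.

```python
def unlock(reader, password):
    """Try to decrypt `reader` with `password` if the file is encrypted.

    Returns the AuthResult from decrypt(), or None for unencrypted files.
    """
    if not reader.encrypted:
        # Nothing to do: the document is not encrypted.
        return None
    # Either the user password or the owner password is accepted here.
    return reader.decrypt(password)
```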

decrypt_pubkey(credential: EnvelopeKeyDecrypter) → AuthResult

Decrypt a PDF file encrypted using public-key encryption by providing a credential representing the private key of one of the recipients.

Danger

The same caveats as in decrypt() w.r.t. permission handling apply to this method.

Danger

The robustness of the public key cipher being used is not the only factor in the security of public-key encryption in PDF. The standard still permits weak schemes to encrypt the actual file data and file keys. PyHanko uses sane defaults everywhere, but other software may not.

Parameters

credential – The EnvelopeKeyDecrypter handling the recipient’s private key.

property encrypted
Returns

True if a document is encrypted, False otherwise.

get_historical_resolver(revision: int) → HistoricalResolver

Return a PdfHandler instance that provides a view on the file at a specific revision.

Parameters

revision – The revision number to use, with 0 being the oldest.

Returns

An instance of HistoricalResolver.

property embedded_signatures
Returns

The signature objects embedded in this document, in signing order; see EmbeddedPdfSignature.

property embedded_regular_signatures
Returns

The signature objects of type /Sig embedded in this document, in signing order; see EmbeddedPdfSignature.

property embedded_timestamp_signatures
Returns

The signature objects of type /DocTimeStamp embedded in this document, in signing order; see EmbeddedPdfSignature.
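
The three signature properties can be combined into a quick overview of a document. This is only a sketch: the helper name signature_counts is hypothetical, and it assumes each property yields an iterable of signature objects.

```python
def signature_counts(reader):
    """Count the embedded signatures of a PdfFileReader by kind."""
    return {
        # All signature objects, in signing order.
        "all": len(list(reader.embedded_signatures)),
        # Signatures of type /Sig.
        "regular": len(list(reader.embedded_regular_signatures)),
        # Signatures of type /DocTimeStamp.
        "timestamps": len(list(reader.embedded_timestamp_signatures)),
    }
```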

class pyhanko.pdf_utils.reader.HistoricalResolver(reader: PdfFileReader, revision)

Bases: PdfHandler

PdfHandler implementation that provides a view on a particular revision of a PDF file.

Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.

Instances of this class cache the result of get_object() calls.

Danger

This class is documented, but is nevertheless considered internal API, and easy to misuse.

In particular, the container_ref attribute must not be relied upon for objects retrieved from a HistoricalResolver. Internally, it is only used to make lazy decryption work in historical revisions.

Note

Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.

property document_meta_view: DocumentMetadata
property document_id: Tuple[bytes, bytes]
property trailer_view: DictionaryObject

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns

A generic.DictionaryObject representing the current state of the document trailer.

get_object(ref: Reference, as_metadata_stream: bool = False)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters
  • ref – An instance of generic.Reference.

  • as_metadata_stream – Whether to dereference the object as an XMP metadata stream.

Returns

A PDF object.

property root_ref: Reference
Returns

A reference to the document catalog of this PDF handler.

explicit_refs_in_revision()
refs_freed_in_revision()
object_streams_used()
is_ref_available(ref: Reference) → bool

Check if the reference in question was in scope for this revision. This call doesn’t care about the specific semantics of free vs. used objects; it conservatively answers ‘no’ in any situation where the object ID could have been assigned by the revision in question.

Parameters

ref – A reference object (usually one written to by a newer revision)

Returns

True if the reference is unassignable, False otherwise.

collect_dependencies(obj: PdfObject, since_revision=None)

Collect all indirect references used by an object and its descendants.

Parameters
  • obj – The object to inspect.

  • since_revision

    Optionally specify a revision number that tells the scanner to only include object IDs that were added in that revision or later.

    Warning

    In particular, this means that the scanner will not recurse into older objects either.

Returns

A set of Reference objects.
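
To show how a resolver is obtained and used, the sketch below collects the references reachable from the document catalog as seen at a given revision. The helper name catalog_dependencies is hypothetical; it relies only on get_historical_resolver(), root_ref, get_object() and collect_dependencies() as documented above.

```python
def catalog_dependencies(reader, revision, since_revision=None):
    """Collect references used by the catalog as seen at `revision`.

    `reader` is expected to be a PdfFileReader. If `since_revision` is
    given, only object IDs added in that revision or later are included,
    and older objects are not recursed into.
    """
    # Resolvers are created through the reader, not instantiated directly.
    resolver = reader.get_historical_resolver(revision)
    # Dereference the catalog within the selected revision.
    catalog = resolver.get_object(resolver.root_ref)
    return resolver.collect_dependencies(catalog, since_revision=since_revision)
```

For example, catalog_dependencies(reader, reader.total_revisions - 1, since_revision=1) would restrict the scan of the newest revision to objects introduced after the original revision 0.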

pyhanko.pdf_utils.reader.parse_catalog_version(version_str) → Optional[Tuple[int, int]]
class pyhanko.pdf_utils.reader.RawPdfPath(*path: Union[str, int])

Bases: object

Class to model raw paths in a file.

This class is internal API.

walk_nodes(from_obj, transparent_dereference=True) → Generator[Tuple[Optional[Union[int, str]], PdfObject], None, None]
access_on(from_obj, dereference_last=True) → PdfObject
access_reference_on(from_obj) → Reference
pyhanko.pdf_utils.reader.process_data_at_eof(stream) → int

Auxiliary function that reads backwards from the current position in a stream to find the EOF marker and startxref value.

This is internal API.

Parameters

stream – A stream to read from

Returns

The value of the startxref pointer, if found. Otherwise a PdfReadError is raised.
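
The idea behind locating the startxref pointer can be sketched with the standard library alone. This is not pyHanko's implementation: it swaps the backwards read for a regular expression over the tail of the file, always starts from the end of the stream, and raises ValueError rather than PdfReadError. The name find_startxref and the tail_size parameter are hypothetical.

```python
import io
import re


def find_startxref(stream, tail_size=1024):
    """Return the startxref offset found in the last `tail_size` bytes.

    Raises ValueError if no 'startxref <offset> %%EOF' sequence is found.
    """
    # Determine the stream length, then read only the tail of the file.
    end = stream.seek(0, io.SEEK_END)
    stream.seek(max(0, end - tail_size))
    tail = stream.read()

    # Take the last occurrence, since each incremental update appends
    # its own startxref/%%EOF pair.
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", tail))
    if not matches:
        raise ValueError("startxref/%%EOF marker not found")
    return int(matches[-1].group(1))
```

Only the final pointer is returned here; a reader that audits earlier revisions would then follow the chain of older cross-reference sections starting from that offset.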