pyhanko.pdf_utils.reader module

Utility to read PDF files. Contains code from the PyPDF2 project; see here for the original license.

The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.

class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict=True)

Bases: pyhanko.pdf_utils.rw_common.PdfHandler

Class implementing functionality to read a PDF file and cache certain data about it.

last_startxref = None
has_xref_stream = False
property root_ref
Returns

A reference to the document catalog of this PDF handler.

get_historical_root(revision: int)

Get the document catalog for a specific revision.

Parameters

revision – The revision to query, the oldest one being 0.

Returns

The value of the document catalog dictionary for that revision.

property total_revisions
Returns

The total number of revisions made to this file.

get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True)

Read an object from the input stream.

Parameters
  • refReference to the object.

  • revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.

  • never_decrypt – Skip decryption step (only needed for parsing /Encrypt)

  • transparent_decrypt

    If True, all encrypted objects are transparently decrypted by default (in the sense that a user of the API in a PyPDF2 compatible way would only “see” decrypted objects). If False, this method may return a proxy object that still allows access to the “original”.

    Danger

    The encryption parameters are considered internal, undocumented API, and subject to change without notice.

Returns

A PdfObject.

Raises

PdfReadError – Raised if there is an issue reading the object from the file.

cache_get_indirect_object(generation, idnum)
cache_indirect_object(generation, idnum, obj)
read()
decrypt(password: bytes) → int

When using an encrypted PDF file with the PDF legacy RC4-based encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

Supplying either user or owner password will work.

Danger

One should also be aware that the encryption scheme implemented here is (very) weak, and we only support it for compatibility reasons. Under no circumstances should it still be used to encrypt new files.

Parameters

password – The password to match.

Returns

0 if the password failed, 1 if the password matched the user password, and 2 if the password matched the owner password.

Raises

NotImplementedError – Raised if the document uses an unsupported encryption method.

property encrypted
Returns

True if a document is encrypted, False otherwise.

get_historical_resolver(revision: int)pyhanko.pdf_utils.reader.HistoricalResolver

Return a PdfHandler instance that provides a view on the file at a specific revision.

Parameters

revision – The revision number to use, with 0 being the oldest.

Returns

An instance of HistoricalResolver.

property embedded_signatures
Returns

The signatures embedded in this document, in signing order; see EmbeddedPdfSignature.

class pyhanko.pdf_utils.reader.HistoricalResolver(reader: pyhanko.pdf_utils.reader.PdfFileReader, revision)

Bases: pyhanko.pdf_utils.rw_common.PdfHandler

PdfHandler implementation that provides a view on a particular revision of a PDF file.

Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.

Instances of this class cache the result of get_object() calls.

Note

Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.

get_object(ref: pyhanko.pdf_utils.generic.Reference)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters

ref – An instance of generic.Reference.

Returns

A PDF object.

property root_ref
Returns

A reference to the document catalog of this PDF handler.

collect_dependencies(obj: pyhanko.pdf_utils.generic.PdfObject, since_revision=None)

Collect all indirect references used by an object and its descendants.

Parameters
  • obj – The object to inspect.

  • since_revision

    Optionally specify a revision number that tells the scanner to only include objects IDs that were added in that revision or later.

    Warning

    In particular, this means that the scanner will not recurse into older objects either.

Returns

A set of Reference objects.