pyhanko.pdf_utils.reader module

Utility to read PDF files. Contains code from the PyPDF2 project; see here for the original license.

The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.

class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict=True)

Bases: pyhanko.pdf_utils.rw_common.PdfHandler

Class implementing functionality to read a PDF file and cache certain data about it.

last_startxref = None
has_xref_stream = False
property trailer_view

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns

A generic.DictionaryObject representing the current state of the document trailer.

property root_ref
Returns

A reference to the document catalog of this PDF handler.

property document_id
get_historical_root(revision: int)

Get the document catalog for a specific revision.

Parameters

revision – The revision to query, the oldest one being 0.

Returns

The value of the document catalog dictionary for that revision.

property total_revisions
Returns

The total number of revisions made to this file.
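As a rough illustration of how total_revisions and get_historical_root() fit together, the sketch below walks every revision from oldest to newest. The helper names open_reader and catalog_per_revision are hypothetical (not part of pyHanko), and the loop assumes that valid revision numbers run from 0 through total_revisions - 1.

```python
def open_reader(path):
    """Hypothetical helper: open a PDF file with pyHanko's PdfFileReader.

    The import is deferred so the rest of this sketch can be read (and run)
    without pyHanko installed. The caller must close the returned handle.
    """
    from pyhanko.pdf_utils.reader import PdfFileReader

    handle = open(path, "rb")
    return handle, PdfFileReader(handle)


def catalog_per_revision(reader):
    """Return [(revision, catalog), ...] with the oldest revision (0) first.

    Assumption: revisions are numbered 0 .. total_revisions - 1.
    """
    return [
        (revision, reader.get_historical_root(revision))
        for revision in range(reader.total_revisions)
    ]
```

Because catalog_per_revision only touches the two members documented above, it works with any object exposing them, which keeps the sketch easy to try out against a stand-in reader.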

get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True)

Read an object from the input stream.

Parameters
  • ref – Reference to the object.

  • revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.

  • never_decrypt – Skip decryption step (only needed for parsing /Encrypt)

  • transparent_decrypt

    If True, all encrypted objects are transparently decrypted by default, so that code using the API in a PyPDF2-compatible way only ever “sees” decrypted objects. If False, this method may return a proxy object that still allows access to the “original”.

    Danger

    The encryption parameters are considered internal, undocumented API, and subject to change without notice.

Returns

A PdfObject.

Raises

PdfReadError – Raised if there is an issue reading the object from the file.
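To make the revision parameter more concrete, here is a minimal sketch that compares the current value of a reference with its value in an older revision. The function name differs_from_revision is hypothetical, and the comparison relies on plain == between the returned objects, which is an assumption about how the PDF object types compare. The sketch does not handle the case where the reference cannot be read in the older revision.

```python
def differs_from_revision(reader, ref, revision):
    """Return True if `ref` resolves to a different value now than in `revision`.

    Assumption: the objects returned by get_object() support ==.
    A historical lookup always bypasses the cache, per the description above.
    """
    current = reader.get_object(ref)
    historical = reader.get_object(ref, revision=revision)
    return current != historical
```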

cache_get_indirect_object(generation, idnum)
cache_indirect_object(generation, idnum, obj)
read()
decrypt(password: Union[str, bytes])

When using an encrypted PDF file with the standard PDF encryption handler, this function will allow the file to be decrypted. It checks the given password against the document’s user password and owner password, and then stores the resulting decryption key if either password is correct.

Both legacy encryption schemes and PDF 2.0 encryption (based on AES-256) are supported.

Danger

Supplying either user or owner password will work. Cryptographically, both allow the decryption key to be computed, but processors are expected to adhere to the /P flags in the encryption dictionary when accessing a file with the user password. Currently, pyHanko does not enforce these restrictions, but it may in the future.

Danger

One should also be aware that the legacy encryption schemes used prior to PDF 2.0 are (very) weak, and we only support them for compatibility reasons. Under no circumstances should these still be used to encrypt new files.

Parameters

password – The password to match.
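A cautious way to use decrypt() is to call it only when the document reports itself as encrypted. The sketch below is one possible shape for that check; unlock_if_needed is a hypothetical helper name, and the code deliberately ignores whatever decrypt() returns, because how a wrong password is reported is not described here.

```python
def unlock_if_needed(reader, password):
    """Call decrypt() only when the reader's `encrypted` property is true.

    Returns True if a decryption attempt was made, False otherwise.
    This says nothing about whether the password was actually accepted.
    """
    if reader.encrypted:
        reader.decrypt(password)
        return True
    return False
```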

decrypt_pubkey(credential: pyhanko.pdf_utils.crypt.EnvelopeKeyDecrypter)

Decrypt a PDF file encrypted using public-key encryption by providing a credential representing the private key of one of the recipients.

Danger

The same caveats as in decrypt() w.r.t. permission handling apply to this method.

Danger

The robustness of the public key cipher being used is not the only factor in the security of public-key encryption in PDF. The standard still permits weak schemes to encrypt the actual file data and file keys. PyHanko uses sane defaults everywhere, but other software may not.

Parameters

credential – The EnvelopeKeyDecrypter handling the recipient’s private key.

property encrypted
Returns

True if a document is encrypted, False otherwise.

get_historical_resolver(revision: int) → pyhanko.pdf_utils.reader.HistoricalResolver

Return a PdfHandler instance that provides a view on the file at a specific revision.

Parameters

revision – The revision number to use, with 0 being the oldest.

Returns

An instance of HistoricalResolver.
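The following sketch shows one way a historical resolver might be used: obtain the resolver for a revision, then resolve its root_ref through the resolver so that the lookup happens within that revision. The function name catalog_at_revision is hypothetical; only get_historical_resolver(), root_ref and get_object() come from this module's documentation.

```python
def catalog_at_revision(reader, revision):
    """Fetch the document catalog as seen by the given revision.

    The resolver is a PdfHandler, so it exposes root_ref and get_object()
    just like the reader does, but scoped to `revision`.
    """
    resolver = reader.get_historical_resolver(revision)
    return resolver.get_object(resolver.root_ref)
```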

property embedded_signatures
Returns

The signatures embedded in this document, in signing order; see EmbeddedPdfSignature.

class pyhanko.pdf_utils.reader.HistoricalResolver(reader: pyhanko.pdf_utils.reader.PdfFileReader, revision)

Bases: pyhanko.pdf_utils.rw_common.PdfHandler

PdfHandler implementation that provides a view on a particular revision of a PDF file.

Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.

Instances of this class cache the result of get_object() calls.

Note

Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.

property document_id
property trailer_view

Returns a view of the document trailer of the document represented by this PdfHandler instance.

The view is effectively read-only, in the sense that any writes will not be reflected in the actual trailer (if the handler supports writing, that is).

Returns

A generic.DictionaryObject representing the current state of the document trailer.

get_object(ref: pyhanko.pdf_utils.generic.Reference)

Retrieve the object associated with the provided reference from this PDF handler.

Parameters

ref – An instance of generic.Reference.

Returns

A PDF object.

property root_ref
Returns

A reference to the document catalog of this PDF handler.

explicit_refs_in_revision()
refs_freed_in_revision()
object_streams_used()
is_ref_available(ref: pyhanko.pdf_utils.generic.Reference) → bool

Check if the reference in question would already point to an object in this revision.

Parameters

ref – A reference object (usually one written to by a newer revision).

Returns

True if the reference is undefined in this revision (that is, it does not yet point to an object), False otherwise.

collect_dependencies(obj: pyhanko.pdf_utils.generic.PdfObject, since_revision=None)

Collect all indirect references used by an object and its descendants.

Parameters
  • obj – The object to inspect.

  • since_revision

    Optionally specify a revision number that tells the scanner to only include object IDs that were added in that revision or later.

    Warning

    In particular, this means that the scanner will not recurse into older objects either.

Returns

A set of Reference objects.
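As a sketch of how collect_dependencies() and is_ref_available() might be combined, the function below gathers the references an object depends on in a newer revision and keeps those that were still undefined in an older one. The name new_dependencies and its parameters are hypothetical, and the approach assumes that Reference objects are hashable (which the set return type suggests).

```python
def new_dependencies(reader, obj_ref, old_revision, new_revision):
    """References used by `obj_ref` in `new_revision` that were undefined
    in `old_revision`.

    Assumption: is_ref_available() returns True for references that do not
    yet point to an object in the resolver's revision, as documented above.
    """
    newer = reader.get_historical_resolver(new_revision)
    older = reader.get_historical_resolver(old_revision)

    # All indirect references reachable from the object in the newer revision.
    deps = newer.collect_dependencies(newer.get_object(obj_ref))

    # Keep only the ones the older revision had not defined yet.
    return {ref for ref in deps if older.is_ref_available(ref)}
```

This scans the full dependency tree rather than passing since_revision, so it is not affected by the warning above about the scanner skipping older objects, at the cost of doing more work.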