pyhanko.pdf_utils.reader module
Utility to read PDF files. Contains code from the PyPDF2 project; see here for the original license.
The implementation was tweaked with the express purpose of facilitating historical inspection and auditing of PDF files with multiple revisions through incremental updates. This comes at a cost, and future iterations of this module may offer more flexibility in terms of the level of detail with which file size is scrutinised.
-
class pyhanko.pdf_utils.reader.PdfFileReader(stream, strict=True)
Bases: pyhanko.pdf_utils.rw_common.PdfHandler
Class implementing functionality to read a PDF file and cache certain data about it.
-
last_startxref = None
-
has_xref_stream = False
-
property root_ref
- Returns
A reference to the document catalog of this PDF handler.
-
get_historical_root(revision: int)
Get the document catalog for a specific revision.
- Parameters
revision – The revision to query, the oldest one being 0.
- Returns
The value of the document catalog dictionary for that revision.
-
property total_revisions
- Returns
The total number of revisions made to this file.
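To make the notion of a "revision" concrete: each incremental update appended to a PDF normally ends with its own `startxref` offset followed by `%%EOF`. The sketch below counts those trailers in raw bytes. It is a rough heuristic for illustration only, not how this module determines `total_revisions`; the function name `guess_revision_offsets` and the sample bytes are made up for this example, and the pattern can be fooled by stream data that happens to contain the same keywords.

```python
import re
from typing import List

# Heuristic only: a revision written by an incremental update normally ends
# with "startxref <offset> %%EOF". A real reader parses the xref sections.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def guess_revision_offsets(pdf_bytes: bytes) -> List[int]:
    """Return the startxref offsets found in the data, oldest revision first."""
    return [int(match.group(1)) for match in _STARTXREF.finditer(pdf_bytes)]


# Hypothetical, heavily abbreviated file with one incremental update.
sample = (
    b"%PDF-1.7\n...original body...\nstartxref\n120\n%%EOF\n"
    b"...incremental update...\nstartxref\n480\n%%EOF\n"
)
offsets = guess_revision_offsets(sample)
print(len(offsets), offsets)
```

With this numbering, index 0 in the returned list corresponds to the oldest revision, matching the convention used by the `revision` parameters on this page.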
-
get_object(ref, revision=None, never_decrypt=False, transparent_decrypt=True)
Read an object from the input stream.
- Parameters
ref – Reference to the object.
revision – Revision number, to return the historical value of a reference. This always bypasses the cache. The oldest revision is numbered 0. See also HistoricalResolver.
never_decrypt – Skip the decryption step (only needed for parsing /Encrypt).
transparent_decrypt – If True, all encrypted objects are transparently decrypted by default (in the sense that a user of the API in a PyPDF2-compatible way would only “see” decrypted objects). If False, this method may return a proxy object that still allows access to the “original”.
Danger
The encryption parameters are considered internal, undocumented API, and subject to change without notice.
- Returns
A PdfObject.
- Raises
PdfReadError – Raised if there is an issue reading the object from the file.
-
cache_get_indirect_object(generation, idnum)
-
cache_indirect_object(generation, idnum, obj)
-
read()
-
decrypt(password: bytes) → int
When using an encrypted PDF file with the PDF legacy RC4-based encryption handler, this function allows the file to be decrypted. It checks the given password against the document’s user password and owner password, and stores the resulting decryption key if either password is correct.
Supplying either the user password or the owner password will work.
Danger
The encryption scheme implemented here is (very) weak, and it is supported only for compatibility reasons. Under no circumstances should it be used to encrypt new files.
- Parameters
password – The password to match.
- Returns
0 if the password failed, 1 if the password matched the user password, and 2 if the password matched the owner password.
- Raises
NotImplementedError – Raised if the document uses an unsupported encryption method.
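The integer return codes above are easy to mix up at call sites. The following is a minimal sketch of one way to wrap them, assuming only the `encrypted` property and the `decrypt(password)` method as documented on this page. The `DecryptResult` enum and the `try_passwords` helper are hypothetical names introduced for illustration; they are not part of the library.

```python
from enum import IntEnum


class DecryptResult(IntEnum):
    """Hypothetical labels for the documented return values of decrypt()."""
    FAILED = 0
    USER = 1
    OWNER = 2


def try_passwords(reader, candidates):
    """Try each candidate password (bytes) in order.

    Returns (password, DecryptResult) for the first password that matches,
    or (None, DecryptResult.FAILED) if none of them do.
    """
    if not reader.encrypted:
        raise ValueError("document is not encrypted")
    for password in candidates:
        result = DecryptResult(reader.decrypt(password))
        if result is not DecryptResult.FAILED:
            return password, result
    return None, DecryptResult.FAILED
```

Here `reader` is expected to be a `PdfFileReader` instance; a `NotImplementedError` raised by `decrypt()` for unsupported encryption methods is deliberately left to propagate to the caller.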
-
property encrypted
- Returns
True if a document is encrypted, False otherwise.
-
get_historical_resolver(revision: int) → pyhanko.pdf_utils.reader.HistoricalResolver
Return a PdfHandler instance that provides a view on the file at a specific revision.
- Parameters
revision – The revision number to use, with 0 being the oldest.
- Returns
An instance of HistoricalResolver.
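As an illustration of how the revision-oriented members fit together, the sketch below walks every revision from oldest to newest and fetches the document catalog as seen at that point. It relies only on names documented on this page (`total_revisions`, `get_historical_resolver()`, `root_ref`, `get_object()`); the helper name `catalogs_by_revision` is hypothetical, and `reader` is assumed to be a `PdfFileReader`.

```python
def catalogs_by_revision(reader):
    """Return the document catalog as seen in each revision, oldest first."""
    catalogs = []
    for revision in range(reader.total_revisions):
        # Each resolver is a view on the file at one specific revision.
        resolver = reader.get_historical_resolver(revision)
        # Resolve the catalog reference within that revision.
        catalogs.append(resolver.get_object(resolver.root_ref))
    return catalogs
```

Comparing consecutive entries of the returned list is one simple way to spot revisions in which the catalog itself was modified.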
-
property embedded_signatures
- Returns
The signatures embedded in this document, in signing order; see EmbeddedPdfSignature.
-
-
class pyhanko.pdf_utils.reader.HistoricalResolver(reader: pyhanko.pdf_utils.reader.PdfFileReader, revision)
Bases: pyhanko.pdf_utils.rw_common.PdfHandler
PdfHandler implementation that provides a view on a particular revision of a PDF file.
Instances of HistoricalResolver should be created by calling the get_historical_resolver() method on a PdfFileReader object.
Instances of this class cache the result of get_object() calls.
Note
Be aware that instances of this class transparently rewrite the PDF handler associated with any reference objects returned from the reader, so calling get_object() on an indirect reference object will cause the reference to be resolved within the selected revision.
-
get_object(ref: pyhanko.pdf_utils.generic.Reference)
Retrieve the object associated with the provided reference from this PDF handler.
- Parameters
ref – An instance of generic.Reference.
- Returns
A PDF object.
-
property root_ref
- Returns
A reference to the document catalog of this PDF handler.
-
collect_dependencies(obj: pyhanko.pdf_utils.generic.PdfObject, since_revision=None)
Collect all indirect references used by an object and its descendants.
- Parameters
obj – The object to inspect.
since_revision – Optionally specify a revision number that tells the scanner to only include object IDs that were added in that revision or later.
Warning
In particular, this means that the scanner will not recurse into older objects either.
- Returns
A set of Reference objects.
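The warning about `since_revision` is easiest to see on a toy object graph. The sketch below is a simplified model of this kind of traversal, not the library's implementation: `Ref`, `collect_refs`, and the `objects` and `added_in` tables are all hypothetical stand-ins, with plain dicts and lists playing the role of PDF objects.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference."""
    idnum: int


def collect_refs(obj, objects: Dict[Ref, object], added_in: Dict[Ref, int],
                 since_revision: Optional[int] = None,
                 seen: Optional[Set[Ref]] = None) -> Set[Ref]:
    """Collect references reachable from obj, honouring the revision cutoff."""
    seen = set() if seen is None else seen
    if isinstance(obj, Ref):
        if obj in seen:
            return seen  # already visited; also guards against cycles
        if since_revision is not None and added_in[obj] < since_revision:
            return seen  # older object: excluded and NOT recursed into
        seen.add(obj)
        collect_refs(objects[obj], objects, added_in, since_revision, seen)
    elif isinstance(obj, dict):
        for value in obj.values():
            collect_refs(value, objects, added_in, since_revision, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            collect_refs(item, objects, added_in, since_revision, seen)
    return seen


# Toy graph: objects 1 and 2 date from revision 0, objects 3 and 4 from revision 1.
objects = {
    Ref(1): {"/Kids": [Ref(2)]},
    Ref(2): {"/Resources": Ref(3)},
    Ref(3): {"/Name": "new font"},
    Ref(4): {"/Annot": Ref(1)},
}
added_in = {Ref(1): 0, Ref(2): 0, Ref(3): 1, Ref(4): 1}
start = {"/Pages": Ref(1), "/AcroForm": Ref(4)}

everything = collect_refs(start, objects, added_in)
recent_only = collect_refs(start, objects, added_in, since_revision=1)
```

In this toy graph, object 3 was added in revision 1 but is reachable only through the older objects 1 and 2, so the cutoff traversal does not find it. That is the practical consequence of the scanner not recursing into older objects.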
-