pyhanko.pdf_utils.xref module

Internal utilities for processing cross-reference data and document trailer data.

This entire module is considered internal API.

class pyhanko.pdf_utils.xref.XRefCache(reader, xref_sections: List[XRefSection])

Bases: object

Internal class to parse & store information from the xref section(s) of a PDF document.

Stores both the most recent status of all xrefs and their historical values.

All members of this class are considered internal API and are subject to change without notice.

property total_revisions

get_last_change(ref: Reference)

object_streams_used_in(revision)

get_introducing_revision(ref: Reference)

get_xref_container_info(revision) → XRefSectionMetaInfo

get_xref_data(revision) → XRefSectionData

explicit_refs_in_revision(revision) → Set[Reference]

Look up the object refs for all objects explicitly added or overwritten in a given revision.

Parameters: revision – A revision number. The oldest revision is zero.
Returns: A set of Reference objects.

refs_freed_in_revision(revision) → Set[Reference]

Look up the object refs for all objects explicitly freed in a given revision.

Parameters: revision – A revision number. The oldest revision is zero.
Returns: A set of Reference objects.

get_startxref_for_revision(revision) → int

Look up the location of the XRef table/stream associated with a specific revision, as indicated by startxref or /Prev.

Parameters: revision – A revision number. The oldest revision is zero.
Returns: An integer file offset.
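For the most recent revision, the pointer in question is the integer that follows the last startxref keyword in the file; older revisions are reached through the /Prev entry of each trailer. The following is a minimal sketch of the first step only, working on a raw byte string. The function name find_last_startxref is hypothetical and is not part of pyHanko.

```python
def find_last_startxref(data: bytes) -> int:
    """Return the offset announced by the final ``startxref`` keyword."""
    marker = b"startxref"
    ix = data.rfind(marker)
    if ix == -1:
        raise ValueError("no startxref keyword found")
    # The offset is the first whitespace-delimited token after the keyword.
    tokens = data[ix + len(marker):].split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("startxref is not followed by an integer")
    return int(tokens[0])


# Illustrative file tail (hypothetical data, not a complete PDF).
pdf_tail = b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n116\n%%EOF\n"
last_startxref = find_last_startxref(pdf_tail)
```

The sketch does not follow /Prev and does not validate that the offset actually points at xref data.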

get_historical_ref(ref, revision) → Optional[Union[int, ObjStreamRef]]

Look up the location of the historical value of an object.

Note

This method is not suitable for determining whether or not a particular object ID is available in a given revision, since it treats unused objects and freed objects the same way.

Parameters

ref – An object reference.
revision – A revision number. The oldest revision is zero.

Returns

An integer offset, an object stream reference, or None if the reference does not resolve in the specified revision.
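The lookup semantics described above, including the caveat in the note, can be illustrated with a small model. This is a sketch under simplifying assumptions, not pyHanko's actual data structure: each revision is modelled as a plain dictionary (the hypothetical RevisionTable type), references are (ID, generation) tuples, and object stream references are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for per-revision xref data: each revision maps
# (object ID, generation) to a byte offset, or to None for a freed object.
RevisionTable = Dict[Tuple[int, int], Optional[int]]


def historical_location(
    revisions: List[RevisionTable], ref: Tuple[int, int], revision: int
) -> Optional[int]:
    """Return the location of ``ref`` as seen from ``revision``, or None."""
    # Walk backwards from the requested revision to the oldest one.
    for table in reversed(revisions[: revision + 1]):
        if ref in table:
            # The most recent entry wins, including a freeing entry (None).
            return table[ref]
    # The reference was never defined up to this revision.
    return None


revisions = [
    {(1, 0): 15, (2, 0): 120},  # revision 0
    {(2, 0): 480},              # revision 1 overwrites object 2
    {(1, 0): None},             # revision 2 frees object 1
]
```

In this model, a freed object (object 1 as of revision 2) and an object that was never defined (say, object 3) both yield None, which is why a result of None says nothing about whether the ID is available.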

property hybrid_xrefs_present: bool

Determine if a file uses hybrid references anywhere.

Returns: True if hybrid references were detected, False otherwise.

class pyhanko.pdf_utils.xref.XRefBuilder(handler: PdfHandler, stream, strict: bool, last_startxref: int)

Bases: object

err_limit = 10

read_xrefs()

class pyhanko.pdf_utils.xref.XRefType(value)

Bases: Enum

Different types of cross-reference entries.

FREE = 1: A freeing instruction.

STANDARD = 2: A regular top-level object.

IN_OBJ_STREAM = 3: An object that is part of an object stream.

class pyhanko.pdf_utils.xref.XRefEntry(xref_type: XRefType, location: Optional[Union[int, ObjStreamRef]], idnum: int, generation: int = 0)

Bases: object

Value type representing a single cross-reference entry.

xref_type: XRefType: The type of cross-reference entry.

location: Optional[Union[int, ObjStreamRef]]: Location the cross-reference points to.

idnum: int: The ID of the object being referenced.

generation: int = 0: The generation number of the object being referenced.

class pyhanko.pdf_utils.xref.ObjStreamRef(obj_stream_id: int, ix_in_stream: int)

Bases: object

Identifies an object that’s part of an object stream.

obj_stream_id: int: The ID number of the object stream (its generation number is presumed zero).

ix_in_stream: int: The index of the object in the stream.

exception pyhanko.pdf_utils.xref.ObjectHeaderReadError(msg: str, *args)

Bases: PdfReadError

class pyhanko.pdf_utils.xref.XRefSection(meta_info: XRefSectionMetaInfo, xref_data: XRefSectionData)

Bases: object

Describes a cross-reference section and how it is serialised into the PDF file.

meta_info: XRefSectionMetaInfo: Metadata about the cross-reference section.

xref_data: XRefSectionData: A description of the actual object pointer definitions.

class pyhanko.pdf_utils.xref.XRefSectionData

Bases: object

Internal class for bookkeeping on a single cross-reference section, independently of the others.

try_resolve(ref: Union[Reference, IndirectObject]) → Optional[Union[int, ObjStreamRef]]

process_entries(entries: Iterator[XRefEntry], strict: bool)

process_hybrid_entries(entries: Iterator[XRefEntry], xref_meta_info: XRefSectionMetaInfo, strict: bool)

higher_generation_refs()

class pyhanko.pdf_utils.xref.XRefSectionType(value)

Bases: Enum

The type of a cross-reference section.

STANDARD = 1

STREAM = 2

HYBRID_MAIN = 3

HYBRID_STREAM = 4

class pyhanko.pdf_utils.xref.XRefSectionMetaInfo(xref_section_type: XRefSectionType, size: int, declared_startxref: int, start_location: int, end_location: int, stream_ref: Optional[Reference])

Bases: object

xref_section_type: XRefSectionType: The type of cross-reference section.

size: int: The highest object ID in scope for this xref section.

declared_startxref: int: Location pointed to by the startxref pointer in that revision.

start_location: int: Actual start location of the xref data. This should be equal to declared_startxref, but in broken files that may not be the case.

end_location: int: Location where the xref data ended.

stream_ref: Optional[Reference]: Reference to the relevant xref stream, if applicable.

class pyhanko.pdf_utils.xref.TrailerDictionary

Bases: PdfObject

The standard mandates that each trailer shall contain at least all keys used in the preceding trailer, even if unmodified. Of course, we cannot trust documents to actually follow this rule, so this class implements fallbacks.
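The fallback idea can be sketched as follows: when a key is missing from the trailer of the requested revision, older trailers are consulted in turn. This is only an illustration with plain dictionaries; the function trailer_lookup is hypothetical and does not reflect how TrailerDictionary is implemented internally.

```python
from typing import Any, Dict, List, Optional


def trailer_lookup(
    trailers: List[Dict[str, Any]], key: str, revision: Optional[int] = None
) -> Any:
    """Look up ``key`` for ``revision``, falling back to older trailers."""
    if revision is None:
        revision = len(trailers) - 1
    # Try the requested revision first, then progressively older trailers.
    for trailer in reversed(trailers[: revision + 1]):
        if key in trailer:
            return trailer[key]
    raise KeyError(key)


# Illustrative data: the second trailer omits /Info, contrary to the standard.
trailers = [
    {"/Root": "1 0 R", "/Info": "5 0 R", "/Size": 6},
    {"/Root": "1 0 R", "/Size": 9},
]
```

Here trailer_lookup(trailers, "/Info") still resolves through the older trailer, while "/Size" is taken from whichever revision was requested.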

non_trailer_keys = {'/DecodeParms', '/Filter', '/Index', '/Length', '/Type', '/W', '/XRefStm'}

add_trailer_revision(trailer_dict: DictionaryObject)

raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT, revision=None)

flatten(revision=None) → DictionaryObject

keys()

items()

write_to_stream(stream, handler=None, container_ref=None)

Render this object to an output stream.

Parameters

stream – An output stream.
container_ref – Local encryption key.
handler – Security handler

pyhanko.pdf_utils.xref.read_object_header(stream, strict)

pyhanko.pdf_utils.xref.parse_xref_stream(xref_stream: StreamObject, strict: bool = True) → Iterator[XRefEntry]

Parse a single cross-reference stream and yield its entries one by one.

This is internal API.

Parameters

xref_stream – A StreamObject.
strict – Boolean indicating whether we’re running in strict mode.

Returns

A generator object yielding XRefEntry objects.
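To give an idea of what parsing a cross-reference stream involves, the sketch below decodes the fixed-width binary rows defined in ISO 32000-1, § 7.5.8. Each row has three big-endian fields whose byte widths come from the stream's /W array; the first field is the entry type (0 for free, 1 for a regular object, 2 for an object in an object stream). The function decode_xref_rows is hypothetical, assumes the stream data has already been decoded, and ignores /Index subsections; it is not pyHanko's parser.

```python
from typing import Iterator, List, Tuple


def decode_xref_rows(
    data: bytes, widths: List[int], first_id: int = 0
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (object ID, type, field 2, field 3) for each fixed-width row."""
    row_len = sum(widths)
    for row_ix in range(len(data) // row_len):
        row = data[row_ix * row_len : (row_ix + 1) * row_len]
        fields = []
        pos = 0
        for w in widths:
            # Each field is a big-endian unsigned integer of w bytes.
            fields.append(int.from_bytes(row[pos : pos + w], "big"))
            pos += w
        # A zero-width type field means the type defaults to 1.
        if widths[0] == 0:
            fields[0] = 1
        yield (first_id + row_ix, fields[0], fields[1], fields[2])


# Three rows with /W [1 2 1]: a free entry, an object at offset 17,
# and an object stored at index 3 of object stream 5.
raw = bytes.fromhex("000000ff" "01001100" "02000503")
rows = list(decode_xref_rows(raw, [1, 2, 1]))
```

For type 1 rows the second and third fields are the byte offset and generation; for type 2 rows they are the object stream's ID and the index within that stream.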

pyhanko.pdf_utils.xref.parse_xref_table(stream) → Iterator[XRefEntry]

Parse a single cross-reference table and yield its entries one by one.

This is internal API.

Parameters: stream – A file-like object positioned at the start of the cross-reference table.
Returns: A generator object yielding XRefEntry objects.
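For reference, a classical cross-reference table consists of the xref keyword followed by one or more subsections, each introduced by a "first ID, count" header and containing one line per entry (offset, generation, and an n or f marker). The sketch below reads that layout from a binary stream. It is a simplified illustration with a hypothetical name (iter_xref_table), performs little error checking, and should not be taken as the behaviour of parse_xref_table.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def iter_xref_table(stream: BinaryIO) -> Iterator[Tuple[int, int, int, bool]]:
    """Yield (object ID, offset, generation, in_use) for each table entry."""
    if stream.readline().strip() != b"xref":
        raise ValueError("expected 'xref' keyword")
    while True:
        pos = stream.tell()
        header = stream.readline().split()
        # A subsection header consists of exactly two integers:
        # the first object ID and the number of entries.
        if len(header) != 2 or not header[0].isdigit():
            stream.seek(pos)  # probably the 'trailer' keyword; leave it unread
            return
        first_id, count = int(header[0]), int(header[1])
        for ix in range(count):
            offset, generation, marker = stream.readline().split()
            yield (first_id + ix, int(offset), int(generation), marker == b"n")


# Illustrative table with a single subsection covering objects 0 through 2.
table = (
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
)
entries = list(iter_xref_table(io.BytesIO(table)))
```

The generator stops at the first line that is not a subsection header and rewinds the stream, so the caller can go on to read the trailer.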

pyhanko.pdf_utils.xref.write_xref_table(stream, position_dict: Dict[Tuple[int, int], int])

class pyhanko.pdf_utils.xref.ObjectStream(compress=True)

Bases: object

Utility class to collect objects into a PDF object stream.

Object streams are mainly useful for space efficiency reasons. They allow related objects to be grouped & compressed together in a more flexible manner.

Warning

Object streams can only be used in files with a cross-reference stream, as opposed to a classical XRef table. In particular, this means that incremental updates to files with a legacy XRef table cannot contain object streams either. See § 7.5.7 in ISO 32000-1 for further details.

Danger

Use BasePdfFileWriter.prepare_object_stream() to create instances of object streams. The __init__ function is internal API.

add_object(idnum: int, obj: PdfObject)

Add an object to an object stream. Note that objects in object streams always have their generation number set to 0 by definition.

Parameters

idnum – The object’s ID number.
obj – The object to embed into the object stream.

Raises

TypeError – Raised if obj is an instance of StreamObject or IndirectObject.

as_pdf_object() → StreamObject

Render the object stream to a PDF stream object.

Returns: An instance of StreamObject.
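As background on what an object stream contains, the sketch below assembles the uncompressed payload described in ISO 32000-1, § 7.5.7: a header of "object ID, offset" integer pairs, followed by the serialised objects. The stream dictionary's /N entry gives the number of objects and /First the byte offset of the first object. The function build_object_stream_payload is hypothetical and takes already-serialised objects as bytes; in practice, ObjectStream instances obtained through BasePdfFileWriter.prepare_object_stream() take care of this.

```python
from typing import List, Tuple


def build_object_stream_payload(
    objects: List[Tuple[int, bytes]]
) -> Tuple[bytes, int, int]:
    """Return (payload, N, First) for already-serialised objects."""
    header_parts = []
    body = b""
    for idnum, serialised in objects:
        # Offsets are relative to the first object, i.e. to /First.
        header_parts.append(b"%d %d" % (idnum, len(body)))
        body += serialised + b"\n"
    header = b" ".join(header_parts) + b"\n"
    return header + body, len(objects), len(header)


# Illustrative input: two small objects with IDs 11 and 12.
payload, n, first = build_object_stream_payload(
    [(11, b"<< /Type /Catalog >>"), (12, b"[1 2 3]")]
)
```

No generation numbers appear in the header, consistent with objects in object streams always having generation 0.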

class pyhanko.pdf_utils.xref.XRefStream(position_dict: Dict[Tuple[int, int], Union[int, Tuple[int, int]]])

Bases: StreamObject

write_to_stream(stream, handler=None, container_ref=None)

Render this object to an output stream.

Parameters

stream – An output stream.
container_ref – Local encryption key.
handler – Security handler