pyhanko.pdf_utils.xref module

Internal utilities to handle the processing of cross-reference data and document trailer data.

This entire module is considered internal API.

class pyhanko.pdf_utils.xref.XRefCache(reader, xref_sections: List[XRefSection])

Bases: object

Internal class to parse & store information from the xref section(s) of a PDF document.

Stores both the most recent status of all xrefs and their historical values.

All members of this class are considered internal API and are subject to change without notice.

property total_revisions
get_last_change(ref: Reference)
object_streams_used_in(revision)
get_introducing_revision(ref: Reference)
get_xref_container_info(revision) → XRefSectionMetaInfo
get_xref_data(revision) → XRefSectionData
explicit_refs_in_revision(revision) → Set[Reference]

Look up the object refs for all objects explicitly added or overwritten in a given revision.

Parameters

revision – A revision number. The oldest revision is zero.

Returns

A set of Reference objects.

refs_freed_in_revision(revision) → Set[Reference]

Look up the object refs for all objects explicitly freed in a given revision.

Parameters

revision – A revision number. The oldest revision is zero.

Returns

A set of Reference objects.
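The two lookups above can be pictured with a toy model. This is only a sketch under stated assumptions, not pyHanko's implementation: the `revisions` list, the `(object ID, generation)` tuples standing in for `Reference`, and the helper names `explicit_refs` and `freed_refs` are all hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# (object ID, generation) stands in for pyHanko's Reference type here.
Ref = Tuple[int, int]

# Hypothetical per-revision bookkeeping, oldest revision first. Each revision
# maps the refs it touches to a byte offset, or to None when it frees them.
revisions: List[Dict[Ref, Optional[int]]] = [
    {(1, 0): 15, (2, 0): 120, (3, 0): 310},  # revision 0: initial objects
    {(2, 0): 980, (4, 0): 1105},             # revision 1: overwrite 2, add 4
    {(3, 0): None},                          # revision 2: free 3
]


def explicit_refs(revision: int) -> Set[Ref]:
    """Refs explicitly added or overwritten in the given revision."""
    return {ref for ref, loc in revisions[revision].items() if loc is not None}


def freed_refs(revision: int) -> Set[Ref]:
    """Refs explicitly freed in the given revision."""
    return {ref for ref, loc in revisions[revision].items() if loc is None}


added_in_rev1 = explicit_refs(1)
freed_in_rev2 = freed_refs(2)
```

Note that each result only reflects what a single revision touched; object 1 is still available in revision 1 even though it does not appear in that revision's set.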

get_startxref_for_revision(revision) → int

Look up the location of the XRef table/stream associated with a specific revision, as indicated by startxref or /Prev.

Parameters

revision – A revision number. The oldest revision is zero.

Returns

An integer byte offset into the file.
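As a rough illustration of where such offsets come from, the sketch below collects every `startxref` value in a file. It assumes a simple, non-linearised file in which each incremental update ends with its own `startxref`/`%%EOF` pair, and it ignores `/Prev` entirely. The function name `startxref_offsets` and the sample bytes are made up for this example and are not part of pyHanko.

```python
import re
from typing import List


def startxref_offsets(pdf_bytes: bytes) -> List[int]:
    """Return the declared startxref offsets, oldest revision first.

    Assumption: every revision is terminated by 'startxref <n> %%EOF'.
    """
    found = re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes)
    if not found:
        raise ValueError("no startxref marker found")
    return [int(value) for value in found]


# Made-up skeleton of a file with one incremental update.
sample = (
    b"%PDF-1.7\n(objects)\nxref\n(table)\ntrailer\n<< /Size 4 >>\n"
    b"startxref\n116\n%%EOF\n"
    b"(more objects)\nxref\n(table)\ntrailer\n<< /Size 6 /Prev 116 >>\n"
    b"startxref\n842\n%%EOF\n"
)

offsets = startxref_offsets(sample)
```

In this toy file, index 0 of `offsets` corresponds to revision 0, which matches the convention above that the oldest revision is zero.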

get_historical_ref(ref, revision) → Optional[Union[int, ObjStreamRef]]

Look up the location of the historical value of an object.

Note

This method is not suitable for determining whether or not a particular object ID is available in a given revision, since it treats unused objects and freed objects the same way.

Parameters
  • ref – An object reference.

  • revision – A revision number. The oldest revision is zero.

Returns

An integer offset, an object stream reference, or None if the reference does not resolve in the specified revision.
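The behaviour described in the note can be sketched with a small model. This is a hypothetical illustration rather than the library's code: `history`, `StreamSlot` (a stand-in for `ObjStreamRef`) and `historical_location` are names invented for the example.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple, Union

Ref = Tuple[int, int]  # (object ID, generation), a stand-in for Reference


class StreamSlot(NamedTuple):
    """Hypothetical stand-in for ObjStreamRef."""
    obj_stream_id: int
    ix_in_stream: int


Location = Optional[Union[int, StreamSlot]]

# One dict per revision, oldest first. None marks an object freed there.
history: List[Dict[Ref, Location]] = [
    {(1, 0): 15, (2, 0): 120, (3, 0): 310},
    {(2, 0): 980, (4, 0): StreamSlot(7, 0)},
    {(3, 0): None},
]


def historical_location(ref: Ref, revision: int) -> Location:
    # Walk backwards from the requested revision to the oldest one and
    # return the first entry recorded for this reference.
    for rev in range(revision, -1, -1):
        if ref in history[rev]:
            return history[rev][ref]
    # Never defined up to this revision.
    return None


old_value = historical_location((2, 0), 0)
new_value = historical_location((2, 0), 2)
freed = historical_location((3, 0), 2)
unused = historical_location((9, 0), 2)
```

Here both `freed` and `unused` come back as `None`, which is why a lookup of this kind cannot tell a freed object apart from an object ID that was never used.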

property hybrid_xrefs_present: bool

Determine if a file uses hybrid references anywhere.

Returns

True if hybrid references were detected, False otherwise.

class pyhanko.pdf_utils.xref.XRefBuilder(handler: PdfHandler, stream, strict: bool, last_startxref: int)

Bases: object

err_limit = 10
read_xrefs()

class pyhanko.pdf_utils.xref.XRefType(value)

Bases: Enum

Different types of cross-reference entries.

FREE = 1

A freeing instruction.

STANDARD = 2

A regular top-level object.

IN_OBJ_STREAM = 3

An object that is part of an object stream.

class pyhanko.pdf_utils.xref.XRefEntry(xref_type: XRefType, location: Optional[Union[int, ObjStreamRef]], idnum: int, generation: int = 0)

Bases: object

Value type representing a single cross-reference entry.

xref_type: XRefType

The type of cross-reference entry.

location: Optional[Union[int, ObjStreamRef]]

Location the cross-reference points to.

idnum: int

The ID of the object being referenced.

generation: int = 0

The generation number of the object being referenced.

class pyhanko.pdf_utils.xref.ObjStreamRef(obj_stream_id: int, ix_in_stream: int)

Bases: object

Identifies an object that’s part of an object stream.

obj_stream_id: int

The ID number of the object stream (its generation number is presumed zero).

ix_in_stream: int

The index of the object in the stream.
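To show how the three value types above fit together, here is a self-contained sketch using stand-in definitions that mirror the documented fields. The stand-ins are assumptions made so the example runs without pyHanko installed; in particular, using `None` as the location of a `FREE` entry and the `describe` helper are choices made for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class XRefType(Enum):
    FREE = 1
    STANDARD = 2
    IN_OBJ_STREAM = 3


@dataclass(frozen=True)
class ObjStreamRef:
    obj_stream_id: int
    ix_in_stream: int


@dataclass(frozen=True)
class XRefEntry:
    xref_type: XRefType
    location: Optional[Union[int, ObjStreamRef]]
    idnum: int
    generation: int = 0


def describe(entry: XRefEntry) -> str:
    """Render an entry as a short human-readable sentence."""
    if entry.xref_type is XRefType.FREE:
        return f"object {entry.idnum} is freed"
    if isinstance(entry.location, ObjStreamRef):
        loc = entry.location
        return (f"object {entry.idnum} is item {loc.ix_in_stream} "
                f"of object stream {loc.obj_stream_id}")
    return f"object {entry.idnum} starts at byte offset {entry.location}"


entries = [
    XRefEntry(XRefType.FREE, None, idnum=3, generation=1),
    XRefEntry(XRefType.STANDARD, 1105, idnum=4),
    XRefEntry(XRefType.IN_OBJ_STREAM, ObjStreamRef(12, 0), idnum=5),
]
descriptions = [describe(entry) for entry in entries]
```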

exception pyhanko.pdf_utils.xref.ObjectHeaderReadError

Bases: PdfReadError

class pyhanko.pdf_utils.xref.XRefSection(meta_info: XRefSectionMetaInfo, xref_data: XRefSectionData)

Bases: object

Describes a cross-reference section and how it is serialised into the PDF file.

meta_info: XRefSectionMetaInfo

Metadata about the cross-reference section.

xref_data: XRefSectionData

A description of the actual object pointer definitions.

class pyhanko.pdf_utils.xref.XRefSectionData

Bases: object

Internal class for bookkeeping on a single cross-reference section, independently of the others.

try_resolve(ref: Union[Reference, IndirectObject]) → Optional[Union[int, ObjStreamRef]]
process_entries(entries: Iterator[XRefEntry], strict: bool)
process_hybrid_entries(entries: Iterator[XRefEntry], xref_meta_info: XRefSectionMetaInfo, strict: bool)
higher_generation_refs()

class pyhanko.pdf_utils.xref.XRefSectionType(value)

Bases: Enum

Different types of cross-reference sections.

STANDARD = 1
STREAM = 2
HYBRID_MAIN = 3
HYBRID_STREAM = 4

class pyhanko.pdf_utils.xref.XRefSectionMetaInfo(xref_section_type: XRefSectionType, size: int, declared_startxref: int, start_location: int, end_location: int, stream_ref: Optional[Reference])

Bases: object

xref_section_type: XRefSectionType

The type of cross-reference section.

size: int

The highest object ID in scope for this xref section.

declared_startxref: int

Location pointed to by the startxref pointer in that revision.

start_location: int

Actual start location of the xref data. This should be equal to declared_startxref, but in broken files that may not be the case.

end_location: int

Location where the xref data ended.

stream_ref: Optional[Reference]

Reference to the relevant xref stream, if applicable.

class pyhanko.pdf_utils.xref.TrailerDictionary

Bases: PdfObject

The standard mandates that each trailer shall contain at least all keys used in the preceding trailer, even if unmodified. Of course, we cannot trust documents to actually follow this rule, so this class implements fallbacks.
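One way to picture such a fallback is a lookup that starts at the newest trailer and walks back through older ones until the key is found. The sketch below is a minimal model under that assumption and is not the actual `TrailerDictionary` code; the class `TrailerFallback` and its methods are hypothetical names, and plain strings stand in for PDF objects.

```python
from typing import Any, Dict, List, Optional


class TrailerFallback:
    """Toy model: per-revision trailers with fallback to older revisions."""

    def __init__(self) -> None:
        self._revisions: List[Dict[str, Any]] = []  # oldest first

    def add_revision(self, trailer: Dict[str, Any]) -> None:
        self._revisions.append(dict(trailer))

    def get(self, key: str, revision: Optional[int] = None) -> Any:
        newest = len(self._revisions) - 1 if revision is None else revision
        # Check the requested revision first, then fall back to older ones.
        for rev in range(newest, -1, -1):
            if key in self._revisions[rev]:
                return self._revisions[rev][key]
        raise KeyError(key)


trailer = TrailerFallback()
trailer.add_revision({"/Root": "1 0 R", "/Info": "2 0 R", "/Size": 5})
# This update omits /Info, which the standard does not allow.
trailer.add_revision({"/Root": "1 0 R", "/Size": 8})

info = trailer.get("/Info")
size_now = trailer.get("/Size")
size_then = trailer.get("/Size", revision=0)
```

With this approach a key dropped by a non-conforming update still resolves to its most recent known value, while keys that were properly repeated resolve to the newest one.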

non_trailer_keys = {'/DecodeParms', '/Filter', '/Index', '/Length', '/Type', '/W', '/XRefStm'}
add_trailer_revision(trailer_dict: DictionaryObject)
raw_get(key, decrypt=True, revision=None)
flatten(revision=None) → DictionaryObject
keys()
items()
write_to_stream(stream, handler=None, container_ref=None)

Abstract method to render this object to an output stream.

Parameters
  • stream – An output stream.

  • handler – Security handler.

  • container_ref – Local encryption key.

pyhanko.pdf_utils.xref.read_object_header(stream, strict)
pyhanko.pdf_utils.xref.parse_xref_stream(xref_stream: StreamObject, strict: bool = True) → Iterator[XRefEntry]

Parse a single cross-reference stream and yield its entries one by one.

This is internal API.

Parameters
  • xref_stream – A StreamObject.

  • strict – Boolean indicating whether we’re running in strict mode.

Returns

A generator object yielding XRefEntry objects.
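For orientation, the sketch below decodes the fixed-width rows of an already decompressed cross-reference stream, following the PDF convention that field type 0 marks a free entry, 1 a regular object and 2 an object in an object stream. It is a simplified stand-in, not pyHanko's parser: `decode_xref_rows` is a hypothetical name, the field widths are passed in directly instead of being read from `/W`, and only a single `/Index` subsection is handled.

```python
from typing import Iterator, List, Tuple


def decode_xref_rows(
    data: bytes, widths: List[int], first_id: int = 0
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (idnum, type, field2, field3) for each row in the stream data."""
    row_len = sum(widths)
    for row_ix in range(len(data) // row_len):
        row = data[row_ix * row_len:(row_ix + 1) * row_len]
        fields = []
        pos = 0
        for field_ix, width in enumerate(widths):
            if width == 0:
                # Absent field: the type defaults to 1, other fields to 0.
                fields.append(1 if field_ix == 0 else 0)
            else:
                # Fields are big-endian unsigned integers.
                fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        yield (first_id + row_ix, *fields)


# Three rows with widths [1, 2, 1]: a free entry, an object at offset 17,
# and an object stored as item 0 of object stream 12.
raw = bytes.fromhex("000000ff" "01001100" "02000c00")
rows = list(decode_xref_rows(raw, [1, 2, 1]))
```

How the second and third fields are interpreted depends on the type: for type 1 they are a byte offset and a generation number, for type 2 an object stream ID and an index within that stream.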

pyhanko.pdf_utils.xref.parse_xref_table(stream) → Iterator[XRefEntry]

Parse a single cross-reference table and yield its entries one by one.

This is internal API.

Parameters

stream – A file-like object positioned at the start of the cross-reference table.

Returns

A generator object yielding XRefEntry objects.
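The classic table format can be read line by line, as the following sketch shows. It is a lenient illustration under simplifying assumptions (well-formed input, no error recovery), not the library's parser; `read_xref_table` is a hypothetical name, and it yields plain tuples where the real function yields `XRefEntry` objects.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_xref_table(stream: BinaryIO) -> Iterator[Tuple[int, int, int, bool]]:
    """Yield (idnum, offset, generation, in_use) for each table entry."""
    if stream.readline().strip() != b"xref":
        raise ValueError("expected 'xref' keyword")
    while True:
        pos = stream.tell()
        header = stream.readline().split()
        if len(header) != 2 or not all(part.isdigit() for part in header):
            # Not a subsection header (e.g. 'trailer'): rewind and stop.
            stream.seek(pos)
            return
        first_id, count = int(header[0]), int(header[1])
        for ix in range(count):
            # Each entry reads: 10-digit offset, 5-digit generation, n or f.
            offset, generation, marker = stream.readline().split()
            yield first_id + ix, int(offset), int(generation), marker == b"n"


table = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
    b"<< /Size 3 >>\n"
)
source = io.BytesIO(table)
parsed = list(read_xref_table(source))
next_line = source.readline()
```

Rewinding before returning leaves the stream positioned at the `trailer` keyword, so a caller could go on to read the trailer dictionary from the same stream.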