pyhanko.pdf_utils.xref module
Internal utilities to handle the processing of cross-reference data and document trailer data.
This entire module is considered internal API.
- class pyhanko.pdf_utils.xref.XRefCache(reader, xref_sections: List[XRefSection])
Bases:
object
Internal class to parse & store information from the xref section(s) of a PDF document.
Stores both the most recent status of all xrefs and their historical values.
All members of this class are considered internal API and are subject to change without notice.
- property total_revisions
- object_streams_used_in(revision)
- get_xref_container_info(revision) → XRefSectionMetaInfo
- get_xref_data(revision) → XRefSectionData
- explicit_refs_in_revision(revision) → Set[Reference]
Look up the object refs for all objects explicitly added or overwritten in a given revision.
- Parameters
revision – A revision number. The oldest revision is zero.
- Returns
A set of Reference objects.
- refs_freed_in_revision(revision) → Set[Reference]
Look up the object refs for all objects explicitly freed in a given revision.
- Parameters
revision – A revision number. The oldest revision is zero.
- Returns
A set of Reference objects.
- get_startxref_for_revision(revision) → int
Look up the location of the XRef table/stream associated with a specific revision, as indicated by startxref or /Prev.
- Parameters
revision – A revision number. The oldest revision is zero.
- Returns
An integer pointer (a byte offset into the file).
- get_historical_ref(ref, revision) → Optional[Union[int, ObjStreamRef]]
Look up the location of the historical value of an object.
Note
This method is not suitable for determining whether or not a particular object ID is available in a given revision, since it treats unused objects and freed objects the same way.
- Parameters
ref – An object reference.
revision – A revision number. The oldest revision is zero.
- Returns
An integer offset, an object stream reference, or None if the reference does not resolve in the specified revision.
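To make the note above concrete, here is a minimal sketch of a historical lookup. It is not pyHanko's actual implementation. It assumes a hypothetical data model in which each revision is a plain dict from (idnum, generation) to a byte offset, with None marking a freed object. The names `historical_lookup` and `revisions` are invented for illustration, and object stream references are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in types: a reference is (idnum, generation), and a
# location is a byte offset, or None when the revision frees the object.
Ref = Tuple[int, int]
RevisionTable = Dict[Ref, Optional[int]]


def historical_lookup(
    revisions: List[RevisionTable], ref: Ref, revision: int
) -> Optional[int]:
    """Return the location of `ref` as seen from `revision` (0 = oldest)."""
    # Walk from the requested revision back towards the oldest one and
    # stop at the first revision that mentions the reference at all.
    for rev in range(revision, -1, -1):
        table = revisions[rev]
        if ref in table:
            return table[ref]  # None here means "freed in that revision"
    return None  # never defined up to and including this revision


revisions: List[RevisionTable] = [
    {(1, 0): 15, (2, 0): 120},  # revision 0: objects 1 and 2 created
    {(2, 0): 480},              # revision 1: object 2 overwritten
    {(1, 0): None},             # revision 2: object 1 freed
]

freed = historical_lookup(revisions, (1, 0), 2)
unused = historical_lookup(revisions, (3, 0), 2)
```

In this model both `freed` and `unused` are None, so the result alone cannot tell a freed object apart from one that was never defined.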
- property hybrid_xrefs_present: bool
Determine if a file uses hybrid references anywhere.
- Returns
True if hybrid references were detected, False otherwise.
- class pyhanko.pdf_utils.xref.XRefBuilder(handler: PdfHandler, stream, strict: bool, last_startxref: int)
Bases:
object
- err_limit = 10
- read_xrefs()
- class pyhanko.pdf_utils.xref.XRefType(value)
Bases:
Enum
Different types of cross-reference entries.
- FREE = 1
A freeing instruction.
- STANDARD = 2
A regular top-level object.
- IN_OBJ_STREAM = 3
An object that is part of an object stream.
- class pyhanko.pdf_utils.xref.XRefEntry(xref_type: XRefType, location: Optional[Union[int, ObjStreamRef]], idnum: int, generation: int = 0)
Bases:
object
Value type representing a single cross-reference entry.
- location: Optional[Union[int, ObjStreamRef]]
Location the cross-reference points to.
- idnum: int
The ID of the object being referenced.
- generation: int = 0
The generation number of the object being referenced.
- class pyhanko.pdf_utils.xref.ObjStreamRef(obj_stream_id: int, ix_in_stream: int)
Bases:
object
Identifies an object that’s part of an object stream.
- obj_stream_id: int
The ID number of the object stream (its generation number is presumed zero).
- ix_in_stream: int
The index of the object in the stream.
- exception pyhanko.pdf_utils.xref.ObjectHeaderReadError(msg: str, *args)
Bases:
PdfReadError
- class pyhanko.pdf_utils.xref.XRefSection(meta_info: XRefSectionMetaInfo, xref_data: XRefSectionData)
Bases:
object
Describes a cross-reference section and how it is serialised into the PDF file.
- meta_info: XRefSectionMetaInfo
Metadata about the cross-reference section.
- xref_data: XRefSectionData
A description of the actual object pointer definitions.
- class pyhanko.pdf_utils.xref.XRefSectionData
Bases:
object
Internal class for bookkeeping on a single cross-reference section, independently of the others.
- try_resolve(ref: Union[Reference, IndirectObject]) → Optional[Union[int, ObjStreamRef]]
- process_hybrid_entries(entries: Iterator[XRefEntry], xref_meta_info: XRefSectionMetaInfo, strict: bool)
- higher_generation_refs()
- class pyhanko.pdf_utils.xref.XRefSectionType(value)
Bases:
Enum
An enumeration.
- STANDARD = 1
- STREAM = 2
- HYBRID_MAIN = 3
- HYBRID_STREAM = 4
- class pyhanko.pdf_utils.xref.XRefSectionMetaInfo(xref_section_type: pyhanko.pdf_utils.xref.XRefSectionType, size: int, declared_startxref: int, start_location: int, end_location: int, stream_ref: Union[pyhanko.pdf_utils.generic.Reference, NoneType])
Bases:
object
- xref_section_type: XRefSectionType
The type of cross-reference section.
- size: int
The highest object ID in scope for this xref section.
- declared_startxref: int
Location pointed to by the startxref pointer in that revision.
- start_location: int
Actual start location of the xref data. This should be equal to declared_startxref, but in broken files that may not be the case.
- end_location: int
Location where the xref data ended.
- class pyhanko.pdf_utils.xref.TrailerDictionary
Bases:
PdfObject
The standard mandates that each trailer shall contain at least all keys used in the preceding trailer, even if unmodified. Of course, we cannot trust documents to actually follow this rule, so this class implements fallbacks.
- non_trailer_keys = {'/DecodeParms', '/Filter', '/Index', '/Length', '/Type', '/W', '/XRefStm'}
- add_trailer_revision(trailer_dict: DictionaryObject)
- raw_get(key, decrypt: EncryptedObjAccess = EncryptedObjAccess.TRANSPARENT, revision=None)
- flatten(revision=None) → DictionaryObject
- keys()
- items()
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler
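The fallback behaviour described for TrailerDictionary can be pictured with a small sketch. This is a simplified model under stated assumptions, not pyHanko's code: `TrailerChain` and its methods are hypothetical names, trailers are plain dicts with string values, and the filtering of `non_trailer_keys` is ignored.

```python
from typing import Any, Dict, List, Optional


class TrailerChain:
    """Hypothetical stand-in for a trailer that remembers every revision."""

    def __init__(self) -> None:
        self._revisions: List[Dict[str, Any]] = []  # oldest first

    def add_revision(self, trailer: Dict[str, Any]) -> None:
        self._revisions.append(dict(trailer))

    def get(self, key: str, revision: Optional[int] = None) -> Any:
        # Start at the requested (default: newest) revision and fall back
        # to older trailers when a key was dropped along the way.
        last = len(self._revisions) - 1 if revision is None else revision
        for rev in range(last, -1, -1):
            if key in self._revisions[rev]:
                return self._revisions[rev][key]
        raise KeyError(key)

    def flatten(self, revision: Optional[int] = None) -> Dict[str, Any]:
        last = len(self._revisions) - 1 if revision is None else revision
        merged: Dict[str, Any] = {}
        for trailer in self._revisions[: last + 1]:
            merged.update(trailer)  # newer revisions override older ones
        return merged


chain = TrailerChain()
chain.add_revision({"/Root": "1 0 R", "/Info": "2 0 R", "/Size": 10})
chain.add_revision({"/Root": "1 0 R", "/Size": 14})  # /Info was omitted

info = chain.get("/Info")
flat = chain.flatten()
```

Here the second trailer breaks the rule by omitting /Info, yet the lookup still finds the value from the first revision.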
- pyhanko.pdf_utils.xref.read_object_header(stream, strict)
- pyhanko.pdf_utils.xref.parse_xref_stream(xref_stream: StreamObject, strict: bool = True) → Iterator[XRefEntry]
Parse a single cross-reference stream and yield its entries one by one.
This is internal API.
- Parameters
xref_stream – A StreamObject.
strict – Boolean indicating whether we’re running in strict mode.
- Returns
A generator object yielding XRefEntry objects.
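For readers unfamiliar with the binary layout, the sketch below decodes the rows of a cross-reference stream in plain Python. It is an illustration only and does not reproduce `parse_xref_stream`: the function name `decode_xref_stream_rows` is hypothetical, the input is assumed to be already-decompressed stream data, and the `widths` and `index` arguments stand in for the /W and /Index entries of the stream dictionary.

```python
from typing import Iterator, List, Tuple


def decode_xref_stream_rows(
    data: bytes, widths: Tuple[int, int, int], index: List[int]
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (idnum, entry_type, field2, field3) from decoded stream data."""
    row_len = sum(widths)
    pos = 0
    # /Index holds pairs: first object number, number of entries.
    for first, count in zip(index[0::2], index[1::2]):
        for idnum in range(first, first + count):
            row = data[pos:pos + row_len]
            if len(row) < row_len:
                raise ValueError("xref stream data ended prematurely")
            pos += row_len
            fields = []
            offset = 0
            for width in widths:
                # Fields are big-endian; a zero-width field decodes to 0.
                fields.append(int.from_bytes(row[offset:offset + width], "big"))
                offset += width
            # A zero-width type field means "type 1" by default.
            entry_type = fields[0] if widths[0] else 1
            yield idnum, entry_type, fields[1], fields[2]


# Three rows with widths (1, 2, 1): a free entry, a regular object at
# offset 17, and object 2 stored at index 3 of object stream 5.
data = bytes.fromhex("000000ff" "01001100" "02000503")
rows = list(decode_xref_stream_rows(data, (1, 2, 1), [0, 3]))
```

The on-disk type numbers 0, 1 and 2 correspond conceptually to free, regular and in-object-stream entries. They are not the same as the numeric values of the XRefType enum members.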
- pyhanko.pdf_utils.xref.parse_xref_table(stream) → Iterator[XRefEntry]
Parse a single cross-reference table and yield its entries one by one.
This is internal API.
- Parameters
stream – A file-like object positioned at the start of the cross-reference table.
- Returns
A generator object yielding XRefEntry objects.
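As a rough illustration of what parsing a classical cross-reference table involves, here is a minimal standalone sketch. It is not the code behind `parse_xref_table`: the name `iter_xref_table` is hypothetical, it yields plain tuples rather than XRefEntry objects, and it assumes well-formed entries of exactly 20 bytes, whereas a real parser has to tolerate malformed files.

```python
import io
import re
from typing import BinaryIO, Iterator, Tuple

# One entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
ENTRY_RE = re.compile(rb"(\d{10}) (\d{5}) ([nf])")


def iter_xref_table(stream: BinaryIO) -> Iterator[Tuple[int, int, int, bool]]:
    """Yield (idnum, generation, offset, in_use) for each table entry."""
    if stream.readline().strip() != b"xref":
        raise ValueError("expected 'xref' keyword")
    while True:
        header = stream.readline().split()
        # Subsection headers are two integers; anything else (such as
        # the 'trailer' keyword or end of file) ends the table.
        if len(header) != 2 or not all(tok.isdigit() for tok in header):
            break
        first, count = int(header[0]), int(header[1])
        for idnum in range(first, first + count):
            match = ENTRY_RE.match(stream.read(20))
            if match is None:
                raise ValueError(f"malformed entry for object {idnum}")
            offset, generation, kind = match.groups()
            yield idnum, int(generation), int(offset), kind == b"n"


sample = (
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
    b"trailer\n"
)
entries = list(iter_xref_table(io.BytesIO(sample)))
```

Like the documented function, the sketch expects the stream to be positioned at the `xref` keyword before it is called.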
- pyhanko.pdf_utils.xref.write_xref_table(stream, position_dict: Dict[Tuple[int, int], int])
- class pyhanko.pdf_utils.xref.ObjectStream(compress=True)
Bases:
object
Utility class to collect objects into a PDF object stream.
Object streams are mainly useful for space efficiency reasons. They allow related objects to be grouped & compressed together in a more flexible manner.
Warning
Object streams can only be used in files with a cross-reference stream, as opposed to a classical XRef table. In particular, this means that incremental updates to files with a legacy XRef table cannot contain object streams either. See § 7.5.7 in ISO 32000-1 for further details.
Danger
Use BasePdfFileWriter.prepare_object_stream() to create instances of object streams. The __init__ function is internal API.
- add_object(idnum: int, obj: PdfObject)
Add an object to an object stream. Note that objects in object streams always have their generation number set to 0 by definition.
- Parameters
idnum – The object’s ID number.
obj – The object to embed into the object stream.
- Raises
TypeError – Raised if obj is an instance of StreamObject or IndirectObject.
- as_pdf_object() → StreamObject
Render the object stream to a PDF stream object.
- Returns
An instance of StreamObject.
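To show what an object stream contains before compression, the sketch below assembles the payload by hand. It is an illustration under stated assumptions and not how the ObjectStream class is implemented: `build_object_stream_payload` is a hypothetical helper, the objects are assumed to be serialised to bytes already, and no filter is applied.

```python
from typing import List, Tuple


def build_object_stream_payload(
    objects: List[Tuple[int, bytes]],
) -> Tuple[bytes, int, int]:
    """Return (payload, n, first) for already-serialised objects.

    `n` and `first` are the values that would go into the /N and /First
    entries of the stream dictionary.
    """
    header_parts = []
    body = b""
    for idnum, serialised in objects:
        # Each header pair is "object number, offset"; the offset is
        # relative to the first object, i.e. to the position /First.
        header_parts.append(b"%d %d" % (idnum, len(body)))
        body += serialised + b"\n"
    header = b" ".join(header_parts) + b"\n"
    return header + body, len(objects), len(header)


payload, n, first = build_object_stream_payload(
    [(4, b"<< /Type /Catalog >>"), (7, b"[1 2 3]")]
)
```

Only object numbers appear in the header, which is consistent with objects in object streams always having generation 0.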
- class pyhanko.pdf_utils.xref.XRefStream(position_dict: Dict[Tuple[int, int], Union[int, Tuple[int, int]]])
Bases:
StreamObject
- write_to_stream(stream, handler=None, container_ref=None)
Abstract method to render this object to an output stream.
- Parameters
stream – An output stream.
container_ref – Local encryption key.
handler – Security handler