pyhanko.sign.diff_analysis package

Changed in version 0.2.0: Module extracted from pyhanko.sign.validation and restructured into a more rule-based format.

Changed in version 0.11.0: Module refactored into sub-package.

This package defines utilities for difference analysis between revisions of the same PDF file. PyHanko uses this functionality to validate signatures on files that have been modified after signing (using PDF’s incremental update feature).

In pyHanko’s validation model, every incremental update is disallowed by default. For a change to be accepted, it must be cleared by at least one whitelisting rule. These rules can moreover qualify the modification level at which they accept the change (see ModificationLevel). Additionally, any rule can veto an entire revision as suspect by raising a SuspiciousModification exception. Whitelisting rules are encouraged to apply their vetoes liberally.

Whitelisting rules are bundled in DiffPolicy objects for use by the validator.
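As a rough illustration of this model (not pyHanko's actual implementation), the sketch below uses hypothetical stand-ins throughout: plain strings for references, a Level enum that mirrors a few ModificationLevel values, a local SuspiciousModification class, and two made-up rules. The toy runner rejects a revision as soon as any changed reference is left uncleared, and otherwise reports the highest level that was needed. How pyHanko aggregates levels internally is assumed here.

```python
from enum import IntEnum


class Level(IntEnum):
    # Stand-in mirroring some of the documented ModificationLevel values.
    NONE = 0
    LTA_UPDATES = 1
    FORM_FILLING = 2


class SuspiciousModification(ValueError):
    """Local stand-in for pyHanko's exception of the same name."""


def dss_rule(changed):
    # Hypothetical rule: clears anything that looks like a DSS update.
    for ref in changed:
        if ref.startswith("dss"):
            yield Level.LTA_UPDATES, ref


def field_value_rule(changed):
    # Hypothetical rule: clears form field value updates.
    for ref in changed:
        if ref.startswith("field"):
            yield Level.FORM_FILLING, ref


def run_rules(changed, rules):
    """Return the overall level, or raise if a change is left uncleared."""
    cleared = {}
    for rule in rules:
        # Exhaust every rule; any of them may raise to veto the revision.
        for level, ref in rule(changed):
            # Keep the lowest level at which some rule cleared the reference.
            cleared[ref] = min(level, cleared.get(ref, level))
    uncleared = set(changed) - set(cleared)
    if uncleared:
        raise SuspiciousModification(f"uncleared: {sorted(uncleared)}")
    return max(cleared.values(), default=Level.NONE)
```

With both rules installed, a revision that only touches "dss1" comes out at LTA_UPDATES, while one that also touches "page3" is rejected because no rule clears that reference.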

Guidelines for developing rules for use with StandardDiffPolicy

Caution

These APIs aren’t fully stable yet, so some changes might still occur between now and the first major release.

In general, you should keep the following informal guidelines in mind when putting together custom diff rules.

  • All rules are either executed completely (i.e. their generators exhausted) or aborted.

  • If the diff runner aborts a rule, this always means that the entire revision is rejected. In other words, for accepted revisions, all rules will always have run to completion.

  • Whitelisting rules are allowed to informally delegate some checking to other rules, provided that this is documented clearly.

    Note

    Example: CatalogModificationRule ignores /AcroForm, which is validated by another rule entirely.

  • Rules should be entirely stateless. “Clearing” a reference by yielding it does not imply that the revision cannot be vetoed by that same rule further down the road (this is why the first point is important).
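The guidelines above can be pictured with a small stand-alone sketch. Everything in it is hypothetical (string references, the annotation_rule generator, the run_rule helper, and a local SuspiciousModification class); it only shows why a runner should exhaust a rule's generator before trusting anything the rule yielded.

```python
class SuspiciousModification(ValueError):
    """Local stand-in for pyHanko's exception of the same name."""


def annotation_rule(changed_refs):
    # Hypothetical rule: clears references one at a time, but can still
    # veto the whole revision after some references were already yielded.
    for ref in changed_refs:
        if ref == "forbidden":
            raise SuspiciousModification("unexpected object modified")
        yield ref


def run_rule(rule, changed_refs):
    """Exhaust the rule; report nothing as cleared if it vetoes."""
    try:
        # set() drives the generator to completion before anything is used.
        return set(rule(changed_refs))
    except SuspiciousModification:
        return None  # revision rejected, partial output discarded
```

In this sketch, run_rule(annotation_rule, ["a", "forbidden"]) returns None even though "a" was yielded before the veto.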

Submodules

pyhanko.sign.diff_analysis.commons module

Module defining common helpers for use by rules and policies.

In principle, these aren’t relevant to the high-level validation API.

pyhanko.sign.diff_analysis.commons.qualify(level: ModificationLevel, rule_result: Generator[RefToUpd, None, R]) Generator[Tuple[ModificationLevel, RefToUpd], None, R]

This is a helper function for rule implementors. It attaches a fixed modification level to an existing reference update generator, respecting the original generator’s return value (if relevant).

A prototypical use would be of the following form:

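from pyhanko.sign.diff_analysis.commons import qualify
from pyhanko.sign.diff_analysis.policy_api import ModificationLevel

def some_generator_function():
    # do stuff
    for ref in some_list:
        # do stuff
        yield ref

    # do more stuff
    return summary_value

# ...

def some_qualified_generator_function():
    # yields (level, ref) tuples and receives the inner return value
    summary_value = yield from qualify(
        ModificationLevel.FORM_FILLING,
        some_generator_function()
    )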

Provided that some_generator_function yields ReferenceUpdate objects, the yield type of the resulting generator will be tuples of the form (level, ref).

Parameters:
  • level – The modification level to set.

  • rule_result – A generator that outputs references to be whitelisted.

Returns:

A converted generator that outputs references qualified at the modification level specified.
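To make the return-value behaviour concrete, here is a minimal stand-alone model of what such a helper might look like. It is not pyHanko's implementation: qualify_sketch, count_refs and collect are hypothetical names, and levels and references are plain strings.

```python
def qualify_sketch(level, rule_result):
    # Drive the wrapped generator manually so that its return value, which
    # arrives via StopIteration, can be passed on to the caller.
    while True:
        try:
            ref = next(rule_result)
        except StopIteration as stop:
            return stop.value
        yield level, ref


def count_refs(refs):
    # Hypothetical rule body: yields each reference, returns how many it saw.
    count = 0
    for ref in refs:
        count += 1
        yield ref
    return count


def collect(gen):
    """Run a generator to completion, returning (items, return_value)."""
    items = []
    while True:
        try:
            items.append(next(gen))
        except StopIteration as stop:
            return items, stop.value
```

Calling collect(qualify_sketch("FORM_FILLING", count_refs(["a", "b"]))) gives the two qualified tuples together with the count returned by the inner generator.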

pyhanko.sign.diff_analysis.commons.qualify_transforming(level: ModificationLevel, rule_result: Generator[QualifyIn, None, R], transform: Callable[[QualifyIn], OutRefUpd]) Generator[Tuple[ModificationLevel, OutRefUpd], None, R]

This is a version of qualify() that additionally allows a transformation to be applied to the output of the rule.

Parameters:
  • level – The modification level to set.

  • rule_result – A generator that outputs references to be whitelisted.

  • transform – Function to apply to the reference object before appending the modification level and yielding it.

Returns:

A converted generator that outputs references qualified at the modification level specified.
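A simplified model of the transforming variant, under the same caveats: qualify_transforming_sketch, UpdateSketch and raw_refs are hypothetical stand-ins, and the real function operates on pyHanko's own reference types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateSketch:
    # Hypothetical stand-in for a reference update carrying extra metadata.
    updated_ref: str
    field_name: str


def qualify_transforming_sketch(level, rule_result, transform):
    # Same shape as a plain qualifying wrapper, except that each item is
    # passed through `transform` before the level is attached.
    while True:
        try:
            item = next(rule_result)
        except StopIteration as stop:
            return stop.value
        yield level, transform(item)


def raw_refs():
    # Hypothetical rule output: bare references without metadata.
    yield "12 0 R"
    yield "13 0 R"
```

A typical transform in this model would wrap each bare reference, for example lambda ref: UpdateSketch(ref, field_name="Sig1").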

pyhanko.sign.diff_analysis.commons.safe_whitelist(old: HistoricalResolver, old_ref, new_ref) Generator[Reference, None, None]

Checks whether an indirect reference in a PDF structure can be updated without clobbering an older object in a way that causes ramifications at the PDF syntax level.

The following are verified:

  • Does the old reference point to a non-stream object?

  • If the new reference is equal to the old one, does the new reference point to a non-stream object?

  • If the new reference is not equal to the old one, is the new reference a newly defined object?

This is a generator for syntactical convenience and integration with internal APIs, but it will always yield at most one element.
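The three checks can be modelled on toy data as follows. This is only a sketch under strong assumptions: revisions are represented as dictionaries mapping object numbers to the string 'stream' or 'dict', both references are always given, and safe_whitelist_sketch is a hypothetical name rather than the library function.

```python
class SuspiciousModification(ValueError):
    """Local stand-in for pyHanko's exception of the same name."""


def safe_whitelist_sketch(old_objects, new_objects, old_ref, new_ref):
    """Toy model: each mapping takes an object number to 'stream' or 'dict'."""
    # Check 1: the old reference must not point to a stream.
    if old_objects[old_ref] == "stream":
        raise SuspiciousModification("old reference points to a stream")
    if new_ref == old_ref:
        # Check 2: an object updated in place must not be a stream either.
        if new_objects[new_ref] == "stream":
            raise SuspiciousModification("new reference points to a stream")
    elif new_ref in old_objects:
        # Check 3: a different reference must not clobber an existing object.
        raise SuspiciousModification("reference was already in use")
    # At most one element is ever yielded.
    yield new_ref
```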

pyhanko.sign.diff_analysis.commons.compare_key_refs(key, old: HistoricalResolver, old_dict: DictionaryObject, new_dict: DictionaryObject) Generator[Reference, None, Tuple[PdfObject | None, PdfObject | None]]

Ensure that updating a key in a dictionary has no undesirable side effects. The following scenarios are allowed:

  1. replacing a direct value with another direct value

  2. adding a key in new_dict

  3. replacing a direct value in old_dict with a reference in new_dict

  4. the reverse (allowed by default)

  5. replacing a reference with another reference (that doesn’t override anything else)

The restrictions of safe_whitelist apply to this function as well.

Note: this routine is only safe to use if the structure of the resulting values is also checked. Otherwise, it can lead to reference leaks if one is not careful.

pyhanko.sign.diff_analysis.commons.compare_dicts(old_dict: PdfObject | None, new_dict: PdfObject | None, ignored: FrozenSet[str] = frozenset({}), raise_exc=True) bool

Compare entries in two dictionaries, optionally ignoring certain keys.
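The idea can be sketched with plain Python dictionaries. The function below is a hypothetical simplification: it assumes that raise_exc switches between raising SuspiciousModification and returning False on a mismatch, which the signature suggests but the description above does not spell out.

```python
class SuspiciousModification(ValueError):
    """Local stand-in for pyHanko's exception of the same name."""


def compare_dicts_sketch(old_dict, new_dict, ignored=frozenset(), raise_exc=True):
    # Drop the ignored keys from both sides before comparing.
    old_items = {k: v for k, v in old_dict.items() if k not in ignored}
    new_items = {k: v for k, v in new_dict.items() if k not in ignored}
    if old_items == new_items:
        return True
    if raise_exc:
        raise SuspiciousModification("dictionaries differ outside ignored keys")
    return False
```

For example, two catalog-like dictionaries that differ only in /AcroForm compare as equal when ignored=frozenset({"/AcroForm"}) is passed.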

pyhanko.sign.diff_analysis.commons.assert_not_stream(obj)

Raise SuspiciousModification if the argument is a stream object.

pyhanko.sign.diff_analysis.constants module

Internal constants for the difference analysis sub-package.

pyhanko.sign.diff_analysis.form_rules_api module

Module defining API types for use by form analysis rules.

In principle, these aren’t relevant to the high-level validation API.

class pyhanko.sign.diff_analysis.form_rules_api.FormUpdatingRule(field_rules: List[FieldMDPRule], ignored_acroform_keys=None)

Bases: object

Special whitelisting rule that validates changes to the form attached to the input document.

This rule is special in two ways:

  • it outputs FormUpdate objects instead of references;

  • it delegates most of the hard work to sub-rules (instances of FieldMDPRule).

A DiffPolicy can have at most one FormUpdatingRule, but there is no limit on the number of FieldMDPRule objects attached to it.

FormUpdate objects contain a reference plus metadata about the form field it belongs to.

Parameters:
  • field_rules – A list of FieldMDPRule objects to validate the individual form fields.

  • ignored_acroform_keys – Keys in the /AcroForm dictionary that may be changed. Changes are potentially subject to validation by other rules.

apply(old: HistoricalResolver, new: HistoricalResolver) Iterable[Tuple[ModificationLevel, FormUpdate]]

Evaluate changes in the document’s form between two revisions.

Parameters:
  • old – The older, base revision.

  • new – The newer revision to be vetted.

class pyhanko.sign.diff_analysis.form_rules_api.FormUpdate(updated_ref: Reference, context_checked: Context | None = None, field_name: str | None = None, valid_when_locked: bool = False, valid_when_certifying: bool = True)

Bases: ReferenceUpdate

Container for a reference together with (optional) metadata.

Currently, this metadata consists of the relevant field’s (fully qualified) name, and whether the update should be approved or not if said field is locked by the FieldMDP policy currently in force.

field_name: str | None = None

The relevant field’s fully qualified name, or None if there’s either no obvious associated field, or if there are multiple reasonable candidates.

valid_when_locked: bool = False

Flag indicating whether the update is valid even when the field is locked. This is only relevant if field_name is not None.

valid_when_certifying: bool = True

Flag indicating whether the update is valid when checking against an explicit DocMDP policy. Default is True. If False, the change will only be accepted if we are evaluating changes to a document after an approval signature.

class pyhanko.sign.diff_analysis.form_rules_api.FieldMDPRule

Bases: object

Sub-rules attached to a FormUpdatingRule.

apply(context: FieldComparisonContext) Iterable[Tuple[ModificationLevel, FormUpdate]]

Apply the rule to the given FieldComparisonContext.

Parameters:

context – The context of this form revision evaluation, given as an instance of FieldComparisonContext.

class pyhanko.sign.diff_analysis.form_rules_api.FieldComparisonSpec(field_type: str, old_field_ref: Reference | None, new_field_ref: Reference | None, old_canonical_path: RawPdfPath | None)

Bases: object

Helper object that specifies a form field name together with references to its old and new versions.

field_type: str

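The type of the form field (the fully qualified field name is the key under which this object is stored in FieldComparisonContext.field_specs).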

old_field_ref: Reference | None

A reference to the field’s dictionary in the old revision, if present.

new_field_ref: Reference | None

A reference to the field’s dictionary in the new revision, if present.

old_canonical_path: RawPdfPath | None

Path from the trailer through the AcroForm structure to this field (in the older revision). If the field is new, set to None.

property old_field: DictionaryObject | None

Returns:

The field’s dictionary in the old revision, if present, otherwise None.

property new_field: DictionaryObject | None

Returns:

The field’s dictionary in the new revision, if present, otherwise None.

expected_contexts() Set[Context]

class pyhanko.sign.diff_analysis.form_rules_api.FieldComparisonContext(field_specs: Dict[str, FieldComparisonSpec], old: HistoricalResolver, new: HistoricalResolver)

Bases: object

Context for a form diffing operation.

field_specs: Dict[str, FieldComparisonSpec]

Dictionary mapping field names to FieldComparisonSpec objects.

old: HistoricalResolver

The older, base revision.

new: HistoricalResolver

The newer revision.

pyhanko.sign.diff_analysis.policies module

Module defining pyHanko’s standard difference policy implementation.

class pyhanko.sign.diff_analysis.policies.StandardDiffPolicy(global_rules: List[QualifiedWhitelistRule], form_rule: FormUpdatingRule | None, reject_object_freeing=True, ignore_orphaned_objects=True, ignore_identical_objects=True)

Bases: DiffPolicy

Run a list of rules to analyse the differences between two revisions.

Parameters:
  • global_rules – The QualifiedWhitelistRule objects encoding the rules to apply.

  • form_rule – The FormUpdatingRule that adjudicates changes to form fields and their values.

  • reject_object_freeing

    Always fail revisions that free objects that existed prior to signing.

    Note

    PyHanko resolves freed references to the null object in PDF, and a freeing instruction in a cross-reference section is always registered as a change that needs to be approved, regardless of the value of this setting.

    It is theoretically possible for a rule to permit deleting content, in which case allowing objects to be freed might be reasonable. That said, pyHanko takes the conservative default position to reject all object freeing instructions as suspect.

  • ignore_orphaned_objects – Some PDF writers create objects that aren’t used anywhere (tsk tsk). Since those don’t affect the “actual” document content, they can usually be ignored. If True, newly created orphaned objects will be cleared at level ModificationLevel.LTA_UPDATES. Default is True.

  • ignore_identical_objects – Some PDF writers overwrite objects with identical copies. Pointless and annoying, but also more or less harmless. Default is True.

apply(old: HistoricalResolver, new: HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) DiffResult

Execute the policy on a pair of revisions, with the MDP values provided. SuspiciousModification exceptions should be propagated.

Parameters:
  • old – The older, base revision.

  • new – The newer revision.

  • field_mdp_spec – The field MDP spec that’s currently active.

  • doc_mdp – The DocMDP spec that’s currently active.

Returns:

A DiffResult object summarising the policy’s judgment.

review_file(reader: PdfFileReader, base_revision: int | HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) DiffResult | SuspiciousModification

Implementation of DiffPolicy.review_file() that reviews each intermediate revision between the base revision and the current one individually.
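A revision-by-revision review could be structured roughly like this. The sketch is a guess at the general shape only: ResultSketch, review_each_revision and toy_policy are hypothetical, revisions are plain dictionaries, and the way the per-step results are combined (highest level, union of changed fields) is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ResultSketch:
    # Hypothetical stand-in for a diff result, with integer levels.
    modification_level: int
    changed_form_fields: Set[str] = field(default_factory=set)


def review_each_revision(revisions, base_index, apply_policy):
    """Apply the policy to each consecutive pair of revisions after the base."""
    overall = ResultSketch(modification_level=0)
    pairs = zip(revisions[base_index:], revisions[base_index + 1:])
    for older, newer in pairs:
        # Any exception raised by apply_policy propagates to the caller.
        step = apply_policy(older, newer)
        overall.modification_level = max(
            overall.modification_level, step.modification_level
        )
        overall.changed_form_fields |= step.changed_form_fields
    return overall


def toy_policy(older, newer):
    # Toy policy: every changed or added key counts as a level-2 field change.
    changed = {k for k in newer if older.get(k) != newer[k]}
    return ResultSketch(2 if changed else 0, changed)
```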

pyhanko.sign.diff_analysis.policies.DEFAULT_DIFF_POLICY = <pyhanko.sign.diff_analysis.policies.StandardDiffPolicy object>

Default DiffPolicy implementation.

This policy includes the following rules, all with the default settings. The unqualified rules in the list all have their updates qualified at level LTA_UPDATES.

pyhanko.sign.diff_analysis.policies.NO_CHANGES_DIFF_POLICY = <pyhanko.sign.diff_analysis.policies.StandardDiffPolicy object>

DiffPolicy implementation that does not provide any rules, and will therefore simply reject all changes.

pyhanko.sign.diff_analysis.policy_api module

class pyhanko.sign.diff_analysis.policy_api.ModificationLevel(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: OrderedEnum

Records the (semantic) modification level of a document.

Compare MDPPerm, which records the document modification policy associated with a particular signature, as opposed to the empirical judgment indicated by this enum.

NONE = 0

The document was not modified at all (i.e. it is byte-for-byte unchanged).

LTA_UPDATES = 1

The only updates are of the type that would be allowed as part of signature long term archival (LTA) processing. That is to say, updates to the document security store or new document time stamps. For the purposes of evaluating whether a document has been modified in the sense defined in the PAdES and ISO 32000-2 standards, these updates do not count. Adding form fields is permissible at this level, but only if they are signature fields. This is necessary for proper document timestamp support.

FORM_FILLING = 2

The only updates are extra signatures and updates to form field values or their appearance streams, in addition to the previous levels.

ANNOTATIONS = 3

In addition to the previous levels, manipulating annotations is also allowed at this level.

Note

This level is currently unused by the default diff policy, and modifications to annotations other than those permitted to fill in forms are treated as suspicious.

OTHER = 4

The document has been modified in ways that aren’t on the validator’s whitelist. This always invalidates the corresponding signature, irrespective of cryptographic integrity or /DocMDP settings.
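Since ModificationLevel is an ordered enum, its members can be compared with one another. The sketch below imitates that with a standard IntEnum carrying the same names and values; LevelSketch and within_limit are hypothetical and not part of pyHanko.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    # Same names and values as documented for ModificationLevel.
    NONE = 0
    LTA_UPDATES = 1
    FORM_FILLING = 2
    ANNOTATIONS = 3
    OTHER = 4


def within_limit(observed: LevelSketch, limit: LevelSketch) -> bool:
    """Hypothetical helper: OTHER is never acceptable, whatever the limit."""
    return observed != LevelSketch.OTHER and observed <= limit
```

Taking max() over the levels observed for individual changes gives the level of the revision as a whole in this model.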

exception pyhanko.sign.diff_analysis.policy_api.SuspiciousModification

Bases: ValueError

Error indicating a suspicious modification.

class pyhanko.sign.diff_analysis.policy_api.DiffResult(modification_level: ModificationLevel, changed_form_fields: Set[str])

Bases: object

Encodes the result of a difference analysis on two revisions.

Returned by DiffPolicy.apply().

modification_level: ModificationLevel

The strictest modification level at which all changes pass muster.

changed_form_fields: Set[str]

Set containing the names of all changed form fields.

Note

For the purposes of this parameter, a change is defined as any FormUpdate where FormUpdate.valid_when_locked is False.
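The definition in the note can be expressed as a small filter. FormUpdateSketch is a hypothetical, stripped-down stand-in for FormUpdate with string references, and skipping updates without a field name is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class FormUpdateSketch:
    # Stand-in with a subset of the documented FormUpdate attributes.
    updated_ref: str
    field_name: Optional[str] = None
    valid_when_locked: bool = False


def changed_form_fields(updates: Iterable[FormUpdateSketch]) -> Set[str]:
    # A field counts as changed if some update to it would not be valid
    # when the field is locked.
    return {
        upd.field_name
        for upd in updates
        if upd.field_name is not None and not upd.valid_when_locked
    }
```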

class pyhanko.sign.diff_analysis.policy_api.DiffPolicy

Bases: object

Analyse the differences between two revisions.

apply(old: HistoricalResolver, new: HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) DiffResult

Execute the policy on a pair of revisions, with the MDP values provided. SuspiciousModification exceptions should be propagated.

Parameters:
  • old – The older, base revision.

  • new – The newer revision.

  • field_mdp_spec – The field MDP spec that’s currently active.

  • doc_mdp – The DocMDP spec that’s currently active.

Returns:

A DiffResult object summarising the policy’s judgment.

review_file(reader: PdfFileReader, base_revision: int | HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) DiffResult | SuspiciousModification

Compare the current state of a file to an earlier version, with the MDP values provided. SuspiciousModification exceptions should be propagated.

If there are multiple revisions between the base revision and the current one, the precise manner in which the review is conducted is left up to the implementing class. In particular, subclasses may choose to review each intermediate revision individually, or handle them all at once.

Parameters:
  • reader – PDF reader representing the current state of the file.

  • base_revision – The older, base revision. You can choose between providing it as a revision index, or a HistoricalResolver instance.

  • field_mdp_spec – The field MDP spec that’s currently active.

  • doc_mdp – The DocMDP spec that’s currently active.

Returns:

A DiffResult object summarising the policy’s judgment.

pyhanko.sign.diff_analysis.rules_api module

Module defining common API types for use by rules and policies.

In principle, these aren’t relevant to the high-level validation API.

class pyhanko.sign.diff_analysis.rules_api.QualifiedWhitelistRule

Bases: object

Abstract base class for a whitelisting rule that outputs references together with the modification level at which they’re cleared.

This is intended for use by complicated whitelisting rules that need to differentiate between multiple levels.

apply_qualified(old: HistoricalResolver, new: HistoricalResolver) Iterable[Tuple[ModificationLevel, ReferenceUpdate]]

Apply the rule to the changes between two revisions.

Parameters:
  • old – The older, base revision.

  • new – The newer revision to be vetted.

class pyhanko.sign.diff_analysis.rules_api.WhitelistRule

Bases: object

Abstract base class for a whitelisting rule that simply outputs cleared references without specifying a modification level.

These rules are more flexible than rules of type QualifiedWhitelistRule, since the modification level can be specified separately (see WhitelistRule.as_qualified()).

apply(old: HistoricalResolver, new: HistoricalResolver) Iterable[ReferenceUpdate]

Apply the rule to the changes between two revisions.

Parameters:
  • old – The older, base revision.

  • new – The newer revision to be vetted.

as_qualified(level: ModificationLevel) QualifiedWhitelistRule

Construct a new QualifiedWhitelistRule that whitelists the object references from this rule at the level specified.

Parameters:

level – The modification level at which the output of this rule should be cleared.

Returns:

A QualifiedWhitelistRule backed by this rule.
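The relationship between the two rule types can be sketched as a thin wrapper. All class names below are hypothetical stand-ins, revisions are modelled as sets of object numbers, and levels are plain strings; the sketch shows one way such a wrapper could work, not how pyHanko implements it.

```python
class WhitelistRuleSketch:
    def apply(self, old, new):
        raise NotImplementedError

    def as_qualified(self, level):
        # Wrap this rule so that everything it clears carries a fixed level.
        return _QualifiedSketch(self, level)


class _QualifiedSketch:
    def __init__(self, rule, level):
        self.rule = rule
        self.level = level

    def apply_qualified(self, old, new):
        for ref in self.rule.apply(old, new):
            yield self.level, ref


class NewObjectsRule(WhitelistRuleSketch):
    def apply(self, old, new):
        # Toy rule: clears every object number that only exists in `new`.
        yield from sorted(set(new) - set(old))
```

The same unqualified rule can then be reused at different levels simply by calling as_qualified() with a different argument.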

class pyhanko.sign.diff_analysis.rules_api.ReferenceUpdate(updated_ref: pyhanko.pdf_utils.generic.Reference, context_checked: pyhanko.sign.diff_analysis.rules_api.Context | None = None)

Bases: object

updated_ref: Reference

Reference that was (potentially) updated.

context_checked: Context | None = None

classmethod curry_ref(**kwargs) Callable[[Reference], RefUpdateType]

property approval_type: ApprovalType

class pyhanko.sign.diff_analysis.rules_api.Context

Bases: object

classmethod from_absolute(pdf_handler: PdfHandler, absolute_path: RawPdfPath) AbsoluteContext

classmethod relative_to(start: DictionaryObject | ArrayObject | TrailerDictionary, path: RawPdfPath | int | str) RelativeContext

descend(path: RawPdfPath | int | str) Context

class pyhanko.sign.diff_analysis.rules_api.RelativeContext(anchor: pyhanko.pdf_utils.generic.Dereferenceable, relative_path: pyhanko.pdf_utils.reader.RawPdfPath)

Bases: Context

anchor: Dereferenceable

Reference to the container object. In comparisons, this should be the reference tied to the older revision.

relative_path: RawPdfPath

Path to the object from the container.

descend(path: RawPdfPath | int | str) RelativeContext

class pyhanko.sign.diff_analysis.rules_api.AbsoluteContext(path: pyhanko.pdf_utils.reader.RawPdfPath, pdf_handler: pyhanko.pdf_utils.rw_common.PdfHandler)

Bases: Context

path: RawPdfPath

Absolute path from the trailer.

pdf_handler: PdfHandler

The PDF handler to which this context is tied.

property relative_view: RelativeContext

descend(path: RawPdfPath | int | str) AbsoluteContext