pyhanko.sign.diff_analysis package
Changed in version 0.2.0: Module extracted from pyhanko.sign.validation
and restructured into a more rule-based format.
Changed in version 0.11.0: Module refactored into sub-package.
This package defines utilities for difference analysis between revisions of the same PDF file. PyHanko uses this functionality to validate signatures on files that have been modified after signing (using PDF’s incremental update feature).
In pyHanko’s validation model, every incremental update is disallowed by
default. For a change to be accepted, it must be cleared by at least one
whitelisting rule.
These rules can moreover qualify the modification level at which they accept
the change (see ModificationLevel).
Additionally, any rule can veto an entire revision as suspect by raising
a SuspiciousModification exception.
Whitelisting rules are encouraged to apply their vetoes liberally.
Whitelisting rules are bundled in DiffPolicy objects for use by the
validator.
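The default-deny model described above can be pictured with a small standalone sketch. It does not import pyHanko: `Level`, `Veto`, `run_rules`, the two toy rules and the integer object IDs are all hypothetical stand-ins chosen for illustration, and the way levels are combined here is an assumption rather than a description of the library's internals.

```python
from enum import IntEnum


class Level(IntEnum):
    """Simplified stand-in for ModificationLevel (hypothetical)."""
    NONE = 0
    LTA_UPDATES = 1
    FORM_FILLING = 2
    OTHER = 4


class Veto(ValueError):
    """Stand-in for SuspiciousModification."""


def run_rules(changed_refs, rules):
    """Return the level needed to accept all changes, or raise Veto."""
    cleared = {}
    for rule in rules:
        for level, ref in rule(changed_refs):
            # Assumption: keep the lowest level at which a ref was cleared.
            cleared[ref] = min(level, cleared.get(ref, level))
    uncleared = set(changed_refs) - set(cleared)
    if uncleared:
        # Nothing is accepted unless at least one rule whitelists it.
        raise Veto(f"unexplained changes: {sorted(uncleared)}")
    return max(cleared.values(), default=Level.NONE)


def timestamp_rule(changed):
    # Toy rule: clears object 10 as an LTA-style update.
    if 10 in changed:
        yield Level.LTA_UPDATES, 10


def form_rule(changed):
    # Toy rule: clears object 20, but vetoes any change to object 99.
    if 20 in changed:
        yield Level.FORM_FILLING, 20
    if 99 in changed:
        raise Veto("object 99 must never change")


rules = [timestamp_rule, form_rule]
accepted_level = run_rules({10, 20}, rules)
```

In this sketch, a revision touching objects 10 and 20 is accepted at the form-filling level, while a revision touching an object that no rule mentions is rejected outright.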
Guidelines for developing rules for use with StandardDiffPolicy
Caution
These APIs aren’t fully stable yet, so some changes might still occur between now and the first major release.
In general, you should keep the following informal guidelines in mind when putting together custom diff rules.
All rules are either executed completely (i.e. their generators exhausted) or aborted.
If the diff runner aborts a rule, this always means that the entire revision is rejected. In other words, for accepted revisions, all rules will always have run to completion.
Whitelisting rules are allowed to informally delegate some checking to other rules, provided that this is documented clearly.
Note
Example: CatalogModificationRule ignores /AcroForm, which is validated by another rule entirely.
Rules should be entirely stateless. “Clearing” a reference by yielding it does not imply that the revision cannot be vetoed by that same rule further down the road (this is why the first point is important).
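The interaction between the first and last guidelines can be illustrated with a toy runner. This is a minimal sketch under stated assumptions: `Veto`, `late_veto_rule` and `review` are hypothetical names, references are plain integers, and nothing here reflects how pyHanko's diff runner is actually written.

```python
class Veto(ValueError):
    """Stand-in for SuspiciousModification."""


def late_veto_rule(changed_refs):
    """Toy rule: yields small refs as cleared, then vetoes on a large one."""
    for ref in sorted(changed_refs):
        if ref > 100:
            # Earlier refs were already yielded, yet the rule still vetoes.
            raise Veto(f"unexpected reference {ref}")
        yield ref


def review(changed_refs, rules):
    """Accept only if every rule's generator runs to completion."""
    cleared = set()
    try:
        for rule in rules:
            # Exhaust the generator before trusting anything it yielded.
            cleared |= set(rule(changed_refs))
    except Veto:
        return None  # the whole revision is rejected
    return cleared


accepted = review({1, 2}, [late_veto_rule])
rejected = review({1, 2, 500}, [late_veto_rule])
```

Here the second call yields references 1 and 2 before the veto is raised, and the revision is still rejected as a whole.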
Subpackages
- pyhanko.sign.diff_analysis.rules package
- Submodules
- pyhanko.sign.diff_analysis.rules.file_structure_rules module
- pyhanko.sign.diff_analysis.rules.form_field_rules module
- pyhanko.sign.diff_analysis.rules.metadata_rules module
- Module contents
Submodules
pyhanko.sign.diff_analysis.commons module
Module defining common helpers for use by rules and policies.
In principle, these aren’t relevant to the high-level validation API.
- pyhanko.sign.diff_analysis.commons.qualify(level: ModificationLevel, rule_result: Generator[RefToUpd, None, R]) → Generator[Tuple[ModificationLevel, RefToUpd], None, R]
This is a helper function for rule implementors. It attaches a fixed modification level to an existing reference update generator, respecting the original generator’s return value (if relevant).
A prototypical use would be of the following form:
    def some_generator_function():
        # do stuff
        for ref in some_list:
            # do stuff
            yield ref
        # do more stuff
        return summary_value

    # ...

    def some_qualified_generator_function():
        summary_value = yield from qualify(
            ModificationLevel.FORM_FILLING,
            some_generator_function()
        )
Provided that some_generator_function yields ReferenceUpdate objects, the yield type of the resulting generator will be tuples of the form (level, ref).
- Parameters:
level – The modification level to set.
rule_result – A generator that outputs references to be whitelisted.
- Returns:
A converted generator that outputs references qualified at the modification level specified.
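To make the return-value behaviour concrete, here is a minimal standalone sketch. `qualify_sketch`, `some_rule`, `qualified_rule` and `drain` are hypothetical stand-ins written for this illustration, and levels and references are plain strings; the sketch does not call pyHanko's own qualify() and makes no claim about how it is implemented.

```python
def qualify_sketch(level, rule_result):
    """Pair each yielded reference with `level`; keep the return value."""
    while True:
        try:
            ref = next(rule_result)
        except StopIteration as stop:
            # Forward the wrapped generator's return value to the caller.
            return stop.value
        yield level, ref


def some_rule():
    yield "ref-1"
    yield "ref-2"
    return "summary"


def qualified_rule():
    summary = yield from qualify_sketch("FORM_FILLING", some_rule())
    # The summary survives the wrapping and is available here.
    return summary.upper()


def drain(gen):
    """Collect the yielded items and the final return value of a generator."""
    items = []
    while True:
        try:
            items.append(next(gen))
        except StopIteration as stop:
            return items, stop.value


items, result = drain(qualified_rule())
```

The point of the pattern is that the wrapping generator can be used with `yield from` exactly like the original one, with only the yield type changed.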
- pyhanko.sign.diff_analysis.commons.qualify_transforming(level: ModificationLevel, rule_result: Generator[QualifyIn, None, R], transform: Callable[[QualifyIn], OutRefUpd]) → Generator[Tuple[ModificationLevel, OutRefUpd], None, R]
This is a version of qualify() that additionally allows a transformation to be applied to the output of the rule.
- Parameters:
level – The modification level to set.
rule_result – A generator that outputs references to be whitelisted.
transform – Function to apply to the reference object before appending the modification level and yielding it.
- Returns:
A converted generator that outputs references qualified at the modification level specified.
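One plausible use of the transform argument is to wrap bare references in a richer update object before the level is attached. The sketch below assumes exactly that and nothing more: `FieldUpdate`, `qualify_transforming_sketch` and `raw_refs` are hypothetical stand-ins, and the code does not import or reproduce pyHanko's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldUpdate:
    """Hypothetical stand-in for a richer update object such as FormUpdate."""
    ref: int
    field_name: Optional[str] = None


def qualify_transforming_sketch(level, rule_result, transform):
    """Pair transform(item) with `level`; keep the generator's return value."""
    while True:
        try:
            item = next(rule_result)
        except StopIteration as stop:
            return stop.value
        yield level, transform(item)


def raw_refs():
    # Toy rule output: bare object IDs.
    yield 7
    yield 8


qualified = list(
    qualify_transforming_sketch(
        "FORM_FILLING",
        raw_refs(),
        transform=lambda ref: FieldUpdate(ref, field_name="Sig1"),
    )
)
```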
- pyhanko.sign.diff_analysis.commons.safe_whitelist(old: HistoricalResolver, old_ref, new_ref) → Generator[Reference, None, None]
Checks whether an indirect reference in a PDF structure can be updated without clobbering an older object in a way that causes ramifications at the PDF syntax level.
The following are verified:
Does the old reference point to a non-stream object?
If the new reference is equal to the old one, does the new reference point to a non-stream object?
If the new reference is not equal to the old one, is the new reference a newly defined object?
This is a generator for syntactical convenience and integration with internal APIs, but it will always yield at most one element.
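The three checks listed above can be sketched as a small decision procedure. This is an illustration only: the revisions are modelled as plain dictionaries mapping object IDs to "stream" or "plain", and `Veto` and `safe_whitelist_sketch` are hypothetical names, so the sketch should not be read as the library's implementation.

```python
class Veto(ValueError):
    """Stand-in for SuspiciousModification."""


def safe_whitelist_sketch(old_objects, new_objects, old_ref, new_ref):
    """Yield new_ref if the three checks pass, otherwise raise Veto.

    old_objects / new_objects map object IDs to "stream" or "plain",
    a hypothetical stand-in for resolving references in each revision.
    """
    # 1. The old reference must not point to a stream.
    if old_ref is not None and old_objects.get(old_ref) == "stream":
        raise Veto("old reference points to a stream")
    if new_ref == old_ref:
        # 2. Same reference: the updated object must not be a stream either.
        if new_objects.get(new_ref) == "stream":
            raise Veto("new reference points to a stream")
    elif new_ref in old_objects:
        # 3. Different reference: it must not clobber a pre-existing object.
        raise Veto("new reference overrides an older object")
    yield new_ref


old_rev = {1: "plain", 2: "stream", 3: "plain"}
new_rev = {1: "plain", 2: "stream", 3: "plain", 4: "plain"}

same_ref = list(safe_whitelist_sketch(old_rev, new_rev, 1, 1))
fresh_ref = list(safe_whitelist_sketch(old_rev, new_rev, 1, 4))
```

Repointing from object 1 to the pre-existing object 3 would raise in this sketch, as would any update where the old reference is the stream object 2.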
- pyhanko.sign.diff_analysis.commons.compare_key_refs(key, old: HistoricalResolver, old_dict: DictionaryObject, new_dict: DictionaryObject) → Generator[Reference, None, Tuple[PdfObject | None, PdfObject | None]]
Ensure that updating a key in a dictionary has no undesirable side effects. The following scenarios are allowed:
replacing a direct value with another direct value
adding a key in new_dict
replacing a direct value in old_dict with a reference in new_dict
the reverse (allowed by default)
replacing a reference with another reference (that doesn’t override anything else)
The restrictions of safe_whitelist apply to this function as well.
Note: this routine is only safe to use if the structure of the resulting values is also checked. Otherwise, it can lead to reference leaks if one is not careful.
- pyhanko.sign.diff_analysis.commons.compare_dicts(old_dict: PdfObject | None, new_dict: PdfObject | None, ignored: FrozenSet[str] = frozenset({}), raise_exc=True) → bool
Compare entries in two dictionaries, optionally ignoring certain keys.
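As a rough illustration of what such a comparison involves, the sketch below works on plain Python dictionaries. `compare_dicts_sketch` and `Veto` are hypothetical names, and the assumed meaning of `raise_exc` (raise on a mismatch when true, return False otherwise) is an inference from the signature, not something this page states.

```python
class Veto(ValueError):
    """Stand-in for SuspiciousModification."""


def compare_dicts_sketch(old_dict, new_dict, ignored=frozenset(), raise_exc=True):
    """Return True if both dicts agree on every key outside `ignored`."""
    old_keys = set(old_dict) - ignored
    new_keys = set(new_dict) - ignored
    # Keys present on one side only, plus shared keys whose values differ.
    differing = (old_keys ^ new_keys) | {
        key for key in old_keys & new_keys if old_dict[key] != new_dict[key]
    }
    if differing and raise_exc:
        # Assumed behaviour: a mismatch is reported as a suspicious change.
        raise Veto(f"unexpected changes in keys: {sorted(differing)}")
    return not differing


old_catalog = {"/Type": "/Catalog", "/Pages": 2, "/AcroForm": 5}
new_catalog = {"/Type": "/Catalog", "/Pages": 2, "/AcroForm": 9}

same_when_ignored = compare_dicts_sketch(
    old_catalog, new_catalog, ignored=frozenset({"/AcroForm"})
)
same_without_ignoring = compare_dicts_sketch(
    old_catalog, new_catalog, raise_exc=False
)
```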
- pyhanko.sign.diff_analysis.commons.assert_not_stream(obj)
Throw SuspiciousModification if the argument is a stream object.
pyhanko.sign.diff_analysis.constants module
Internal constants for the difference analysis sub-package.
pyhanko.sign.diff_analysis.form_rules_api module
Module defining API types for use by form analysis rules.
In principle, these aren’t relevant to the high-level validation API.
- class pyhanko.sign.diff_analysis.form_rules_api.FormUpdatingRule(field_rules: List[FieldMDPRule], ignored_acroform_keys=None)
Bases:
object
Special whitelisting rule that validates changes to the form attached to the input document.
This rule is special in two ways:
it outputs FormUpdate objects instead of references;
it delegates most of the hard work to sub-rules (instances of FieldMDPRule).
A DiffPolicy can have at most one FormUpdatingRule, but there is no limit on the number of FieldMDPRule objects attached to it.
FormUpdate objects contain a reference plus metadata about the form field it belongs to.
- Parameters:
field_rules – A list of FieldMDPRule objects to validate the individual form fields.
ignored_acroform_keys – Keys in the /AcroForm dictionary that may be changed. Changes are potentially subject to validation by other rules.
- apply(old: HistoricalResolver, new: HistoricalResolver) → Iterable[Tuple[ModificationLevel, FormUpdate]]
Evaluate changes in the document’s form between two revisions.
- Parameters:
old – The older, base revision.
new – The newer revision to be vetted.
- class pyhanko.sign.diff_analysis.form_rules_api.FormUpdate(updated_ref: Reference, context_checked: Context | None = None, field_name: str | None = None, valid_when_locked: bool = False, valid_when_certifying: bool = True)
Bases:
ReferenceUpdate
Container for a reference together with (optional) metadata.
Currently, this metadata consists of the relevant field’s (fully qualified) name, and whether the update should be approved or not if said field is locked by the FieldMDP policy currently in force.
- field_name: str | None = None
The relevant field’s fully qualified name, or None if there’s either no obvious associated field, or if there are multiple reasonable candidates.
- valid_when_locked: bool = False
Flag indicating whether the update is valid even when the field is locked. This is only relevant if field_name is not None.
- valid_when_certifying: bool = True
Flag indicating whether the update is valid when checking against an explicit DocMDP policy. Default is True. If False, the change will only be accepted if we are evaluating changes to a document after an approval signature.
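The way these two flags might gate acceptance can be sketched as follows. `FormUpdateSketch` only mirrors the documented attribute names and defaults, and `is_acceptable` with its `locked_fields` and `explicit_doc_mdp` parameters is a hypothetical helper; how pyHanko actually combines the flags with the FieldMDP and DocMDP policies is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormUpdateSketch:
    """Stand-in mirroring the documented FormUpdate metadata fields."""
    updated_ref: int
    field_name: Optional[str] = None
    valid_when_locked: bool = False
    valid_when_certifying: bool = True


def is_acceptable(update, locked_fields, explicit_doc_mdp):
    """Hypothetical gate combining the two flags."""
    if explicit_doc_mdp and not update.valid_when_certifying:
        # Only acceptable after an approval signature (no explicit DocMDP).
        return False
    if update.field_name is None:
        # Without an associated field, the lock flag is not relevant.
        return True
    if update.field_name in locked_fields:
        return update.valid_when_locked
    return True


locked = {"Name"}
plain_update = FormUpdateSketch(1, "Name")
lock_safe_update = FormUpdateSketch(2, "Name", valid_when_locked=True)
approval_only_update = FormUpdateSketch(3, valid_when_certifying=False)
```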
- class pyhanko.sign.diff_analysis.form_rules_api.FieldMDPRule
Bases:
object
Sub-rules attached to a FormUpdatingRule.
- apply(context: FieldComparisonContext) → Iterable[Tuple[ModificationLevel, FormUpdate]]
Apply the rule to the given FieldComparisonContext.
- Parameters:
context – The context of this form revision evaluation, given as an instance of FieldComparisonContext.
- class pyhanko.sign.diff_analysis.form_rules_api.FieldComparisonSpec(field_type: str, old_field_ref: Reference | None, new_field_ref: Reference | None, old_canonical_path: RawPdfPath | None)
Bases:
object
Helper object that specifies a form field name together with references to its old and new versions.
- field_type: str
The form field’s type (for example, /Sig for a signature field).
- old_field_ref: Reference | None
A reference to the field’s dictionary in the old revision, if present.
- new_field_ref: Reference | None
A reference to the field’s dictionary in the new revision, if present.
- old_canonical_path: RawPdfPath | None
Path from the trailer through the AcroForm structure to this field (in the older revision). If the field is new, set to None.
- property old_field: DictionaryObject | None
- Returns:
The field’s dictionary in the old revision, if present, otherwise None.
- property new_field: DictionaryObject | None
- Returns:
The field’s dictionary in the new revision, if present, otherwise None.
- class pyhanko.sign.diff_analysis.form_rules_api.FieldComparisonContext(field_specs: Dict[str, FieldComparisonSpec], old: HistoricalResolver, new: HistoricalResolver)
Bases:
object
Context for a form diffing operation.
- field_specs: Dict[str, FieldComparisonSpec]
Dictionary mapping field names to FieldComparisonSpec objects.
- old: HistoricalResolver
The older, base revision.
- new: HistoricalResolver
The newer revision.
pyhanko.sign.diff_analysis.policies module
Module defining pyHanko’s standard difference policy implementation.
- class pyhanko.sign.diff_analysis.policies.StandardDiffPolicy(global_rules: List[QualifiedWhitelistRule], form_rule: FormUpdatingRule | None, reject_object_freeing=True, ignore_orphaned_objects=True, ignore_identical_objects=True)
Bases:
DiffPolicy
Run a list of rules to analyse the differences between two revisions.
- Parameters:
global_rules – The QualifiedWhitelistRule objects encoding the rules to apply.
form_rule – The FormUpdatingRule that adjudicates changes to form fields and their values.
reject_object_freeing – Always fail revisions that free objects that existed prior to signing.
Note
PyHanko resolves freed references to the null object in PDF, and a freeing instruction in a cross-reference section is always registered as a change that needs to be approved, regardless of the value of this setting.
It is theoretically possible for a rule to permit deleting content, in which case allowing objects to be freed might be reasonable. That said, pyHanko takes the conservative default position to reject all object freeing instructions as suspect.
ignore_orphaned_objects – Some PDF writers create objects that aren’t used anywhere (tsk tsk). Since those don’t affect the “actual” document content, they can usually be ignored. If True, newly created orphaned objects will be cleared at level ModificationLevel.LTA_UPDATES. Default is True.
ignore_identical_objects – Some PDF writers overwrite objects with identical copies. Pointless and annoying, but also more or less harmless.
- apply(old: HistoricalResolver, new: HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) → DiffResult
Execute the policy on a pair of revisions, with the MDP values provided.
SuspiciousModification exceptions should be propagated.
- Parameters:
old – The older, base revision.
new – The newer revision.
field_mdp_spec – The field MDP spec that’s currently active.
doc_mdp – The DocMDP spec that’s currently active.
- Returns:
A DiffResult object summarising the policy’s judgment.
- review_file(reader: PdfFileReader, base_revision: int | HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) → DiffResult | SuspiciousModification
Implementation of DiffPolicy.review_file() that reviews each intermediate revision between the base revision and the current one individually.
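Reviewing intermediate revisions one at a time could look roughly like the loop below. This is a sketch under assumptions: `ResultSketch` stands in for DiffResult with integer levels, `review_revisions` and the revision labels are hypothetical, the aggregation (highest level, union of changed fields) is a guess at a sensible summary, and error handling for vetoed revisions is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class ResultSketch:
    """Stand-in for DiffResult, with integer levels for brevity."""
    modification_level: int
    changed_form_fields: Set[str] = field(default_factory=set)


def review_revisions(
    revisions: List[str],
    base_index: int,
    apply: Callable[[str, str], ResultSketch],
) -> ResultSketch:
    """Call `apply` on each consecutive pair after the base revision."""
    overall = ResultSketch(modification_level=0)
    old = revisions[base_index]
    for new in revisions[base_index + 1:]:
        step = apply(old, new)
        # Assumption: the summary level is the highest level of any step.
        overall.modification_level = max(
            overall.modification_level, step.modification_level
        )
        overall.changed_form_fields |= step.changed_form_fields
        old = new
    return overall


# Toy per-step verdicts, keyed by (old, new) revision labels.
steps = {
    ("rev0", "rev1"): ResultSketch(1),
    ("rev1", "rev2"): ResultSketch(2, {"Name"}),
}
summary = review_revisions(
    ["rev0", "rev1", "rev2"], 0, lambda old, new: steps[(old, new)]
)
```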
- pyhanko.sign.diff_analysis.policies.DEFAULT_DIFF_POLICY = <pyhanko.sign.diff_analysis.policies.StandardDiffPolicy object>
Default DiffPolicy implementation.
This policy includes the following rules, all with the default settings. The unqualified rules in the list all have their updates qualified at level LTA_UPDATES.
FormUpdatingRule, with the following field rules:
- pyhanko.sign.diff_analysis.policies.NO_CHANGES_DIFF_POLICY = <pyhanko.sign.diff_analysis.policies.StandardDiffPolicy object>
DiffPolicy implementation that does not provide any rules, and will therefore simply reject all changes.
pyhanko.sign.diff_analysis.policy_api module
- class pyhanko.sign.diff_analysis.policy_api.ModificationLevel(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
OrderedEnum
Records the (semantic) modification level of a document.
Compare MDPPerm, which records the document modification policy associated with a particular signature, as opposed to the empirical judgment indicated by this enum.
- NONE = 0
The document was not modified at all (i.e. it is byte-for-byte unchanged).
- LTA_UPDATES = 1
The only updates are of the type that would be allowed as part of signature long term archival (LTA) processing. That is to say, updates to the document security store or new document time stamps. For the purposes of evaluating whether a document has been modified in the sense defined in the PAdES and ISO 32000-2 standards, these updates do not count. Adding form fields is permissible at this level, but only if they are signature fields. This is necessary for proper document timestamp support.
- FORM_FILLING = 2
The only updates are extra signatures and updates to form field values or their appearance streams, in addition to the previous levels.
- ANNOTATIONS = 3
In addition to the previous levels, manipulating annotations is also allowed at this level.
Note
This level is currently unused by the default diff policy, and modifications to annotations other than those permitted to fill in forms are treated as suspicious.
- OTHER = 4
The document has been modified in ways that aren’t on the validator’s whitelist. This always invalidates the corresponding signature, irrespective of cryptographical integrity or /DocMDP settings.
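Because the levels are ordered, they can be compared and combined with ordinary comparison operators. The sketch below uses a hypothetical `LevelSketch` built on the standard library's IntEnum with the member values listed above; pyHanko's own enum derives from OrderedEnum, so this is an analogy, not the library class, and `within` is an illustrative helper.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    """Stand-in with the documented member values; IntEnum gives ordering."""
    NONE = 0
    LTA_UPDATES = 1
    FORM_FILLING = 2
    ANNOTATIONS = 3
    OTHER = 4


def within(observed: LevelSketch, permitted: LevelSketch) -> bool:
    """True if the observed changes stay within the permitted level."""
    return observed <= permitted


# A revision containing both LTA-style and form-filling updates is
# classified at the higher of the two levels.
observed = max(LevelSketch.LTA_UPDATES, LevelSketch.FORM_FILLING)
```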
- exception pyhanko.sign.diff_analysis.policy_api.SuspiciousModification
Bases:
ValueError
Error indicating a suspicious modification.
- class pyhanko.sign.diff_analysis.policy_api.DiffResult(modification_level: ModificationLevel, changed_form_fields: Set[str])
Bases:
object
Encodes the result of a difference analysis on two revisions.
Returned by DiffPolicy.apply().
- modification_level: ModificationLevel
The strictest modification level at which all changes pass muster.
- changed_form_fields: Set[str]
Set containing the names of all changed form fields.
Note
For the purposes of this parameter, a change is defined as any FormUpdate where FormUpdate.valid_when_locked is False.
- class pyhanko.sign.diff_analysis.policy_api.DiffPolicy
Bases:
object
Analyse the differences between two revisions.
- apply(old: HistoricalResolver, new: HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) → DiffResult
Execute the policy on a pair of revisions, with the MDP values provided.
SuspiciousModification exceptions should be propagated.
- Parameters:
old – The older, base revision.
new – The newer revision.
field_mdp_spec – The field MDP spec that’s currently active.
doc_mdp – The DocMDP spec that’s currently active.
- Returns:
A DiffResult object summarising the policy’s judgment.
- review_file(reader: PdfFileReader, base_revision: int | HistoricalResolver, field_mdp_spec: FieldMDPSpec | None = None, doc_mdp: MDPPerm | None = None) → DiffResult | SuspiciousModification
Compare the current state of a file to an earlier version, with the MDP values provided.
SuspiciousModification exceptions should be propagated.
If there are multiple revisions between the base revision and the current one, the precise manner in which the review is conducted is left up to the implementing class. In particular, subclasses may choose to review each intermediate revision individually, or handle them all at once.
- Parameters:
reader – PDF reader representing the current state of the file.
base_revision – The older, base revision. You can choose between providing it as a revision index, or a HistoricalResolver instance.
field_mdp_spec – The field MDP spec that’s currently active.
doc_mdp – The DocMDP spec that’s currently active.
- Returns:
A DiffResult object summarising the policy’s judgment.
pyhanko.sign.diff_analysis.rules_api module
Module defining common API types for use by rules and policies.
In principle, these aren’t relevant to the high-level validation API.
- class pyhanko.sign.diff_analysis.rules_api.QualifiedWhitelistRule
Bases:
object
Abstract base class for a whitelisting rule that outputs references together with the modification level at which they’re cleared.
This is intended for use by complicated whitelisting rules that need to differentiate between multiple levels.
- apply_qualified(old: HistoricalResolver, new: HistoricalResolver) → Iterable[Tuple[ModificationLevel, ReferenceUpdate]]
Apply the rule to the changes between two revisions.
- Parameters:
old – The older, base revision.
new – The newer revision to be vetted.
- class pyhanko.sign.diff_analysis.rules_api.WhitelistRule
Bases:
object
Abstract base class for a whitelisting rule that simply outputs cleared references without specifying a modification level.
These rules are more flexible than rules of type QualifiedWhitelistRule, since the modification level can be specified separately (see WhitelistRule.as_qualified()).
- apply(old: HistoricalResolver, new: HistoricalResolver) → Iterable[ReferenceUpdate]
Apply the rule to the changes between two revisions.
- Parameters:
old – The older, base revision.
new – The newer revision to be vetted.
- as_qualified(level: ModificationLevel) → QualifiedWhitelistRule
Construct a new QualifiedWhitelistRule that whitelists the object references from this rule at the level specified.
- Parameters:
level – The modification level at which the output of this rule should be cleared.
- Returns:
A QualifiedWhitelistRule backed by this rule.
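The relationship between the two rule types amounts to an adapter pattern, which the standalone sketch below tries to convey. All class names (`WhitelistRuleSketch`, `_QualifiedAdapter`, `NewObjectsRule`) are hypothetical, revisions are modelled as sets of object IDs, and the code does not subclass or reproduce pyHanko's base classes.

```python
class WhitelistRuleSketch:
    """Stand-in for an unqualified rule: yields cleared references only."""

    def apply(self, old, new):
        raise NotImplementedError

    def as_qualified(self, level):
        # Wrap this rule so that its output carries a fixed level.
        return _QualifiedAdapter(self, level)


class _QualifiedAdapter:
    """Hypothetical wrapper that tags each reference with a fixed level."""

    def __init__(self, rule, level):
        self.rule = rule
        self.level = level

    def apply_qualified(self, old, new):
        for ref in self.rule.apply(old, new):
            yield self.level, ref


class NewObjectsRule(WhitelistRuleSketch):
    """Toy rule: clears every object ID that is new in `new`."""

    def apply(self, old, new):
        yield from sorted(set(new) - set(old))


qualified = NewObjectsRule().as_qualified("LTA_UPDATES")
result = list(qualified.apply_qualified(old={1, 2}, new={1, 2, 5, 6}))
```

The same toy rule could be wrapped at a different level without touching its logic, which is the flexibility the class description refers to.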
- class pyhanko.sign.diff_analysis.rules_api.ReferenceUpdate(updated_ref: pyhanko.pdf_utils.generic.Reference, context_checked: pyhanko.sign.diff_analysis.rules_api.Context | None = None)
Bases:
object
- property approval_type: ApprovalType
- class pyhanko.sign.diff_analysis.rules_api.Context
Bases:
object
- classmethod from_absolute(pdf_handler: PdfHandler, absolute_path: RawPdfPath) → AbsoluteContext
- classmethod relative_to(start: DictionaryObject | ArrayObject | TrailerDictionary, path: RawPdfPath | int | str) → RelativeContext
- descend(path: RawPdfPath | int | str) → Context
- class pyhanko.sign.diff_analysis.rules_api.RelativeContext(anchor: pyhanko.pdf_utils.generic.Dereferenceable, relative_path: pyhanko.pdf_utils.reader.RawPdfPath)
Bases:
Context
- anchor: Dereferenceable
Reference to the container object. In comparisons, this should be the reference tied to the older revision.
- relative_path: RawPdfPath
Path to the object from the container.
- descend(path: RawPdfPath | int | str) → RelativeContext
- class pyhanko.sign.diff_analysis.rules_api.AbsoluteContext(path: pyhanko.pdf_utils.reader.RawPdfPath, pdf_handler: pyhanko.pdf_utils.rw_common.PdfHandler)
Bases:
Context
- path: RawPdfPath
Absolute path from the trailer.
- pdf_handler: PdfHandler
The PDF handler to which this context is tied.
- property relative_view: RelativeContext
- descend(path: RawPdfPath | int | str) → AbsoluteContext