The pdf-utils package

The pdf_utils package is the part of pyHanko that implements the logic for reading & writing PDF files.

Background and future perspectives

The core of the pdf_utils package is based on code from PyPDF2. I forked/vendored PyPDF2 because it was the Python PDF library that would be the easiest to adapt to the low-level needs of a digital signing tool like pyHanko.

The “inherited” parts mostly consist of the PDF parsing logic, filter implementations (though they’ve been heavily rewritten) and RC4 cryptography support. I stripped out most of the functionality that I considered “fluff” for the purposes of designing a digital signing tool, for several reasons:

  • When I started working on pyHanko, the PyPDF2 project was all but dead, its codebase was largely untested, and the internet was rife with complaints about all kinds of bugs. Removing code that I didn’t need served primarily as a way to reduce my maintenance burden, and to avoid attaching my name to potential bugs that I wasn’t willing to fix myself.

  • PyPDF2 included a lot of compatibility logic to deal with Python 2. I never had any interest in supporting Python versions prior to 3.7, so I ditched all that.

  • Stripping out unnecessary code left me with greater freedom to deviate from the PyPDF2 API where I considered it necessary to do so.

I may or may not split off the pdf_utils package into a fully-fledged Python PDF library at some point, but for now, it merely serves as pyHanko’s PDF toolbox. That said, if you need bare-bones access to PDF structures outside pyHanko’s digital signing context, you might find some use for it even in its current state.

This page is intended as a companion to the API reference for pyhanko.pdf_utils, rather than a detailed standalone guide.

Danger

For the reasons specified above, most of pyhanko.pdf_utils should be considered private API.

The internal data model for PDF objects isn’t particularly likely to change, but the text handling and layout code is rather primitive and immature, so I’m not willing to commit to freezing that API (yet).

Danger

There are a number of stream encoding schemes (or “filters”) that aren’t supported (yet), most notably the LZW compression scheme. Additionally, we don’t have support for all PNG predictors in the Flate decoder/encoder.

PDF object model

The pyhanko.pdf_utils.generic module maps PDF data structures to Python objects. PDF arrays, dictionaries and strings are largely interoperable with their native Python counterparts, and can (usually) be interfaced with in the same manner.
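
To make the idea of interoperability concrete, the standalone sketch below shows one way PDF data types can behave like native Python types: by subclassing the builtins and adding a serialisation method. The classes PdfName, PdfInt, PdfArray and PdfDict are hypothetical stand-ins written for this illustration only. They are not the classes in pyhanko.pdf_utils.generic, whose actual names and behaviour are described in the API reference.

```python
# Toy model only: these classes are illustrative stand-ins, NOT pyHanko's API.


class PdfName(str):
    """A PDF name such as /Type; behaves like a str."""

    def serialise(self) -> bytes:
        return self.encode("ascii")


class PdfInt(int):
    """A PDF integer; behaves like an int."""

    def serialise(self) -> bytes:
        return str(int(self)).encode("ascii")


class PdfArray(list):
    """A PDF array; behaves like a list."""

    def serialise(self) -> bytes:
        return b"[" + b" ".join(item.serialise() for item in self) + b"]"


class PdfDict(dict):
    """A PDF dictionary; behaves like a dict."""

    def serialise(self) -> bytes:
        entries = b" ".join(
            key.serialise() + b" " + value.serialise()
            for key, value in self.items()
        )
        return b"<< " + entries + b" >>"


page = PdfDict({
    PdfName("/Type"): PdfName("/Page"),
    PdfName("/MediaBox"): PdfArray(PdfInt(n) for n in (0, 0, 595, 842)),
})

# Ordinary Python idioms work because the types subclass the builtins:
# a plain str finds the PdfName key, and the array slices like a list.
width, height = page["/MediaBox"][2:]
serialised = page.serialise()
```

In this toy, page["/Type"] == "/Page" holds and the dictionary serialises to PDF syntax, while the calling code never has to leave familiar dict and list idioms.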

When dealing with indirect references, the package distinguishes between the following two kinds:

  • IndirectObject: this represents an indirect reference as embedded into another PDF object (e.g. a dictionary value given by an indirect object);

  • Reference: this class represents an indirect reference by itself, i.e. not as a PDF object.

This distinction is rarely relevant, but the fact that IndirectObject inherits from PdfObject means that it supports the container_ref API, which is meaningless for “bare” Reference objects.

As a general rule, use Reference whenever you’re using indirect objects as keys in a Python dictionary or collecting them into a set, but use IndirectObject if you’re writing indirect objects into PDF output.
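
The rule of thumb above can be illustrated with a standalone toy model. Ref and IndirectRef below are hypothetical stand-ins for Reference and IndirectObject, written only for this sketch; the real classes have different constructors and considerably more functionality (see the API reference).

```python
# Toy model only: Ref and IndirectRef are illustrative stand-ins,
# NOT pyHanko's Reference and IndirectObject classes.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ref:
    """A bare indirect reference: an immutable, hashable value."""
    idnum: int
    generation: int = 0


class IndirectRef:
    """An indirect reference embedded in another PDF object."""

    def __init__(self, ref: Ref, container_ref: Optional[Ref] = None):
        self.ref = ref
        # The object this reference is embedded in; this only makes
        # sense for references that actually live inside some object.
        self.container_ref = container_ref

    def serialise(self) -> bytes:
        # PDF syntax for an indirect reference: "<idnum> <generation> R"
        return b"%d %d R" % (self.ref.idnum, self.ref.generation)


# Bookkeeping: bare references as set members and dictionary keys.
seen = {Ref(4), Ref(7), Ref(4, 0)}  # Ref(4) and Ref(4, 0) are equal
labels = {Ref(4): "page", Ref(7): "font"}

# Output: wrap the bare reference when embedding it in another object.
page_ref = Ref(4)
embedded = IndirectRef(Ref(7), container_ref=page_ref)
line = b"/F1 " + embedded.serialise()
```

The bare value carries no context and therefore makes a clean dictionary key, whereas the embedded variant knows where it lives and how to write itself out.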

PDF content abstractions

The pyhanko.pdf_utils.content module provides a fairly bare-bones abstraction for handling content that “compiles down” to PDF graphics operators, namely the PdfContent class. Among other things, it takes care of some of the PDF resource management boilerplate. It also allows you to easily encapsulate content into form XObjects when necessary.

Below, we briefly go over the uses of PdfContent within the library itself. These also serve as a template for implementing your own PdfContent subclasses.
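
As a rough picture of what “compiling down” to graphics operators means, here is a standalone sketch of the pattern. The Content base class and the FilledRect subclass are hypothetical and only mimic the idea. The actual PdfContent interface (constructor arguments, resource handling, form XObject creation) is described in the API reference and differs from this toy. The operators themselves (q, rg, re, f, Q) are standard PDF graphics operators.

```python
# Toy model only: Content and FilledRect are illustrative stand-ins,
# NOT pyHanko's PdfContent class.
from typing import Dict, Tuple


class Content:
    """Something that renders to a PDF content stream."""

    def __init__(self) -> None:
        # Resource category -> {resource name -> description}; a real
        # implementation would hold PDF objects here instead of strings.
        self.resources: Dict[str, Dict[str, str]] = {}

    def render(self) -> bytes:
        raise NotImplementedError

    def as_form_xobject(
        self, bbox: Tuple[int, int, int, int]
    ) -> Tuple[dict, bytes]:
        """Package the rendered content as (stream dictionary, data)."""
        data = self.render()
        header = {
            "/Type": "/XObject",
            "/Subtype": "/Form",
            "/BBox": list(bbox),
            "/Resources": self.resources,
            "/Length": len(data),
        }
        return header, data


class FilledRect(Content):
    """A rectangle filled with a solid RGB colour."""

    def __init__(self, width: int, height: int,
                 rgb: Tuple[float, float, float]):
        super().__init__()
        self.width, self.height, self.rgb = width, height, rgb

    def render(self) -> bytes:
        r, g, b = self.rgb
        ops = [
            "q",                                    # save graphics state
            f"{r:g} {g:g} {b:g} rg",                # set RGB fill colour
            f"0 0 {self.width} {self.height} re",   # append rectangle path
            "f",                                    # fill the path
            "Q",                                    # restore graphics state
        ]
        return " ".join(ops).encode("ascii")


rect = FilledRect(100, 50, (1, 0, 0))
stream = rect.render()
header, data = rect.as_form_xobject((0, 0, 100, 50))
```

A subclass only has to say how it renders; wrapping the result so it can be reused as a form XObject is handled once, in the base class. FilledRect needs no resources, so its resource dictionary stays empty; content that uses fonts or images would register them there.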

Images

pyHanko relies on Pillow for image support. In particular, we currently support pretty much all RGB bitmap types that Pillow can handle. Other colour spaces are not (yet) available. Additionally, we currently don’t take advantage of PDF’s native JPEG support, or some of its more clever image compression techniques.

The pyhanko.pdf_utils.images module provides a PdfContent subclass (aptly named pyhanko.pdf_utils.images.PdfImage) as a convenience.

Text & layout

The layout code in pyHanko is currently very, very primitive, fragile and likely to change significantly going forward. That said, pyHanko can do some basic text box rendering, and is capable of embedding CID-keyed OTF fonts for use with CJK text, for example. Given the (for now) volatile state of the API, I won’t document it here, but you can take a look at pyhanko.pdf_utils.text and pyhanko.pdf_utils.font, or the code in pyhanko.stamp.