Python and PDF: A Review of Existing Tools

by Johannes Filter, Apr 11, 2020 (last update: Aug 2, 2020)

The Portable Document Format (PDF) was invented in the early 1990s and is still thriving. But PDFs are made mainly for humans, not machines, so it is often hard to extract information from them automatically. On top of that, more and more functionality has been added to the format over the years, and its complexity makes it hard to get started. Fortunately, there are well-established software tools, some of which have been around for decades. In this blog post, I review some of them with a focus on Python, which is currently the go-to language for data processing.

PDF Utility/System Libraries

Some PDF libraries power PDF viewers but can also be used to work on PDFs from the command line.

  • ghostscript: an interpreter for PostScript and PDF that predates the PDF format itself, so this is the mother of all PDF tools.
  • poppler: used by the vast majority of Linux-based PDF viewers, LibreOffice, etc. ‘poppler-utils’ is a collection of tools built upon poppler, e.g. ‘pdftotext’.
  • mutool: the command-line tool of MuPDF, the library behind an alternative, lightweight PDF viewer.
  • qpdf: not a fully-fledged PDF library, mainly used to manipulate PDFs.
  • pdftk (unmaintained; a maintained Java re-implementation exists)
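
Since these tools are command-line programs, the simplest way to use them from Python is via ‘subprocess’. The following is a minimal sketch, assuming ‘pdftotext’ from ‘poppler-utils’ is installed and on the PATH. The function names are made up for illustration; only the ‘-layout’ option and the ‘-’ output argument (write to standard output) belong to ‘pdftotext’ itself.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Return the argument list for poppler-utils' pdftotext, writing to stdout."""
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")  # try to preserve the physical page layout
    command += [pdf_path, "-"]  # "-" as the output file means standard output
    return command


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the text, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling ‘pdf_to_text("report.pdf")’ (with a hypothetical file name) would return the text of the whole document as one string. The same pattern should work for the other tools in this list, with their own arguments.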

Creating & Reading PDFs

Getting Information out of PDFs

  • parsr (recommended): tries to transform PDFs into structured data, internally uses ‘pdfminer.six’, ‘camelot’ and more.
  • pdfminer.six: a maintained fork of pdfminer
  • pdfplumber (recommended): works best on digitally created PDFs (as opposed to scans), built upon ‘pdfminer.six’
  • pdflib: wrapper around ‘poppler’. This library replaces the need for the specialized packages pdftotext and pdf2image that also use ‘poppler’ underneath. However, it’s hard to set up.
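
To give an impression of what working with one of these libraries looks like, here is a minimal sketch of page-by-page text extraction with ‘pdfplumber’. It assumes ‘pdfplumber’ is installed; the helper names ‘join_page_texts’ and ‘extract_text’ are made up for illustration, while ‘pdfplumber.open’, ‘pdf.pages’ and ‘page.extract_text()’ are part of the library.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page texts, skipping pages where nothing was extracted."""
    return separator.join(
        text.strip() for text in page_texts if text and text.strip()
    )


def extract_text(pdf_path):
    """Extract the text of a digital PDF page by page with pdfplumber."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page_texts = [page.extract_text() for page in pdf.pages]
    return join_page_texts(page_texts)
```

Pages without a text layer (e.g. scans) yield no text, which is why the empty results are filtered out here. For such documents, OCR is needed first.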

Tables

General Text Extraction of Files

OCR

Preprocessing Scans

  • Scan Tailor: GUI for post-processing scanned pages (unmaintained).
  • pdfCropMargins: crops the margins of a PDF
  • unpaper: old-school image preprocessing, no Python bindings
  • pypillowfight: new fork of unpaper (incl. Python bindings)

Miscellaneous

  • mat2: metadata removal
  • DangerZone: GUI to create safe PDFs in three steps: (1) transform the PDF into images, (2) transform the images back into a PDF, (3) OCR (and compress) the PDF
  • pdfc: compression
  • pdf-redactor: redaction
  • pdf-scripts: a collection of scripts to work on PDFs (I’m the author)
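
The three DangerZone steps listed above can be approximated with command-line tools. The sketch below is not how DangerZone itself is implemented and provides none of the isolation a dedicated tool may add. It merely illustrates the idea, assuming ‘pdftoppm’ (from ‘poppler-utils’), ‘img2pdf’ and ‘ocrmypdf’ are installed. All function names, the ‘page’ prefix and the ‘flattened.pdf’ file name are made up for illustration.

```python
import subprocess
from pathlib import Path


def rasterize_command(source_pdf, prefix, dpi=150):
    """Step 1: render every page of the PDF to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(source_pdf), str(prefix)]


def rebuild_command(page_images, flattened_pdf):
    """Step 2: combine the page images into a new, image-only PDF."""
    return ["img2pdf", *map(str, page_images), "-o", str(flattened_pdf)]


def ocr_command(flattened_pdf, safe_pdf):
    """Step 3: add a text layer to the image-only PDF via OCR."""
    return ["ocrmypdf", str(flattened_pdf), str(safe_pdf)]


def sanitize(source_pdf, work_dir, safe_pdf):
    """Run the three steps in order; work_dir must be an existing directory."""
    work_dir = Path(work_dir)
    subprocess.run(rasterize_command(source_pdf, work_dir / "page"), check=True)
    # pdftoppm zero-pads the page numbers, so sorting keeps the page order.
    page_images = sorted(work_dir.glob("page-*.png"))
    flattened_pdf = work_dir / "flattened.pdf"
    subprocess.run(rebuild_command(page_images, flattened_pdf), check=True)
    subprocess.run(ocr_command(flattened_pdf, safe_pdf), check=True)
```

The detour via images drops everything that is not visible on the page, at the cost of larger files and a text layer that is only as good as the OCR.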

Further Resources

Please write me an email if you think a tool is missing.