Python and PDF: A Review of Existing Tools

The Portable Document Format (PDF) was invented in the early 1990s and is still thriving. But PDFs are made mainly for humans, not machines, so it is often hard to extract information from them automatically. On top of that, more and more features have been added to the format over the years, and its complexity makes it hard to get started. Fortunately, there are well-established software tools, some of which have been around for decades. In this blog post, I review some of them with a focus on Python, which is the default language for data processing right now.

PDF Utilities / System Libraries

Some PDF libraries power PDF viewers but can also be used to work on PDFs from the command line. These are general-purpose tools that can be used for several tasks.

ghostscript: an interpreter for PostScript and PDF. PDF grew out of PostScript, so this is the mother of all PDF tools.
poppler: Used by the vast majority of Linux-based PDF viewers, LibreOffice, etc. ‘poppler-utils’ is a collection of tools built upon poppler, e.g. ‘pdftotext’.
mutool: The command-line tool of MuPDF, another lightweight PDF library (there is also a PDF viewer).
qpdf: Not a fully-fledged PDF library, mainly used to manipulate PDFs.
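
To give an idea of how these command-line tools can be driven from Python, here is a minimal sketch that calls ‘pdftotext’ from poppler-utils via the standard library. The function names are my own for illustration, and the sketch assumes ‘pdftotext’ is installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path: str, layout: bool = True) -> list[str]:
    """Build the argument list for poppler's `pdftotext`, writing to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical layout of the page
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def pdf_to_text(pdf_path: str) -> str:
    """Run `pdftotext` and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils first")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling `pdf_to_text("report.pdf")` (a placeholder filename) would return the text of the whole document as one string. Keeping the command construction separate makes it easy to inspect or log what will be executed.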

PDF Toolkits / Python Bindings

pymupdf: wrapper around ‘mutool’ that also extends it in some cases (e.g. PDF EmbeddedFiles). General purpose tool with a lot of example scripts.
pikepdf (recommended): wrapper around ‘qpdf’.
pdflib: wrapper around ‘poppler’. This library replaces the need for the specialized packages pdftotext and pdf2image that also use ‘poppler’ underneath. However, it’s hard to set up.
pdftk: PDF manipulation (unmaintained, but a maintained Java re-implementation exists).
pypdfium2: bindings for Google’s PDFium similar to pymupdf, but not licensed under GPL
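
As a small illustration of what PDF manipulation with one of these bindings looks like, here is a sketch that copies selected pages into a new file with pikepdf. The helper names and the page-range syntax ("1-3,5") are my own assumptions for this example; only the calls on `pikepdf.Pdf` come from the library.

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        start, _, end = part.strip().partition("-")
        first = int(start)
        last = int(end) if end else first
        indices.extend(range(first - 1, last))
    return indices


def extract_pages(src_path: str, dst_path: str, spec: str) -> None:
    """Copy the selected pages of src_path into a new PDF at dst_path."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pikepdf

    with pikepdf.Pdf.open(src_path) as src:
        dst = pikepdf.Pdf.new()
        for index in parse_page_ranges(spec):
            dst.pages.append(src.pages[index])
        dst.save(dst_path)  # save while the source file is still open
```

For example, `extract_pages("in.pdf", "out.pdf", "1-3,5")` would write pages 1, 2, 3 and 5 of a (placeholder) input file to a new document. The sketch does no validation, so a page number beyond the end of the document raises an error.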

Creating & Reading PDFs

PyPDF4: Python-only PDF manipulation. There is quite a history of forks (PyPDF, PyPDF2, PyPDF4).
pdfrw (unmaintained)
reportlab: can only create PDFs
Python-PDFKit: create PDFs from HTML, a wrapper around wkhtmltopdf
WeasyPrint: another tool to create PDFs from HTML
matplotlib: generally a plotting library but it’s also able to generate PDFs

Getting Information out of PDFs

parsr (recommended): tries to transform PDF into structured data, internally uses ‘pdfminer.six’, ‘camelot’ and more.
pdfminer.six: a maintained fork of pdfminer
pdfplumber (recommended): works best on digital PDFs, built upon ‘pdfminer.six’
py-pdf-parser: another tool built upon ‘pdfminer.six’, includes a simple tool to visualize elements of a PDF document
pdfreader: pure Python
pdf-to-markdown: using pdf.js to turn PDFs to Markdown
pd3f: PDF text extraction pipeline based on parsr, ocrmypdf and other tools (I’m the author)
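
To show what extraction from a digital PDF can look like, here is a minimal sketch using pdfplumber to read the text and the largest detected table of the first page. The function names are my own, and treating the first table row as a header is an assumption that will not hold for every document.

```python
def table_to_records(rows):
    """Treat the first row as the header and return one dict per data row."""
    if not rows:
        return []
    # pdfplumber returns None for empty cells, so normalise them to "".
    header = [(cell or "").strip() for cell in rows[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in rows[1:]
    ]


def first_page_content(pdf_path):
    """Return (text, table records) for the first page of a PDF."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]
        text = page.extract_text() or ""
        table = page.extract_table()  # None if no table was detected
    return text, table_to_records(table or [])
```

Scanned PDFs without a text layer will yield little or nothing here; they need OCR first.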

Tables

camelot: new tool
tabula: old tool
pdftabextract: last resort for e.g. scanned PDFs

Invoices

invoice2data: extract content from invoices with the help of pre-defined templates
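
The idea of template-based extraction can be sketched with the standard library alone: a template names the keywords that identify a supplier and one regular expression per field. This is only an illustration of the principle, not invoice2data's actual template format or API; the supplier, field names and patterns below are made up.

```python
import re

# Hypothetical template for one supplier; all names and patterns are invented.
TEMPLATE = {
    "keywords": ["ACME GmbH"],
    "fields": {
        "invoice_number": r"Invoice No\.?\s*:?\s*(\S+)",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "total": r"Total\s*:?\s*(\d+[.,]\d{2})",
    },
}


def extract_fields(text, template):
    """Return the template's fields found in text, or None if it does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


# Made-up invoice text, as it might come out of a text extraction step.
sample = """ACME GmbH
Invoice No: 2021-0042
Date: 2021-03-15
Total: 119,00 EUR"""

print(extract_fields(sample, TEMPLATE))
```

The approach works well for recurring invoices from known suppliers, but every new layout needs its own template.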

General Text Extraction of Files

Tika: oldschool text extraction in Java, tika-python
textract: very similar to Tika but in Python

OCR

OCRmyPDF: wrapper around Tesseract
EasyOCR: new deep-learning-based OCR

Preprocessing Scans

ScanTailor: GUI for post-processing scanned pages (unmaintained), but a maintained fork exists: ScanTailor Advanced.
pdfCropMargins: cropping the margins from a PDF
unpaper: oldschool image preprocessing, no Python bindings
pypillowfight: new fork of unpaper (incl. Python bindings)
krop: GUI for cropping PDF

Document Management Systems

Mayan: oldschool
papermerge: newschool
docspell
Aleph: large-scale document processing, mostly for investigative journalism and leaks
OpenPaper: Linux/Windows program

Miscellaneous

mat2: metadata removal
DangerZone: GUI to create safe PDFs in three steps: 1. transform the PDF into images, 2. transform the images into a PDF, 3. OCR (& compress) the PDF
pdfc: compression
pdf-redactor: redaction
OpenRedact: app and other tools for redaction
pdf-scripts: a collection of scripts (Bash / Python) to work on PDFs (I’m the author)
archive-pdf-tools: fast PDF generation and compression by the Internet Archive
k2pdfopt
borb

Machine Learning based Document Layout Analysis

layout-parser: using machine learning (computer vision) to structure the PDF layout
eynollah
deepdoctection
doctr

Further Resources

Please write me an email if you think a tool is missing.