Python and PDF: A Review of Existing Tools

#PDF #Python

The Portable Document Format (PDF) was invented in the early 1990s and it's still thriving. But PDFs are made mainly for humans, not machines, so it's often hard to extract information from them automatically. On top of that, more and more features have been added to the format over the years, and this complexity makes it hard to get started. Fortunately, there are well-established software tools, some of which have been around for decades. In this blog post, I review some of these tools with a focus on Python, which is arguably the default language for data processing right now.

PDF Utilities / System Libraries

Some of the libraries that power PDF viewers can also be used to work on PDFs from the command line. They are general-purpose tools that can be used for several tasks.
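To give an idea of how such a command-line tool can be driven from Python, here is a minimal sketch. It assumes Poppler's `pdftotext` as the example tool (my pick for illustration, not necessarily the one you'd end up using) and that it is installed and on your `PATH`. The function names and the file name `example.pdf` are made up for this example.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str, layout: bool = True) -> List[str]:
    """Build the argument list for Poppler's `pdftotext` CLI (assumed tool).

    The trailing "-" tells pdftotext to write the text to stdout
    instead of to a file.
    """
    command = ["pdftotext"]
    if layout:
        # Try to keep the physical layout of the page in the text output.
        command.append("-layout")
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path: str) -> Optional[str]:
    """Return the text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Calling `extract_text("example.pdf")` would then give you the text of that (hypothetical) file as a string. Going through `subprocess` keeps the Python side dependency-free, but you pay for it with a process start per file and you only get whatever the tool prints.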

PDF Toolkits / Python Bindings

Creating & Reading PDFs

Getting Information out of PDFs

Tables

Invoices

General Text Extraction of Files

OCR

Preprocessing Scans

Document Management Systems

Miscellaneous

Machine Learning based Document Layout Analysis

Further Resources

Please write me an email if you think a tool is missing.