Going beyond PDF with pd3f. pd3f is an open-source PDF text extraction pipeline that is self-hosted, local-first, and Docker-based. It reconstructs the original continuous text with the help of machine learning.
The work was funded by the German Federal Ministry of Education and Research as part of the Prototype Fund.