This repository provides a comparison of various PDF parsing tools, complementing the blog post Comparison of PDF parsing options. It includes the code and output generated by each tool:
marker-pdf-folder
minerU-folder
gemini-folder
llamaparse-folder
docling-folder
In addition to the tool-specific folders, the repository also includes:
pdfs: The PDF files used for the comparison.
app: The code for the Streamlit application used to visualize the results.
To replicate the results for a given tool, go to that tool's folder, install the required Python environment with uv sync, and follow the detailed instructions in that folder's README.md file.
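The environment setup above can be scripted when all tools need to be prepared at once. The following is a minimal sketch, not part of the repository: it assumes the tool folders sit at the repository root under the names listed in this README (the actual directory names may differ), that uv is on the PATH, and the helper names find_tool_folders and sync_all are hypothetical.

```python
import subprocess
from pathlib import Path

# Folder names copied from the list in this README; treat them as assumptions
# and adjust them if the directories in the repository are named differently.
TOOL_FOLDERS = [
    "marker-pdf-folder",
    "minerU-folder",
    "gemini-folder",
    "llamaparse-folder",
    "docling-folder",
]


def find_tool_folders(repo_root: Path) -> list[Path]:
    """Return the listed tool folders that actually exist under repo_root."""
    return [repo_root / name for name in TOOL_FOLDERS if (repo_root / name).is_dir()]


def sync_all(repo_root: Path) -> None:
    """Run `uv sync` inside each existing tool folder, stopping at the first failure."""
    for folder in find_tool_folders(repo_root):
        print(f"Syncing environment in {folder}")
        # check=True raises CalledProcessError if `uv sync` exits with a non-zero status.
        subprocess.run(["uv", "sync"], cwd=folder, check=True)
```

Calling sync_all(Path(".")) from the repository root would prepare every tool folder that is present. Each tool may still need additional steps, such as API keys, so the per-folder README.md remains the reference.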
You can explore the results in two ways:
- Hugging Face Space: Visit the PDF parsing demo on Hugging Face Spaces.
- Local Streamlit App: Run the Streamlit app locally. Navigate to the app folder in your terminal and execute the following commands:
    uv sync
    uv run streamlit run streamlit_app.py
The PDF files located in the pdfs directory were sourced from the following locations:
XC9500_CPLD_Family-1-4.pdf: Downloaded from https://media.digikey.com/pdf/Data%20Sheets/AMD/XC9500_CPLD_Family.pdf
2023-conocophillips-aim-presentation-1-7.pdf: Downloaded from https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf
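To fetch the two original documents again, a short script along these lines could be used. It is only a sketch: the helper names local_name and download_sources are hypothetical, the files are saved under the names taken from the URLs rather than the names used in the pdfs directory, and some servers may reject the default urllib user agent.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse
from urllib.request import urlretrieve

# Source URLs as given in this README.
SOURCE_URLS = [
    "https://media.digikey.com/pdf/Data%20Sheets/AMD/XC9500_CPLD_Family.pdf",
    "https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf",
]


def local_name(url: str) -> str:
    """Derive a file name from the last (percent-decoded) path segment of the URL."""
    return PurePosixPath(unquote(urlparse(url).path)).name


def download_sources(target_dir: Path) -> list[Path]:
    """Download each source PDF into target_dir and return the saved paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in SOURCE_URLS:
        destination = target_dir / local_name(url)
        # urlretrieve writes the response body to the given file path.
        urlretrieve(url, str(destination))
        saved.append(destination)
    return saved
```

For example, download_sources(Path("downloads")) would save the two files into a downloads folder. The script does not reproduce the renaming seen in the pdfs directory (for instance XC9500_CPLD_Family-1-4.pdf), so the downloaded files are not drop-in replacements for the ones used in the comparison.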
The following four PDF files are sourced from the RAG blog benchmark, specifically from the associated Google Drive folder:
gx-iif-open-data.pdf
deloitte-tech-risk-sector-banking.pdf
life-sciences-smart-manufacturing-services-peak-matrix-assessment-2023.pdf
dttl-tax-technology-report-2023.pdf