pdf-parsing

This repository compares several PDF parsing tools and accompanies the blog post Comparison of PDF parsing options. For each tool, it contains the code used and the output produced:

marker-pdf-folder
minerU-folder
gemini-folder
llamaparse-folder
docling-folder

In addition to the tool-specific folders, the repository also includes:

pdfs: The PDF files used for the comparison.
app: The code for the Streamlit application used to visualize the results.

To reproduce the results for a given tool, install the required Python environment in that tool's folder with uv sync, then follow the detailed instructions in the folder's README.md file.
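As a rough sketch of the setup step above, the loop below runs uv sync in each tool folder from the repository root. It assumes that uv is installed and that every tool folder carries its own project configuration; the folder names come from the list above, and the variable name tool_folders is only illustrative. Each tool's README.md remains the reference for the actual run steps (API keys, model downloads, and so on).

```shell
# Run from the repository root. Folder names are taken from the list above.
tool_folders="marker-pdf-folder minerU-folder gemini-folder llamaparse-folder docling-folder"

for folder in $tool_folders; do
    if [ ! -d "$folder" ]; then
        echo "skipped: $folder (folder not found)"
        continue
    fi
    # Assumption: each tool folder has its own uv project to sync.
    if (cd "$folder" && uv sync); then
        echo "synced: $folder (next: follow $folder/README.md)"
    else
        echo "failed: uv sync in $folder"
    fi
done
```

Syncing a single folder by hand (cd into it, then uv sync) is equivalent and may be preferable when only one tool is of interest.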

Visualizing the Results with the Streamlit App

You can explore the results in two ways:

Hugging Face Space: Visit the PDF parsing demo on Hugging Face Spaces.
Local Streamlit App: Run the Streamlit app locally. Navigate to the app folder in your terminal and execute the following commands:
```
uv sync
uv run streamlit run streamlit_app.py
```

Sources of PDF Files

The PDF files located in the pdfs directory were sourced from the following locations:

XC9500_CPLD_Family-1-4.pdf: Downloaded from https://media.digikey.com/pdf/Data%20Sheets/AMD/XC9500_CPLD_Family.pdf
2023-conocophillips-aim-presentation-1-7.pdf: Downloaded from https://static.conocophillips.com/files/2023-conocophillips-aim-presentation.pdf

The following four PDF files are sourced from the RAG blog benchmark, specifically from the associated Google Drive folder:

gx-iif-open-data.pdf
deloitte-tech-risk-sector-banking.pdf
life-sciences-smart-manufacturing-services-peak-matrix-assessment-2023.pdf
dttl-tax-technology-report-2023.pdf

nbrosse/pdf-parsing
