
nbrosse/pdf-parsing


pdf-parsing

This repository compares several PDF parsing tools and complements the blog post Comparison of PDF parsing options. For each tool, it includes the code and the output that tool generated:

  • marker-pdf-folder
  • minerU-folder
  • gemini-folder
  • llamaparse-folder
  • docling-folder

In addition to the tool-specific folders, the repository also includes:

  • pdfs: The PDF files used for the comparison.
  • app: The code for the Streamlit application used to visualize the results.

To replicate the results in a tool's folder, install the required Python environment with uv sync and follow the detailed instructions in that folder's README.md file.

Visualizing the Results with the Streamlit App

You can explore the results in two ways:

  1. Hugging Face Space: Visit the PDF parsing demo on Hugging Face Spaces.

  2. Local Streamlit App: Run the Streamlit app locally. Navigate to the app folder in your terminal and execute the following commands:

    uv sync
    uv run streamlit run streamlit_app.py

Sources of PDF Files

The PDF files in the pdfs directory were sourced as follows.

These four files come from the RAG blog benchmark, specifically from its associated Google Drive folder:

  • gx-iif-open-data.pdf
  • deloitte-tech-risk-sector-banking.pdf
  • life-sciences-smart-manufacturing-services-peak-matrix-assessment-2023.pdf
  • dttl-tax-technology-report-2023.pdf
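Before running any of the tools, it can help to confirm that the four files listed above are actually present locally. The following is a minimal sketch, not part of the repository: the helper name missing_pdfs is hypothetical, and it assumes the script is run from the repository root so that the relative path pdfs resolves to the directory described above.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_PDFS = [
    "gx-iif-open-data.pdf",
    "deloitte-tech-risk-sector-banking.pdf",
    "life-sciences-smart-manufacturing-services-peak-matrix-assessment-2023.pdf",
    "dttl-tax-technology-report-2023.pdf",
]


def missing_pdfs(pdf_dir):
    """Return the expected file names that are not present in pdf_dir.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(pdf_dir)
    return [name for name in EXPECTED_PDFS if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory is the repository root.
    missing = missing_pdfs("pdfs")
    if missing:
        print("Missing PDF files:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All four benchmark PDF files are present.")
```

The check looks only at file names, so it will not detect a truncated or corrupted download.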
