PDF Playground

A tool for comparing PDF parsing libraries in Python. It helps evaluate several PDF parsers side by side to understand their strengths and limitations.

📚 Supported PDF Parsing Libraries

PyPDF
PyMuPDF
Doctr
Marker
Markitdown
Docling
SmolDocling
Unstructured

🗂 Project Structure

pdf-playground/
├── src/                   # Source code for different PDF parsers
│   ├── docling.py        # Docling implementation
│   ├── doctr.py          # Doctr implementation
│   ├── marker.py         # Marker implementation
│   ├── markitdown.py     # Markitdown implementation
│   ├── pymupdf.py        # PyMuPDF implementation
│   ├── pypdf.py          # PyPDF implementation
│   ├── smoldocling.py    # SmolDocling implementation
│   ├── unstructured.py   # Unstructured implementation
│   ├── save_markdowns.py # Utility for saving results
│   └── settings.py       # Project settings
├── examples/             # Test PDF files
│   ├── academic_paper_figure.pdf
│   ├── attention_paper.pdf
│   ├── complex_layout.pdf
│   ├── french.pdf
│   ├── handwriting_form.pdf
│   ├── invoice.pdf
│   ├── magazine_complex_layout.pdf
│   ├── table.pdf
│   └── ...
├── results/              # Parsed output from different libraries
│   ├── docling_result_text.md
│   ├── doctr_result_text.md
│   ├── marker_result_text.md
│   ├── markitdown_result_text.md
│   ├── pymupdf4llm_result_text.md
│   ├── pypdf_result_text.md
│   ├── smoldocling_result_text.md
│   └── unstrctured_result_text.md
├── debug_data/          # Visual debugging data
│   └── PDF_Parsing_Analysis/
├── requirements.txt     # Project dependencies
└── notebook.ipynb      # Jupyter notebook for analysis

🚀 Getting Started

Clone the repository:

git clone https://github.com/yourusername/pdf-playground.git
cd pdf-playground

Install dependencies:

pip install -r requirements.txt
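
After installing, it can help to confirm which parsers are importable. The following is a minimal sketch, not part of the repository: the import names in CANDIDATE_MODULES are assumptions about how each library is exposed and should be checked against requirements.txt. SmolDocling is left out because it is a model rather than a package.

```python
from importlib.util import find_spec

# Assumed import names for each library; confirm against requirements.txt.
CANDIDATE_MODULES = {
    "PyPDF": "pypdf",
    "PyMuPDF": "pymupdf",
    "Doctr": "doctr",
    "Marker": "marker",
    "Markitdown": "markitdown",
    "Docling": "docling",
    "Unstructured": "unstructured",
}


def check_installed(modules):
    """Return {library name: True/False} depending on whether the module can be found."""
    return {name: find_spec(module) is not None for name, module in modules.items()}


if __name__ == "__main__":
    for name, ok in check_installed(CANDIDATE_MODULES).items():
        print(f"{name:<14}{'ok' if ok else 'missing'}")
```

Keep such a script outside src/: files there, such as docling.py and pypdf.py, share names with the libraries and would be found in their place.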

📄 Usage

Place your PDF files in the examples/ directory
Run individual parser implementations from the src/ directory
Check the parsed results in the results/ directory
Use the Jupyter notebook for comparative analysis
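
The steps above can be sketched as a small loop over examples/. This is an illustration under stated assumptions, not the code in src/: list_pdfs, run_parser, and extract_with_pypdf are hypothetical helper names, and pypdf's PdfReader is used only as one example of a parser.

```python
from pathlib import Path


def list_pdfs(examples_dir="examples"):
    """Return the PDF files in examples_dir, sorted by name."""
    return sorted(Path(examples_dir).glob("*.pdf"))


def extract_with_pypdf(pdf_path):
    """Extract plain text from every page using pypdf's PdfReader."""
    from pypdf import PdfReader  # imported lazily so list_pdfs works without pypdf

    reader = PdfReader(str(pdf_path))
    # extract_text() can return an empty result for image-only pages.
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def run_parser(parse, examples_dir="examples"):
    """Apply parse(path) -> str to each example PDF, keyed by file name."""
    return {pdf.name: parse(pdf) for pdf in list_pdfs(examples_dir)}
```

Calling run_parser(extract_with_pypdf) from the repository root would return the extracted text for each file in examples/. Any other function that takes a path and returns a string can be passed in the same way.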

📊 Example Files

The examples/ directory contains various PDF files to test different parsing scenarios:

Academic papers with figures
Complex magazine layouts
Tables, including merged cells
Handwritten forms
Invoices
Multi-language documents (e.g., French)

🔍 Results

Parsing results are saved as Markdown files in the results/ directory. Each implementation has its own output file for easy comparison:

docling_result_text.md
doctr_result_text.md
marker_result_text.md
And more...
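
For a quick first comparison outside the notebook, the output files can be summarized by size. This is a rough sketch that assumes every result file follows the <parser>_result_text.md naming shown above. summarize_results is a hypothetical helper, and character and word counts only hint at how much text each parser recovered, not how accurate it is.

```python
from pathlib import Path

SUFFIX = "_result_text.md"


def summarize_results(results_dir="results"):
    """Return {parser name: (characters, words)} for each *_result_text.md file."""
    summary = {}
    for path in sorted(Path(results_dir).glob(f"*{SUFFIX}")):
        text = path.read_text(encoding="utf-8")
        parser = path.name[: -len(SUFFIX)]  # strip the shared suffix
        summary[parser] = (len(text), len(text.split()))
    return summary


if __name__ == "__main__":
    for parser, (chars, words) in summarize_results().items():
        print(f"{parser:<14}{chars:>10} chars{words:>10} words")
```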

🛠 Development

To add a new PDF parser implementation:

Create a new Python file in the src/ directory
Implement the parsing logic
Use save_markdowns.py to save the results
Update the notebook to include the new parser in the comparison
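
A new implementation might follow the shape below. This is a sketch under assumptions: the interface of save_markdowns.py is not documented here, so save_result is a hypothetical stand-in that writes to the same <parser>_result_text.md naming used in results/, and run_and_save is likewise an illustrative name.

```python
from pathlib import Path


def save_result(parser_name, text, results_dir="results"):
    """Write text to <results_dir>/<parser_name>_result_text.md and return the path.

    Hypothetical stand-in for save_markdowns.py, whose real interface may differ.
    """
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{parser_name}_result_text.md"
    out_path.write_text(text, encoding="utf-8")
    return out_path


def run_and_save(parser_name, parse, pdf_path, results_dir="results"):
    """Run parse(pdf_path) -> str for the new library and save its output."""
    text = parse(pdf_path)
    return save_result(parser_name, text, results_dir)
```

Here parse is whatever function wraps the new library and returns the parsed document as a string, so the saving step stays the same across parsers.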

📝 Requirements

See requirements.txt for a full list of dependencies.

Raviguntakala/pdf-parser-comparision
