Implement OCR support and benchmark evaluation for PDF extraction #190

ljy03 · 2025-12-22T05:55:31Z

- Added OCR fallback functionality in `document_rag.py` to improve PDF text extraction using MinerU.
- Introduced command-line arguments for OCR processing options, including auto-detection of scanned PDFs.
- Created a new benchmark module for evaluating OCR accuracy on the olmOCR-Bench dataset, reporting Character Error Rate (CER) and Word Error Rate (WER).
- Added a setup script for downloading the olmOCR-Bench dataset, plus README documentation with usage instructions.

Together, these changes add OCR support to the document processing pipeline and provide a framework for measuring how accurate that OCR output is.
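
For readers unfamiliar with the two metrics, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length (in characters for CER, in whitespace-separated words for WER). The function names `edit_distance`, `cer`, and `wer` are hypothetical and chosen for illustration; they are not taken from the benchmark module in this PR, which may normalize text or handle edge cases differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        # Assumed convention for an empty reference; other tools may differ.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Same assumed convention as in cer().
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both rates can exceed 1.0 when the OCR output contains many spurious insertions, so they are error ratios rather than bounded percentages.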

Related Issues

[feat] OCR based application #158

Checklist

Tests pass (uv run pytest)
Code formatted (ruff format and ruff check)
Pre-commit hooks pass (pre-commit run --all-files)

