ljy03 commented Dec 22, 2025

  • Added an OCR fallback to document_rag.py that uses MinerU to improve PDF text extraction.
  • Introduced command-line arguments for OCR processing options, including auto-detection of scanned PDFs.
  • Created a new benchmark module that evaluates OCR accuracy on the olmOCR-Bench dataset and reports Character Error Rate (CER) and Word Error Rate (WER).
  • Added a setup script for downloading the olmOCR-Bench dataset, plus README documentation with usage instructions.
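
For reviewers unfamiliar with the metrics, here is a minimal sketch of how CER and WER are commonly computed from Levenshtein edit distance. It is an illustration only, not the code in the benchmark module: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and the handling of an empty reference is an assumed convention.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        # Assumed convention: empty reference scores 0 only if hypothesis is empty.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits divided by reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Same assumed convention as in cer().
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the OCR output is much longer than the ground truth.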

This update adds OCR support to the document processing pipeline and provides a framework for measuring OCR accuracy.
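
One possible way to auto-detect scanned PDFs is to check how much text the regular extractor recovers per page and fall back to OCR when most pages come back nearly empty. The sketch below illustrates that heuristic under stated assumptions. The function name `looks_scanned` and both thresholds are hypothetical and are not taken from this PR.

```python
from typing import Iterable


def looks_scanned(
    page_texts: Iterable[str],
    min_chars_per_page: int = 50,   # hypothetical threshold
    max_sparse_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Guess whether a PDF is scanned from its per-page extracted text.

    A page counts as sparse when its stripped text is shorter than
    min_chars_per_page. The document is treated as scanned when the share
    of sparse pages exceeds max_sparse_ratio.
    """
    pages = list(page_texts)
    if not pages:
        return False  # nothing to judge; leave the decision to the caller
    sparse = sum(1 for text in pages if len(text.strip()) < min_chars_per_page)
    return sparse / len(pages) > max_sparse_ratio
```

A caller would pass in the text already extracted for each page and route the file through the OCR path only when this returns True.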

Related Issues

[feat] OCR based application #158

Checklist

  • Tests pass (uv run pytest)
  • Code formatted (ruff format and ruff check)
  • Pre-commit hooks pass (pre-commit run --all-files)
