This project implements a robust three-stage OCR pipeline that evolves from traditional text extraction to an intelligent hybrid system using deep learning. It’s designed for structured document processing and is ideal for scanned PDFs, images, and forms.
All models run within a Python virtual environment for isolation and reproducibility.
**Model 1:** A simple command-line tool using pytesseract for text extraction.
- Extracts plain text from images or PDFs
- Lightweight and easy to run
**Model 2:** A user-friendly web interface built with Flask that uses Tesseract OCR.
- Upload PDF or image files
- Web-based interaction
- Real-time OCR results on browser
**Model 3:** A high-performance OCR system that combines Microsoft's TrOCR (transformers) with Tesseract for layout and accuracy.
- Transformer-based deep learning OCR
- Hybrid inference with fallback/combination strategy
- Best suited for scanned documents, handwritten text, or low-quality images
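The fallback/combination strategy can be sketched independently of the models: prefer the transformer's output, but fall back to Tesseract when that output looks unreliable. This is an illustrative sketch; the heuristic (a minimum count of alphanumeric characters) and the function names are assumptions, not the project's actual rule:

```python
# Sketch of a hybrid fallback strategy: prefer the transformer
# (TrOCR) output, but fall back to Tesseract when it looks empty or
# degenerate. Both OCR engines are passed in as callables, so the
# strategy itself stays model-agnostic.
from typing import Callable


def hybrid_ocr(
    image,
    trocr: Callable[[object], str],
    tesseract: Callable[[object], str],
    min_chars: int = 3,
) -> str:
    primary = trocr(image)
    # Count "real" characters; whitespace or punctuation alone
    # suggests the transformer failed on this input.
    if sum(c.isalnum() for c in primary) >= min_chars:
        return primary
    return tesseract(image)


# Usage with stub engines: the empty TrOCR result triggers the fallback.
text = hybrid_ocr("img", trocr=lambda _: "  ", tesseract=lambda _: "Invoice 42")
# → "Invoice 42"
```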
```bash
python -m venv venv

# Activate the environment
source venv/bin/activate   # On macOS/Linux
venv\Scripts\activate      # On Windows
```

Use this common requirements.txt (add other model-specific ones if needed):
```text
flask
pytesseract
pdf2image
pillow
transformers
torch
```

Install via:

```bash
pip install -r requirements.txt
```

Model 1 requirements:
- pytesseract
- Pillow
- Tesseract-OCR installed on system
Run:

```bash
python model1_tesseract_ocr.py
```

Extracted text is printed in the terminal or saved to a `.txt` file.
Model 2 requirements:
- flask
- pytesseract
- pdf2image
- Pillow
- Werkzeug (comes with Flask)
Run:

```bash
python app.py
```

Open http://localhost:5000 to upload documents and view OCR results.
```text
templates/
└── index.html
static/
└── style.css
uploads/
└── (temporary user uploads)
```
Model 3 requirements:
- transformers
- torch
- pytesseract
- pdf2image
- Pillow
Run:

```bash
python model3_hybrid_ocr.py
```

- Uses the TrOCR model from Hugging Face (`microsoft/trocr-base-stage1`)
- Performs multi-pass OCR for better accuracy
- Can process multiple page documents
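The multi-pass idea can be sketched as running OCR over several preprocessed variants of each page and keeping the richest result. The preprocessing choices here (grayscale, autocontrast, binarization) and the scoring rule are assumptions for illustration, not the project's exact passes:

```python
# Multi-pass OCR sketch: OCR several preprocessed variants of an
# image and keep the result with the most alphanumeric content.
# The OCR engine is injected as a callable, so this works with
# either Tesseract or TrOCR.
from typing import Callable, List

from PIL import Image, ImageOps


def passes(img: Image.Image) -> List[Image.Image]:
    gray = img.convert("L")
    return [
        img,                                           # original
        gray,                                          # grayscale
        ImageOps.autocontrast(gray),                   # contrast-stretched
        gray.point(lambda p: 255 if p > 128 else 0),   # binarized
    ]


def multipass_ocr(img: Image.Image, ocr: Callable[[Image.Image], str]) -> str:
    results = [ocr(variant) for variant in passes(img)]
    return max(results, key=lambda t: sum(c.isalnum() for c in t))
```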
**Windows:** Download the installer from:
https://github.com/tesseract-ocr/tesseract

Add the Tesseract path to your environment variables, e.g.:
`C:\Program Files\Tesseract-OCR\tesseract.exe`

**Linux:**

```bash
sudo apt update
sudo apt install tesseract-ocr
```

**macOS:**

```bash
brew install tesseract
```

```text
OCR-Project/
├── venv/
├── requirements.txt
├── model1_tesseract_ocr.py
├── model3_hybrid_ocr.py
├── app.py
├── templates/
├── static/
├── uploads/
└── README.md
```
- Model 1: Outputs plain text
- Model 2: Web preview + downloadable text
- Model 3: Enhanced text output, may include layout-aware content
- Always activate the virtual environment before running any model.
- Model 3 requires internet on first run (to download TrOCR).
- GPU usage (if available) can speed up TrOCR model inference.
- Make sure Tesseract is correctly installed and its path is set.
Pull requests are welcome! Suggestions for layout-aware models, table extraction, or handwriting recognition modules are highly appreciated.