A Streamlit application that provides a suite of tools for PDF processing. It features an OCR pipeline for extracting and correcting content from documents and a synthetic data generator for creating realistic scanned-document artifacts.
Author: enadream
Date: June 15, 2025
- Dual Applications: A single interface provides access to two distinct apps: a PDF OCR Extractor and a Synthetic PDF Generator.
- Multi-Engine & Multi-Language OCR: Choose between Tesseract and EasyOCR. Supports multiple languages including English and Turkish.
- Intelligent Layout Detection: Automatically identifies text blocks and images, assigning each a unique, searchable ID.
- Visual Debugging: Displays a labeled view of the processed page with colored bounding boxes for each detected content region.
- AI-Powered Spell Correction: Uses language-specific spaCy models to correct spelling mistakes in the raw OCR text.
- Configurable Synthetic Data: Generate realistic "scanned" documents by applying adjustable artifacts like blur, skew, noise, and ink smudges.
- Interactive UI: A clean and user-friendly web interface built with Streamlit, featuring page-specific searching and configurable options.
- Backend: Python 3.11+
- UI: Streamlit
- OCR: Tesseract, EasyOCR
- Image Processing: OpenCV, Pillow, pdf2image
- AI/NLP: spaCy, contextualSpellCheck
The application is a suite composed of two main tools accessible from the sidebar.
The Synthetic PDF Generator creates realistic test data for OCR models. It takes a clean, digital PDF and applies a series of augmentations to simulate the artifacts commonly found in scanned documents.
The process is as follows:
- PDF to Image Conversion: The source PDF is converted into a sequence of high-resolution images.
- Artifact Augmentation: Each image is processed to add random, configurable artifacts, including Gaussian Blur, Perspective Skew, Noise, Ink Smudges, and Brightness/Contrast Jitter.
- PDF Re-assembly: The newly augmented images are combined into a final, synthetic PDF that looks like a real-world scanned document.
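The "random, configurable artifacts" step can be pictured as sampling one set of parameters per page from user-adjustable ranges. The sketch below is only an illustration: `ArtifactRanges`, `sample_page_artifacts`, `plan_document`, and every default range are hypothetical names and values, not taken from this project's `config.py`. It only plans the parameters; applying them to pixels would be done with OpenCV or Pillow.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactRanges:
    """Hypothetical (min, max) ranges for each artifact; all defaults are assumptions."""
    blur_sigma: tuple[float, float] = (0.0, 1.5)         # Gaussian blur strength
    skew_strength: tuple[float, float] = (0.0, 0.02)     # corner shift as a fraction of page width
    noise_std: tuple[float, float] = (0.0, 12.0)         # pixel noise standard deviation
    smudge_count: tuple[int, int] = (0, 3)               # number of ink smudges
    brightness_shift: tuple[float, float] = (-20.0, 20.0)
    contrast_gain: tuple[float, float] = (0.85, 1.15)


def sample_page_artifacts(ranges: ArtifactRanges, rng: random.Random) -> dict:
    """Draw one concrete value per artifact from its configured range."""
    return {
        "blur_sigma": rng.uniform(*ranges.blur_sigma),
        "skew_strength": rng.uniform(*ranges.skew_strength),
        "noise_std": rng.uniform(*ranges.noise_std),
        "smudge_count": rng.randint(*ranges.smudge_count),
        "brightness_shift": rng.uniform(*ranges.brightness_shift),
        "contrast_gain": rng.uniform(*ranges.contrast_gain),
    }


def plan_document(page_count: int, ranges: ArtifactRanges, seed: int) -> list[dict]:
    """Sample artifacts for every page; a fixed seed makes the plan reproducible."""
    rng = random.Random(seed)
    return [sample_page_artifacts(ranges, rng) for _ in range(page_count)]
```

Seeding the generator is a design choice worth considering for test data: the same seed and ranges reproduce the same synthetic document, which makes OCR regressions easier to compare.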
The PDF OCR Extractor is the core extraction engine that digitizes PDF documents. The pipeline is designed for accuracy and usability.
Pipeline Explanation:
- PDF Ingestion & Page Selection: The user uploads a PDF and can specify which pages to process (e.g., `all`, `1, 5`, `2-8`).
- Image Conversion & Preprocessing: The selected PDF pages are converted to images, and a skew correction algorithm is applied to straighten text lines.
- Layout Analysis & ID Generation: The system analyzes the page layout to differentiate text vs. image regions and assigns a unique, sequential ID to each (e.g., `text_1`, `image_1`).
- Multi-Engine OCR: Text is extracted from the detected blocks using either Tesseract or EasyOCR.
- AI-Powered Correction: Raw text is passed through a language-specific spaCy model to correct spelling and other common OCR errors.
- Interactive Results: The final output is displayed in the UI, showing a labeled debug image and searchable content expanders corresponding to each ID.
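Two of the steps above, page selection and ID generation, are simple enough to sketch in plain Python. This is a minimal illustration and not the project's code: `parse_page_selection` and `assign_region_ids` are hypothetical names, and the choice to keep a separate counter per region type (so IDs run `text_1`, `text_2`, `image_1`) is an assumption based on the example IDs.

```python
def parse_page_selection(spec: str, page_count: int) -> list[int]:
    """Turn a selection such as 'all', '1, 5', or '2-8' into sorted 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"Page selection {part!r} is outside 1-{page_count}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def assign_region_ids(region_kinds: list[str]) -> list[str]:
    """Give each detected region an ID like 'text_1', counting each kind separately."""
    counters: dict[str, int] = {}
    ids = []
    for kind in region_kinds:
        counters[kind] = counters.get(kind, 0) + 1
        ids.append(f"{kind}_{counters[kind]}")
    return ids
```

For example, `parse_page_selection("1, 5, 2-8", 10)` yields pages 1 through 8 once duplicates are merged, and `assign_region_ids(["text", "image", "text"])` yields `text_1`, `image_1`, `text_2`.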
Before starting, ensure you have Python 3.11+ and Git installed on your system.
Instructions are provided for Linux, Windows, and macOS.
🐧 Linux (Debian/Ubuntu/Arch) Installation
- Clone the Repository
      git clone https://github.com/enadream/ocr-streamlit.git
      cd ocr-streamlit
- Create and Activate a Virtual Environment
      python3 -m venv .venv
      source .venv/bin/activate
- Install System Dependencies (Tesseract & Poppler)
  - On Debian/Ubuntu:
        sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils
  - On Arch Linux:
        sudo pacman -S tesseract poppler
- Install Python Packages
  This command installs all required Python libraries from the `requirements.txt` file.
      pip install -r requirements.txt
- Download Tesseract Language Models
  - On Debian/Ubuntu:
        sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-tur
  - On Arch Linux:
        sudo pacman -S tesseract-data-eng tesseract-data-tur
- Download spaCy AI Models
      pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
      pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl
🪟 Windows Installation
- Clone the Repository
      git clone https://github.com/enadream/ocr-streamlit.git
      cd ocr-streamlit
- Create and Activate a Virtual Environment
      python -m venv .venv
      .\.venv\Scripts\Activate.ps1
- Install System Dependencies (Tesseract & Poppler)
  - Tesseract: Download and run the official installer from Tesseract at UB Mannheim. During installation, make sure to check the box to "Add Tesseract to system PATH" and select the language packs for English and Turkish.
  - Poppler: Download the latest Poppler for Windows binaries. Unzip the folder and add the full path to the `bin` directory (e.g., `C:\Users\YourUser\Downloads\poppler-24.02.0\Library\bin`) to your system's PATH environment variable.
- Install Python Packages
  This command installs all required Python libraries from the `requirements.txt` file.
      pip install -r requirements.txt
- Download spaCy AI Models
      pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
      pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl
🍎 macOS Installation
- Clone the Repository
      git clone https://github.com/enadream/ocr-streamlit.git
      cd ocr-streamlit
- Create and Activate a Virtual Environment
      python3 -m venv .venv
      source .venv/bin/activate
- Install System Dependencies with Homebrew
If you don't have Homebrew, install it first.
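Note: The Homebrew `tesseract` formula bundles only the English, `osd`, and `snum` data files. The separate `tesseract-lang` formula provides the remaining languages, including Turkish.
      brew install tesseract tesseract-lang poppler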
- Install Python Packages
  This command installs all required Python libraries from the `requirements.txt` file.
      pip install -r requirements.txt
- Download spaCy AI Models
      pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
      pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl
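On any platform, a quick check like the one below can confirm that the system tools and spaCy models from the steps above are discoverable before launching the app. It is a sketch that is not part of the repository: the function names are hypothetical, and it assumes that `tesseract` and Poppler's `pdftoppm` must be on PATH and that the two spaCy models install as importable packages named `en_core_web_sm` and `tr_core_news_lg`.

```python
import importlib.util
import shutil

# Executables expected on PATH: Tesseract itself and Poppler's pdftoppm (used by pdf2image).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")
# spaCy models install as regular Python packages with these names.
REQUIRED_MODELS = ("en_core_web_sm", "tr_core_news_lg")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the executables that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def find_missing_modules(modules=REQUIRED_MODELS):
    """Return the Python packages that are not importable in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools() + find_missing_modules()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Run it inside the activated virtual environment so that the model lookup inspects the same interpreter the app will use.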
After completing the installation, you can run the application with a single command from the project's root directory.
- On Linux/macOS:
      # Ensure your virtual environment is active
      source .venv/bin/activate
      # Run the app
      python -m app.main
- On Windows (PowerShell):
      # Ensure your virtual environment is active
      .\.venv\Scripts\Activate.ps1
      # Run the app
      python -m app.main
This will launch the Streamlit application in a new browser tab.
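Streamlit apps are normally started with `streamlit run`, so a module entry point such as `app.main` has to hand control to Streamlit in some way. One common pattern, shown here purely as an assumption about how such a launcher could work and not as this project's actual `main.py`, is to re-invoke Streamlit through the current interpreter. The helper name and the choice of `ui/main_ui.py` as the Streamlit script are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_streamlit_command(script: Path) -> list[str]:
    """Build a `python -m streamlit run <script>` command for the active interpreter."""
    # Using sys.executable keeps the launch inside the current virtual environment.
    return [sys.executable, "-m", "streamlit", "run", str(script)]


def main() -> None:
    """Launch the UI script with Streamlit and exit with its return code."""
    # Hypothetical location of the Streamlit script, relative to this file.
    script = Path(__file__).resolve().parent / "ui" / "main_ui.py"
    raise SystemExit(subprocess.call(build_streamlit_command(script)))

# A real entry point would call main() under an `if __name__ == "__main__":` guard.
```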
project/
|---- requirements.txt
|---- README.md
|---- app/
|---- __init__.py
|---- main.py
|---- core/
| |---- __init__.py
| |---- config.py
| |---- image_processor.py
| |---- layout_detector.py
| |---- ocr_extractor.py
| |---- pdf_handler.py
|---- data/
|---- ui/
| |---- main_ui.py
|---- utils/
|---- __init__.py
|---- spell_checker.py
|---- synthetic_generator/
|---- __init__.py
|---- config.py
|---- image_augmentor.py
|---- pdf_processor.py
This project is proprietary and confidential. You may not copy, distribute, or share the source code without the express written permission of the author (enadream).