PDF OCR Extraction and Synthetic Data Suite

A Streamlit application that provides a suite of tools for PDF processing. It features an OCR pipeline for extracting and correcting content from documents and a synthetic data generator for creating realistic scanned-document artifacts.

Author: enadream
Date: June 15, 2025

Key Features

- Dual Applications: A single interface provides access to two distinct apps: a PDF OCR Extractor and a Synthetic PDF Generator.
- Multi-Engine & Multi-Language OCR: Choose between Tesseract and EasyOCR. Supports multiple languages, including English and Turkish.
- Intelligent Layout Detection: Automatically identifies text blocks and images, assigning each a unique, searchable ID.
- Visual Debugging: Displays a labeled view of the processed page with colored bounding boxes for each detected content region.
- AI-Powered Spell Correction: Uses language-specific spaCy models to correct spelling errors in the raw OCR text.
- Configurable Synthetic Data: Generate realistic "scanned" documents by applying adjustable artifacts like blur, skew, noise, and ink smudges.
- Interactive UI: A web interface built with Streamlit, featuring page-specific searching and configurable options.
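
To make the "unique, searchable ID" idea concrete, here is a minimal sketch of how detected regions could be numbered and then searched. The `Region` class, `assign_ids`, and `search` are hypothetical names used for illustration only; they are not taken from the project's code, and the real layout detector may work differently.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Region:
    """One detected content region on a page (hypothetical structure)."""
    region_id: str     # e.g. "text_1" or "image_1"
    kind: str          # "text" or "image"
    bbox: tuple        # (x, y, width, height) in pixels
    content: str = ""  # OCR text for text regions, empty for images


def assign_ids(detections):
    """Give each (kind, bbox, content) detection a sequential per-kind ID."""
    counters = {}
    regions = []
    for kind, bbox, content in detections:
        counter = counters.setdefault(kind, count(1))
        regions.append(Region(f"{kind}_{next(counter)}", kind, bbox, content))
    return regions


def search(regions, query):
    """Return IDs whose ID or content contains the query, ignoring case."""
    q = query.lower()
    return [r.region_id for r in regions
            if q in r.region_id.lower() or q in r.content.lower()]


# Made-up detections for one page, in reading order.
detections = [
    ("text", (40, 30, 500, 80), "Quarterly Report"),
    ("image", (40, 130, 300, 200), ""),
    ("text", (40, 350, 500, 120), "Revenue grew in the second quarter."),
]
regions = assign_ids(detections)
print([r.region_id for r in regions])  # ['text_1', 'image_1', 'text_2']
print(search(regions, "quarter"))      # ['text_1', 'text_2']
```

Keeping a separate counter per region kind is what produces IDs such as text_1 and image_1 side by side on the same page.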

Technology Stack

- Backend: Python 3.11+
- UI: Streamlit
- OCR: Tesseract, EasyOCR
- Image Processing: OpenCV, Pillow, pdf2image
- AI/NLP: spaCy, contextualSpellCheck

Project Content

The application is a suite composed of two main tools accessible from the sidebar.

App 1: Synthetic PDF Generator

This tool is designed to create realistic test data for OCR models. It takes a clean, digital PDF and applies a series of augmentations to simulate the artifacts commonly found in scanned documents.

The process is as follows:

1. PDF to Image Conversion: The source PDF is converted into a sequence of high-resolution images.
2. Artifact Augmentation: Each image is processed to add random, configurable artifacts, including Gaussian Blur, Perspective Skew, Noise, Ink Smudges, and Brightness/Contrast Jitter.
3. PDF Re-assembly: The newly augmented images are combined into a final, synthetic PDF that looks like a real-world scanned document.
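
The augmentation step can be pictured as "sample random strengths within configured limits, then apply them per page". The sketch below is library-free so it runs anywhere; the project itself lists OpenCV and Pillow for image work. `ArtifactConfig`, its field names, and its default values are assumptions made for illustration. Blur and skew are only sampled here, not applied, and ink smudges are left out entirely, because those need a real image library.

```python
import random
from dataclasses import dataclass


@dataclass
class ArtifactConfig:
    """Upper bounds for each artifact (hypothetical names and defaults)."""
    max_blur_sigma: float = 1.5        # Gaussian blur strength
    max_skew_degrees: float = 3.0      # rotation in either direction
    max_noise_fraction: float = 0.02   # share of pixels forced to black/white
    max_contrast_jitter: float = 0.2   # contrast factor drawn from 1 +/- this
    max_brightness_shift: int = 25     # added to every pixel, either direction


def sample_page_params(config, rng):
    """Draw one random set of artifact strengths for a single page."""
    return {
        "blur_sigma": rng.uniform(0.0, config.max_blur_sigma),
        "skew_degrees": rng.uniform(-config.max_skew_degrees,
                                    config.max_skew_degrees),
        "noise_fraction": rng.uniform(0.0, config.max_noise_fraction),
        "contrast": rng.uniform(1 - config.max_contrast_jitter,
                                1 + config.max_contrast_jitter),
        "brightness": rng.randint(-config.max_brightness_shift,
                                  config.max_brightness_shift),
    }


def clamp(value):
    """Round to an int and keep it inside the 0-255 grayscale range."""
    return max(0, min(255, int(round(value))))


def apply_tone_and_noise(image, params, rng):
    """Apply contrast/brightness jitter and salt-and-pepper noise.

    `image` is a list of rows, each a list of 0-255 grayscale values.
    """
    out = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < params["noise_fraction"]:
                new_row.append(rng.choice((0, 255)))  # noise pixel
            else:
                # Scale around mid-gray for contrast, then shift brightness.
                toned = (pixel - 128) * params["contrast"] + 128
                new_row.append(clamp(toned + params["brightness"]))
        out.append(new_row)
    return out


rng = random.Random(42)                  # fixed seed for repeatable output
config = ArtifactConfig()
page = [[255] * 8 for _ in range(4)]     # tiny all-white "page"
params = sample_page_params(config, rng)
augmented = apply_tone_and_noise(page, params, rng)
```

Passing a seeded `random.Random` instance around, rather than using the global generator, makes a generated dataset reproducible from its seed.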

App 2: PDF OCR Extractor

This is the core extraction engine that digitizes PDF documents. Its pipeline runs in six stages, from page selection to searchable results.

Pipeline Explanation:

1. PDF Ingestion & Page Selection: The user uploads a PDF and can specify which pages to process (e.g., all, 1, 5, 2-8).
2. Image Conversion & Preprocessing: The selected PDF pages are converted to images, and a skew correction algorithm is applied to straighten text lines.
3. Layout Analysis & ID Generation: The system analyzes the page layout to differentiate text vs. image regions and assigns a unique, sequential ID to each (e.g., text_1, image_1).
4. Multi-Engine OCR: Text is extracted from the detected blocks using either Tesseract or EasyOCR.
5. AI-Powered Correction: Raw text is passed through a language-specific spaCy model to correct spelling and other common OCR errors.
6. Interactive Results: The final output is displayed in the UI, showing a labeled debug image and searchable content expanders corresponding to each ID.
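
The page selection input in the first stage (all, 1, 5, 2-8) could be parsed along the following lines. This is a sketch under assumptions: `parse_page_spec` is a hypothetical helper, and the choices to sort the result, merge duplicates, and reject out-of-range pages are illustrative rather than confirmed behavior of the app.

```python
def parse_page_spec(spec, page_count):
    """Turn 'all' or a list like '1, 5, 2-8' into sorted 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, page_count + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start > end:
            start, end = end, start  # accept reversed ranges like "8-2"
        if start < 1 or end > page_count:
            raise ValueError(f"pages {part!r} fall outside 1-{page_count}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_spec("all", 3))         # [1, 2, 3]
print(parse_page_spec("1, 5, 2-8", 10))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Non-numeric input raises `ValueError` from `int()`, so a caller can catch a single exception type and show the message in the UI.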

Installation

1. Prerequisites

Before starting, ensure you have Python 3.11+ and Git installed on your system.

2. Setup and Installation

Instructions are provided for Linux, Windows, and macOS.

🐧 Linux (Debian/Ubuntu/Arch) Installation

Clone the Repository

git clone https://github.com/enadream/ocr-streamlit.git
cd ocr-streamlit

Create and Activate a Virtual Environment

python3 -m venv .venv
source .venv/bin/activate

Install System Dependencies (Tesseract & Poppler)

On Debian/Ubuntu:

sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils

On Arch Linux:
```
sudo pacman -S tesseract poppler
```

Install Python Packages

This command installs all required Python libraries from the requirements.txt file.
```
pip install -r requirements.txt
```

Download Tesseract Language Models

On Debian/Ubuntu:

sudo apt-get install -y tesseract-ocr-eng tesseract-ocr-tur

On Arch Linux:

sudo pacman -S tesseract-data-eng tesseract-data-tur

Download spaCy AI Models

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl

🪟 Windows Installation

Clone the Repository

git clone https://github.com/enadream/ocr-streamlit.git
cd ocr-streamlit

Create and Activate a Virtual Environment

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install System Dependencies (Tesseract & Poppler)

- Tesseract: Download and run the official installer from Tesseract at UB Mannheim. During installation, make sure to check the box to "Add Tesseract to system PATH" and select the language packs for English and Turkish.
- Poppler: Download the latest Poppler for Windows binaries. Unzip the folder and add the full path to the bin directory (e.g., C:\Users\YourUser\Downloads\poppler-24.02.0\Library\bin) to your system's PATH environment variable.

Install Python Packages

This command installs all required Python libraries from the requirements.txt file.
```
pip install -r requirements.txt
```

Download spaCy AI Models

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl

🍎 macOS Installation

Clone the Repository

git clone https://github.com/enadream/ocr-streamlit.git
cd ocr-streamlit

Create and Activate a Virtual Environment

python3 -m venv .venv
source .venv/bin/activate

Install System Dependencies with Homebrew

If you don't have Homebrew, install it first.
```
brew install tesseract tesseract-lang poppler
```
Note: The standard Tesseract formula on Homebrew ships only the English data (plus the osd and snum support files). The tesseract-lang formula adds the other language packs, including Turkish.

Install Python Packages

This command installs all required Python libraries from the requirements.txt file.
```
pip install -r requirements.txt
```

Download spaCy AI Models

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
pip install https://huggingface.co/turkish-nlp-suite/tr_core_news_lg/resolve/main/tr_core_news_lg-1.0-py3-none-any.whl
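
On any of the three platforms, a quick check can confirm that the system tools and spaCy models are visible before the first run. The script below is a sketch, not part of the project. It assumes the Poppler install exposes the `pdftoppm` binary on PATH and that the two model packages import as `en_core_web_sm` and `tr_core_news_lg`, matching the package names in the download URLs above.

```python
import importlib.util
import shutil

# Command-line tools expected on PATH (Tesseract and Poppler's pdftoppm).
REQUIRED_BINARIES = ("tesseract", "pdftoppm")
# spaCy model packages, named as in the pip install URLs.
REQUIRED_MODELS = ("en_core_web_sm", "tr_core_news_lg")


def check_environment():
    """Return a {name: found} report for each binary and model package."""
    report = {}
    for name in REQUIRED_BINARIES:
        report[name] = shutil.which(name) is not None
    for name in REQUIRED_MODELS:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for name, found in check_environment().items():
        status = "ok     " if found else "MISSING"
        print(f"{status}  {name}")
```

Run it inside the activated virtual environment; a MISSING entry for a binary usually points to a PATH problem rather than a failed install.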

How to Run the App

After completing the installation, you can run the application with a single command from the project's root directory.

On Linux / macOS

# Ensure your virtual environment is active
source .venv/bin/activate

# Run the app
python -m app.main

On Windows

# Ensure your virtual environment is active
.\.venv\Scripts\Activate.ps1

# Run the app
python -m app.main

This will launch the Streamlit application in a new browser tab.
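
For readers wondering how a plain `python -m app.main` can end up starting a Streamlit server, one common pattern is a small launcher that re-invokes Streamlit as a subprocess. The sketch below shows that pattern only; the contents of the real app/main.py are not shown in this README, and the UI script path is an assumption taken from the project structure. `build_streamlit_command` and `launch` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def build_streamlit_command(ui_script):
    """Build the argument list for `python -m streamlit run <script>`."""
    return [sys.executable, "-m", "streamlit", "run", str(ui_script)]


def launch(command):
    """Start Streamlit and block until the server exits."""
    subprocess.run(command, check=True)


# Assumed location of the UI script, based on the project structure.
command = build_streamlit_command(Path("app") / "ui" / "main_ui.py")
print(" ".join(command))
# Calling launch(command) from the project root would start the server.
```

Using `sys.executable` keeps the launch inside the active virtual environment instead of whichever `streamlit` happens to be first on PATH.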

Project Structure

project/
|---- requirements.txt
|---- README.md
|---- app/
    |---- __init__.py
    |---- main.py
    |---- core/
    |   |---- __init__.py
    |   |---- config.py
    |   |---- image_processor.py
    |   |---- layout_detector.py
    |   |---- ocr_extractor.py
    |   |---- pdf_handler.py
    |---- data/
    |---- ui/
    |   |---- main_ui.py
    |---- utils/
        |---- __init__.py
        |---- spell_checker.py
        |---- synthetic_generator/
            |---- __init__.py
            |---- config.py
            |---- image_augmentor.py
            |---- pdf_processor.py

License

This project is proprietary and confidential. You may not copy, distribute, or share the source code without the express written permission of the author (enadream).
