A Python script that batch processes PDF files using OCRmyPDF to make them searchable through Optical Character Recognition (OCR).
This script automates the process of converting non-searchable PDF documents into searchable ones using OCR technology. It processes all PDF files in a specified input directory and saves the OCR-processed versions to an output directory.
- Python 3.x
- OCRmyPDF (must be installed and accessible from command line)
- Install OCRmyPDF:
    # For Ubuntu/Debian
    apt-get install ocrmypdf
    # For macOS
    brew install ocrmypdf
    # For Windows
    pip install ocrmypdf
- Clone this repository or download the script.
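Before the first run, it can help to confirm that the prerequisites are actually reachable. The following is a minimal sketch and not part of the script itself; the helper name `missing_prerequisites` is hypothetical, and it only checks that the Python version is new enough and that the `ocrmypdf` command can be found on `PATH`.

```python
import shutil
import sys


def missing_prerequisites(commands=("ocrmypdf",), min_python=(3, 0)):
    """Hypothetical helper: return a list of human-readable problems (empty if none)."""
    problems = []
    # Compare the running interpreter against the minimum version.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is required")
    # shutil.which returns None when a command is not on PATH.
    for name in commands:
        if shutil.which(name) is None:
            problems.append(f"'{name}' was not found on PATH")
    return problems
```

Calling `missing_prerequisites()` and printing the returned list gives a quick report; an empty list means both checks passed.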
Modify the following variables in the script to match your environment:
input_folder = "C:\\root\\archive\\ocr_pend" # Folder containing original PDFs
output_folder = "C:\\root\\archive" # Folder for processed PDFs
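As an optional variation on the settings above, Python raw strings avoid doubling every backslash in Windows paths, and `pathlib` can validate the folders up front. This is a sketch under the assumption that the script only needs the two folder values; `check_folders` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path

# Raw strings hold the same values as the escaped versions shown above.
input_folder = r"C:\root\archive\ocr_pend"   # Folder containing original PDFs
output_folder = r"C:\root\archive"           # Folder for processed PDFs


def check_folders(input_dir, output_dir):
    """Hypothetical helper: validate the input folder and create the output folder."""
    src = Path(input_dir)
    dst = Path(output_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Input folder does not exist: {src}")
    # Create the output folder (and any missing parents) if needed.
    dst.mkdir(parents=True, exist_ok=True)
    return src, dst
```

Calling `check_folders(input_folder, output_folder)` at startup fails early with a clear message when the input path is mistyped, rather than partway through a batch.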
- Place your PDF files in the input folder
- Run the script:
    python pdf_to_ocr.py
The script will:
- Process all PDF files in the input folder
- Apply OCR to make them searchable
- Save the processed files to the output folder
- Print progress messages for each file
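The steps above can be sketched roughly as follows. This is an illustration, not the actual contents of `pdf_to_ocr.py`: it assumes the script shells out to the `ocrmypdf` command in its basic `ocrmypdf <input> <output>` form, and the function names `plan_jobs`, `ocr_command`, and `run_batch` are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_jobs(input_folder, output_folder):
    """Pair every PDF in input_folder with a same-named target in output_folder."""
    src_dir = Path(input_folder)
    dst_dir = Path(output_folder)
    # Match the extension case-insensitively so "SCAN.PDF" is picked up too.
    pdfs = sorted(
        p for p in src_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(p, dst_dir / p.name) for p in pdfs]


def ocr_command(src, dst):
    """Build the basic OCRmyPDF invocation: ocrmypdf <input> <output>."""
    return ["ocrmypdf", str(src), str(dst)]


def run_batch(input_folder, output_folder, run=subprocess.run):
    """OCR each planned job; `run` is injectable so the loop can be exercised without OCRmyPDF."""
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_jobs(input_folder, output_folder):
        print(f"Processing {src.name} ...")
        # check=True makes a non-zero exit code raise CalledProcessError.
        run(ocr_command(src, dst), check=True)
        print(f"Saved {dst}")
```

A call such as `run_batch(input_folder, output_folder)` would process the configured folders. Only the top level of the input folder is scanned, so an input folder nested inside the output folder is not re-processed.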
- Batch processing of multiple PDF files
- Automatic creation of output directory if it doesn't exist
- Error handling and progress reporting
- Preservation of original file names
The script includes basic error handling that will:
- Print success messages for each successfully processed file
- Print error messages if processing fails for any file
- Continue processing remaining files even if one fails
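One way to get this continue-on-failure behavior is a `try`/`except` around each file, as in the sketch below. It assumes the OCR step is a subprocess call to `ocrmypdf`; the names `ocr_file` and `process_all` are hypothetical and the real script may be organized differently.

```python
import subprocess


def ocr_file(src, dst):
    """Run OCRmyPDF on one file; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def process_all(jobs, ocr=ocr_file):
    """Try every (src, dst) job and return the list of sources that failed."""
    failed = []
    for src, dst in jobs:
        try:
            ocr(src, dst)
        except (subprocess.CalledProcessError, OSError) as exc:
            # OSError also covers the case where ocrmypdf is not installed.
            print(f"Error processing {src}: {exc}")
            failed.append(src)
        else:
            print(f"Successfully processed {src}")
    return failed
```

Returning the failed sources lets the caller print a summary or retry those files after the batch finishes.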
Feel free to submit issues and enhancement requests!
This project is licensed under the MIT License - see the LICENSE file for details.
- Uses OCRmyPDF for PDF processing
⌨️ with 💻 by Raj Reddy
// Reach out if you find bugs in the matrix
// "Hello, World!" is just the beginning 🚀