PDF Search Plus is a Python application that extracts text from PDF pages, applies OCR (Optical Character Recognition) to the images embedded in them, and stores the results in a SQLite database. A graphical user interface (GUI) built with Tkinter lets you search and preview the PDF content, including OCR-extracted text.
- Features
- Installation
- Python Dependencies
- Usage
- Package Structure
- Database Schema
- Performance Optimizations
- Security Features
- Contributing
- License
- Acknowledgements
- Future Enhancements
- Extracts and stores text from PDF pages
- Extracts images from PDF pages and applies OCR using Tesseract
- Stores image metadata and OCR-extracted text in a SQLite database
- Provides a user-friendly GUI for searching through the stored data
- Allows for both single-file and folder-based (batch) PDF processing
- Enables preview of PDFs with zoom and navigation features
- Security features including input validation, sanitization, and SQL injection protection
- Caching system for PDF pages, search results, and images to improve performance
- Memory management for efficiently handling large PDFs
- Pagination for search results to handle large document collections
- Robust search capabilities with optimized Full-Text Search for fast and accurate results
- Document categorization and tagging for better organization of PDF files
- PDF annotations for highlighting and adding notes to documents
- Document similarity search for finding related documents
- Memory-aware caching that adapts to system resources for optimal performance
- Clone the repository:
git clone https://github.com/Ap6pack/pdf-search-plus.git
cd pdf-search-plus
- Install the dependencies:
pip install -r requirements.txt
The application requires the Tesseract OCR command-line tool to be installed on your system:
- On Ubuntu:
sudo apt install tesseract-ocr
- On macOS (using Homebrew):
brew install tesseract
- On Windows: Download and install from Tesseract OCR for Windows.
Ensure that the tesseract command is in your system's PATH. The application calls this command directly rather than using a Python wrapper.
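Because the application depends on the tesseract executable being reachable, it can help to check for it before processing any files. The following is a minimal sketch using only the standard library; the helper names find_tesseract and tesseract_version are hypothetical and are not part of PDF Search Plus.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(command: str = "tesseract") -> Optional[str]:
    """Return the full path of the OCR executable, or None if it is not on PATH."""
    return shutil.which(command)


def tesseract_version(command: str = "tesseract") -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = find_tesseract(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract was not found on PATH")
```

If the script reports that tesseract was not found, revisit the platform-specific installation step and confirm that the install directory is on PATH.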
All Python dependencies are specified in the requirements.txt file and should be installed as mentioned in the Installation section above.
When installing the requirements, you may encounter dependency conflicts, particularly with numpy versions. If you see errors related to numpy version conflicts (e.g., with packages like thinc or spacy), you may need to uninstall the conflicting packages:
pip uninstall -y thinc spacy
pip install -r requirements.txt
This is because the application requires numpy<2.0 for compatibility with pandas 2.2.0, which may conflict with other packages that require numpy>=2.0.0.
The application is designed to be simple to use. Just run the main script and everything will be set up automatically:
python run_pdf_search.py
The database will be automatically created or validated when you run the application. No separate setup steps are required.
The application supports several command-line options:
- --verbose, -v: Enable verbose logging
- --process-file FILE: Process a single PDF file without launching the GUI
- --process-folder FOLDER: Process all PDF files in a folder without launching the GUI
- --search TERM: Search for a term in the database without launching the GUI
- --max-workers N: Maximum number of worker threads for batch processing (default: 5)
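For readers who want to see how such options are typically declared, here is a sketch of an equivalent argparse parser. The option names and the default of 5 come from the list above; the function names build_parser and wants_gui are assumptions, and the real parser in the project may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented command-line options."""
    parser = argparse.ArgumentParser(prog="run_pdf_search.py")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("--process-file", metavar="FILE",
                        help="Process a single PDF file without launching the GUI")
    parser.add_argument("--process-folder", metavar="FOLDER",
                        help="Process all PDF files in a folder without launching the GUI")
    parser.add_argument("--search", metavar="TERM",
                        help="Search for a term in the database without launching the GUI")
    parser.add_argument("--max-workers", metavar="N", type=int, default=5,
                        help="Maximum number of worker threads for batch processing")
    return parser


def wants_gui(args: argparse.Namespace) -> bool:
    """Assumed rule: the GUI starts only when no headless action was requested."""
    return not (args.process_file or args.process_folder or args.search)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("launch GUI" if wants_gui(args) else "run headless")
```

argparse converts the dashes in option names to underscores, so --max-workers is read back as args.max_workers.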
- Launch the GUI with verbose logging:
python run_pdf_search.py --verbose
- Process a single PDF file from the command line:
python run_pdf_search.py --process-file path/to/document.pdf
- Process a folder of PDF files:
python run_pdf_search.py --process-folder path/to/folder
- Search the database from the command line:
python run_pdf_search.py --search "search term"
You can also run the application as a Python module:
python -m pdf_search_plus.main
- Processing PDF Files:
  - Click "Process PDF" in the main window
  - Choose between single file or folder (batch) processing
  - Select the PDF file or folder to process
  - Wait for the processing to complete
- Searching for Text:
  - Click "Search PDFs" in the main window
  - Enter a search term in the context field
  - Toggle "Use Full-Text Search" option for faster searches on large collections
  - Click "Search"
  - View the results showing PDF file name, page number, and matching context
  - Use pagination controls to navigate through large result sets
- Previewing PDF Pages:
  - Select a search result
  - Click "Preview PDF"
  - Use the navigation buttons to move between pages
  - Use the zoom buttons to adjust the view
pdf_search_plus/
├── __init__.py
├── main.py
├── core/
│ ├── __init__.py
│ ├── pdf_processor.py
│ └── ocr/
│ ├── __init__.py
│ ├── base.py
│ └── tesseract.py
├── gui/
│ ├── __init__.py
│ └── search_app.py
└── utils/
├── __init__.py
├── db.py
├── cache.py
├── memory.py
├── security.py
├── tag_manager.py
├── annotation_manager.py
└── similarity_search.py
The application stores PDF data in an SQLite database called pdf_data.db with the following structure:
- pdf_files: Stores metadata for each processed PDF file
  - id: Primary key
  - file_name: Name of the PDF file
  - file_path: Path to the PDF file
  - created_at: Timestamp when the record was created
  - last_accessed: Timestamp when the record was last accessed
- pages: Stores text extracted from each PDF page
  - id: Primary key
  - pdf_id: Foreign key to pdf_files
  - page_number: Page number
  - text: Extracted text
- images: Stores metadata about extracted images from the PDF
  - id: Primary key
  - pdf_id: Foreign key to pdf_files
  - page_number: Page number
  - image_name: Name of the image
  - image_ext: Image extension
- ocr_text: Stores the text extracted via OCR from images
  - id: Primary key
  - pdf_id: Foreign key to pdf_files
  - page_number: Page number
  - ocr_text: Text extracted via OCR
- tags: Stores document tags for categorization
  - id: Primary key
  - name: Tag name
  - color: Tag color (hex code)
  - created_at: Timestamp when the tag was created
- categories: Stores hierarchical document categories
  - id: Primary key
  - name: Category name
  - parent_id: Foreign key to parent category (for hierarchical structure)
  - created_at: Timestamp when the category was created
- pdf_tags: Many-to-many relationship between PDFs and tags
  - pdf_id: Foreign key to pdf_files
  - tag_id: Foreign key to tags
  - created_at: Timestamp when the relationship was created
- pdf_categories: Many-to-many relationship between PDFs and categories
  - pdf_id: Foreign key to pdf_files
  - category_id: Foreign key to categories
  - created_at: Timestamp when the relationship was created
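To make the structure above concrete, here is a sketch that creates four of the listed tables in an in-memory SQLite database. The table and column names are taken from the list; the column types, NOT NULL constraints, defaults, and the composite primary key on pdf_tags are assumptions, since the actual DDL is not shown here.

```python
import sqlite3

# Assumed DDL: names follow the documented schema, types and constraints are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_files (
    id INTEGER PRIMARY KEY,
    file_name TEXT NOT NULL,
    file_path TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_accessed TIMESTAMP
);
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY,
    pdf_id INTEGER NOT NULL REFERENCES pdf_files(id),
    page_number INTEGER NOT NULL,
    text TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    color TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS pdf_tags (
    pdf_id INTEGER NOT NULL REFERENCES pdf_files(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (pdf_id, tag_id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the tables if they do not exist and enforce foreign keys."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# Insert one PDF record and one page that references it.
pdf_id = conn.execute(
    "INSERT INTO pdf_files (file_name, file_path) VALUES (?, ?)",
    ("report.pdf", "/docs/report.pdf"),
).lastrowid
conn.execute(
    "INSERT INTO pages (pdf_id, page_number, text) VALUES (?, ?, ?)",
    (pdf_id, 1, "Quarterly results"),
)
```

SQLite does not enforce foreign keys unless the foreign_keys pragma is enabled on each connection, which is why the sketch sets it before creating the tables.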
The application provides robust search capabilities:
- Optimized Full-Text Search: Uses FTS5 virtual tables with porter stemming for fast and accurate text matching
- Tag-Based Search: Find documents by assigned tags with options for ANY or ALL tag matching
- Category-Based Organization: Browse documents by hierarchical categories
- Combined Search: Search by text content and tags simultaneously
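The full-text search described above can be illustrated with a small FTS5 table that uses the porter tokenizer. This is a sketch only: the table name pages_fts, the sample rows, and the search helper with its page and page_size parameters are hypothetical, and it assumes the SQLite library bundled with Python was built with FTS5.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Hypothetical table name. Requires an SQLite build that includes FTS5.
conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(text, tokenize='porter')")
conn.executemany(
    "INSERT INTO pages_fts (rowid, text) VALUES (?, ?)",
    [
        (1, "Scanning invoices and running OCR"),
        (2, "Annual budget report"),
        (3, "Documentation for scanned forms"),
    ],
)


def search(conn: sqlite3.Connection, query: str,
           page: int = 1, page_size: int = 10) -> List[int]:
    """Return the rowids for one page of matches, best match first."""
    offset = (page - 1) * page_size
    cursor = conn.execute(
        "SELECT rowid FROM pages_fts WHERE pages_fts MATCH ? "
        "ORDER BY rank LIMIT ? OFFSET ?",
        (query, page_size, offset),
    )
    return [row[0] for row in cursor]


# Stemming: "scan" matches both "Scanning" and "scanned".
print(sorted(search(conn, "scan")))
# Prefix query: "doc*" matches "Documentation".
print(search(conn, "doc*"))
```

The query string is passed as a bound parameter, and LIMIT with OFFSET gives simple pagination over large result sets.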
The database includes optimized indexes for better performance:
- Compound indexes on pdf_id and page_number for faster joins
- Specialized indexes for text columns for faster searching
- Indexes on file name and path for faster lookups
- Indexes for tag and category relationships
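As an illustration of the compound index on pdf_id and page_number, the sketch below creates one on a simplified pages table and asks SQLite which access path it would choose. The index name idx_pages_pdf_page and the helper query_plan are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        pdf_id INTEGER,
        page_number INTEGER,
        text TEXT
    );
    -- Hypothetical index name; the columns match the documented compound index.
    CREATE INDEX idx_pages_pdf_page ON pages (pdf_id, page_number);
    """
)


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan = query_plan(
    conn,
    "SELECT text FROM pages WHERE pdf_id = ? AND page_number = ?",
    (1, 1),
)
print(plan)
```

When the plan mentions the index name, the lookup is served by the index rather than by a full table scan.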
- Memory-Aware Caching: The application monitors system memory and adapts cache size dynamically
- Optimized FTS5 Search: Uses porter stemming and prefix matching for faster and more accurate searches
- Memory Management: Large PDFs are processed in a streaming fashion to reduce memory usage
- Batch Processing: Images are processed in batches to limit memory consumption
- Time-Based Cache Expiration: Automatically expires cached items after a specified time
- Pagination: Search results are paginated to handle large result sets efficiently
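Time-based expiration and a bounded cache size can be combined in a few lines. The class below is a simplified sketch, not the project's cache: the name TTLCache and its parameters are hypothetical, and it uses a fixed item limit where the application is described as adapting the size to available memory.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TTLCache:
    """Least-recently-used cache whose entries expire after ttl_seconds."""

    def __init__(self, max_items: int = 128, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._items: "OrderedDict[Hashable, tuple]" = OrderedDict()
        self._max_items = max_items
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key: Hashable, default: Any = None) -> Any:
        entry = self._items.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry is too old: drop it and report a miss.
            del self._items[key]
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        # Evict least recently used entries once the size limit is exceeded.
        while len(self._items) > self._max_items:
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

A memory-aware variant could lower max_items when the system reports memory pressure; that logic is left out of this sketch.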
- Enhanced Input Validation: All user inputs are validated with comprehensive checks
- Secure Path Validation: File paths are validated to prevent path traversal attacks
- Secure Temporary Files: Temporary files are created with proper permissions and cleanup
- Text Sanitization: All text is sanitized to prevent XSS and other injection attacks
- SQL Injection Protection: Parameterized queries are used throughout the application
- Memory Pressure Detection: The application monitors and responds to system memory pressure
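Two of the measures above, path validation and parameterized queries, are easy to show in isolation. The following sketch makes several assumptions: the function names validate_pdf_path and find_by_name are hypothetical, the allowed root directory is supplied by the caller, and the project's own checks in utils/security.py may differ.

```python
import os
import sqlite3
from typing import List, Tuple


def validate_pdf_path(candidate: str, allowed_root: str) -> str:
    """Resolve candidate under allowed_root and reject anything that escapes it."""
    root = os.path.realpath(allowed_root)
    resolved = os.path.realpath(os.path.join(root, candidate))
    # commonpath equals root only when resolved is root itself or lies below it.
    if os.path.commonpath([root, resolved]) != root:
        raise ValueError(f"path escapes the allowed folder: {candidate!r}")
    if not resolved.lower().endswith(".pdf"):
        raise ValueError(f"not a PDF file: {candidate!r}")
    return resolved


def find_by_name(conn: sqlite3.Connection, file_name: str) -> List[Tuple[int, str]]:
    """Look up PDFs by name; the value is bound as a parameter, never interpolated."""
    return conn.execute(
        "SELECT id, file_name FROM pdf_files WHERE file_name = ?",
        (file_name,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdf_files (id INTEGER PRIMARY KEY, file_name TEXT)")
conn.execute("INSERT INTO pdf_files (file_name) VALUES (?)", ("report.pdf",))

print(find_by_name(conn, "report.pdf"))
# A classic injection string is treated as a literal file name and matches nothing.
print(find_by_name(conn, "x' OR '1'='1"))
```

Resolving the path before comparing it means that ".." segments and symbolic links are taken into account, not just the literal text of the input.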
Contributions are welcome! Here's how you can contribute to PDF Search Plus:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Please make sure to update tests as appropriate and adhere to the existing coding style.
This project is licensed under the MIT License - see the LICENSE file for details.
- PyMuPDF for PDF processing capabilities
- Tesseract OCR for text recognition
- SQLite for database functionality
- All contributors who have helped improve this project
The application supports document tagging and categorization:
- Tags: Assign colored tags to documents for quick identification and filtering
- Categories: Organize documents in hierarchical categories
- Tag-Based Search: Find documents by their assigned tags
- Multiple Tags: Assign multiple tags to each document
- Tag Management: Create, update, and delete tags
- Category Hierarchy: Create nested categories for better organization
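Tag-based search with ANY or ALL matching can be expressed as a single grouped query over the tags and pdf_tags tables. The sketch below assumes that tag names are unique; the function name find_pdfs_by_tags and the sample data are hypothetical.

```python
import sqlite3
from typing import List, Sequence


def find_pdfs_by_tags(conn: sqlite3.Connection, tag_names: Sequence[str],
                      match_all: bool = False) -> List[int]:
    """Return ids of PDFs carrying ANY (default) or ALL of the given tags."""
    names = list(dict.fromkeys(tag_names))  # drop duplicates, keep order
    if not names:
        return []
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT pt.pdf_id
        FROM pdf_tags AS pt
        JOIN tags AS t ON t.id = pt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY pt.pdf_id
        HAVING COUNT(DISTINCT t.name) >= ?
        ORDER BY pt.pdf_id
    """
    # ALL requires every requested tag to be present; ANY requires at least one.
    required = len(names) if match_all else 1
    return [row[0] for row in conn.execute(sql, (*names, required))]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE, color TEXT);
    CREATE TABLE pdf_tags (pdf_id INTEGER, tag_id INTEGER);
    INSERT INTO tags (id, name, color) VALUES (1, 'invoice', '#ff0000'), (2, '2024', '#00ff00');
    INSERT INTO pdf_tags (pdf_id, tag_id) VALUES (1, 1), (1, 2), (2, 1);
    """
)

print(find_pdfs_by_tags(conn, ["invoice", "2024"]))                  # ANY
print(find_pdfs_by_tags(conn, ["invoice", "2024"], match_all=True))  # ALL
```

Only the placeholder markers are formatted into the SQL text; the tag names themselves are always passed as bound parameters.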
The application now supports PDF annotations:
- Highlight Text: Highlight important text in documents
- Add Notes: Add notes to specific parts of documents
- Multiple Annotation Types: Support for highlights, notes, underlines, and more
- Annotation Search: Search for text within annotations
- Color Coding: Assign different colors to annotations for better organization
Find similar documents based on content:
- TF-IDF Vectorization: Convert document text into numerical vectors
- Cosine Similarity: Measure similarity between documents
- Document Clustering: Group similar documents together
- Text-Based Search: Find documents similar to a text query
- Threshold Control: Adjust similarity threshold for more or fewer results
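The combination of TF-IDF vectors, cosine similarity, and a threshold can be sketched with the standard library alone. This is an illustration of the technique under stated assumptions, not the code in utils/similarity_search.py: the function names, the simple tokenizer, and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_documents(docs: List[str], index: int,
                      threshold: float = 0.1) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs at or above threshold, best first."""
    vectors = tfidf_vectors(docs)
    scored = [(i, cosine(vectors[index], vec))
              for i, vec in enumerate(vectors) if i != index]
    return sorted((pair for pair in scored if pair[1] >= threshold),
                  key=lambda pair: pair[1], reverse=True)


docs = [
    "invoice for office supplies",
    "office supplies invoice march",
    "hiking trail map",
]
print(similar_documents(docs, 0))
```

Raising the threshold returns fewer, closer matches; lowering it returns more results, including weaker ones.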
- Add support for exporting search results
- Improve image OCR accuracy with advanced preprocessing
- Support for more languages in OCR
- Add support for PDF form field extraction
- Enhance tag visualization with tag clouds