This application detects and removes sensitive information from PDF files. It supports multiple languages, can process multiple PDFs in one batch, and generates a detailed report showing which words were deleted and where.
- Sensitive Information Detection: Uses OCR and Named Entity Recognition to detect sensitive information such as names, phone numbers, and email addresses.
- Multilingual Support: Supports English, Simplified Chinese, Traditional Chinese, and Korean.
- Batch Processing: Can process multiple PDF files at once.
- Detailed Reports: Generates a detailed report after processing, showing which words were deleted and their locations.
- Real-time Progress Updates: Provides real-time progress updates for batch processing using Socket.IO.
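To make the detection feature concrete, here is a minimal sketch of the pattern-based part only (email addresses and phone-like numbers). It is not the application's actual detector, which also relies on OCR and Named Entity Recognition. The regular expressions and the function names `find_sensitive_spans` and `mask` are assumptions chosen for illustration.

```python
import re

# Illustrative patterns only (assumptions); real formats vary widely by locale.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def find_sensitive_spans(text):
    """Return (label, start, end, match) tuples, sorted by start offset."""
    spans = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)):
        for m in pattern.finditer(text):
            spans.append((label, m.start(), m.end(), m.group()))
    return sorted(spans, key=lambda s: s[1])


def mask(text, spans, fill="*"):
    """Replace each detected span with fill characters of the same length."""
    chars = list(text)
    for _, start, end, _ in spans:
        chars[start:end] = fill * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    sample = "Mail jane@example.com or call +1 555-123-4567."
    print(mask(sample, find_sensitive_spans(sample)))
```

Names of people cannot be found reliably with patterns like these, which is why a Named Entity Recognition step is needed alongside them.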
- Python 3.7+
- pip (Python package installer)
- Tesseract OCR: Install Tesseract OCR and ensure that it is added to your system's PATH.
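One way to confirm the PATH requirement is a short standard-library check. This is a sketch, not part of the application; the helper names `find_tesseract` and `tesseract_version` are made up for illustration, and it assumes the executable is named `tesseract`.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`, or '' if nothing was printed."""
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(path, "->", tesseract_version(path))
```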
- Clone the repository:
git clone https://github.com/yourusername/pdf-data-masking.git
cd pdf-data-masking
- Create a virtual environment:
python -m venv env
- Activate the virtual environment:
- On Windows:
.\env\Scripts\activate
- On macOS/Linux:
source env/bin/activate
- Install the required packages:
pip install -r requirements.txt
Create a config.py file in the src directory with the following content:
INPUT_PDF_PATH = 'data/input/input.pdf'
OUTPUT_PDF_PATH = 'data/output/output.pdf'
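The two paths are relative, so they resolve against whatever directory the application is started from, and the output directory may not exist yet. The sketch below shows one way to resolve the paths and create the output directory before processing. The helper `prepare_paths` and its `base_dir` parameter are hypothetical and not part of the application.

```python
from pathlib import Path

# Values copied from the config.py content shown above.
INPUT_PDF_PATH = 'data/input/input.pdf'
OUTPUT_PDF_PATH = 'data/output/output.pdf'


def prepare_paths(input_path, output_path, base_dir="."):
    """Resolve both paths against base_dir and create the output directory if needed."""
    base = Path(base_dir)
    src = base / input_path
    dst = base / output_path
    # Only the output directory is created; the input file must already exist.
    dst.parent.mkdir(parents=True, exist_ok=True)
    return src, dst


if __name__ == "__main__":
    src, dst = prepare_paths(INPUT_PDF_PATH, OUTPUT_PDF_PATH)
    print("input exists:", src.is_file())
    print("output directory ready:", dst.parent.is_dir())
```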
- Run the Flask application:
python src/app.py
- Open your web browser and navigate to http://127.0.0.1:5000
- Select a PDF file to upload.
- Choose the language of the PDF.
- Optionally, enter custom sensitive words/patterns separated by commas.
- Click "Upload".
- Select multiple PDF files to upload.
- Choose the language of the PDFs.
- Optionally, enter custom sensitive words/patterns separated by commas.
- Click "Upload Multiple".
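The comma-separated custom words could be turned into a matcher roughly as follows. This is a sketch under stated assumptions: each entry is treated as a literal, case-insensitive term rather than a regular expression, and the function names `parse_custom_terms` and `compile_terms` are hypothetical. How the application itself interprets "patterns" is not specified here.

```python
import re


def parse_custom_terms(raw):
    """Split a comma-separated string into unique, non-empty terms, preserving order."""
    terms = []
    for part in raw.split(","):
        term = part.strip()
        if term and term not in terms:
            terms.append(term)
    return terms


def compile_terms(terms):
    """Build one case-insensitive regex matching any term literally, or None if empty."""
    if not terms:
        return None
    # Longest first, so that when one term is a prefix of another the longer one wins.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)


if __name__ == "__main__":
    pattern = compile_terms(parse_custom_terms("Acme Corp, project-x"))
    print(pattern.findall("ACME CORP launched Project-X"))
```

Escaping each term with `re.escape` keeps characters such as `.` or `(` in user input from being interpreted as regex syntax.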
- After processing, detailed reports showing the deleted words and their locations are generated and saved in the reports directory.
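The report format itself is not described here. As an illustration only, the sketch below writes one JSON file per PDF into a reports directory. The field names (`word`, `page`, `bbox`), the `_report.json` suffix, and the function `write_report` are all assumptions, not the application's actual format.

```python
import json
from pathlib import Path


def write_report(pdf_name, removals, reports_dir="reports"):
    """Write a JSON report for one PDF and return the path of the report file.

    `removals` is a list of dicts with hypothetical keys:
    word, page (1-based), and bbox as [x0, y0, x1, y1].
    """
    out_dir = Path(reports_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = {
        "file": pdf_name,
        "removed_count": len(removals),
        "removals": removals,
    }
    out_path = out_dir / (Path(pdf_name).stem + "_report.json")
    # ensure_ascii=False keeps Chinese and Korean words readable in the file.
    out_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
    return out_path


if __name__ == "__main__":
    removals = [{"word": "jane@example.com", "page": 1, "bbox": [72, 100, 180, 112]}]
    print(write_report("input.pdf", removals))
```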
- Ensure Tesseract is installed and added to your system's PATH.
- Verify that all required Python packages are installed using pip list.
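As a complement to reading the pip list output by eye, a short script can check whether specific modules are importable in the active environment. This is a sketch: `missing_modules` is a hypothetical helper, and the module names in the usage example are guesses based on the features described (Flask, Socket.IO, Tesseract). The authoritative list is requirements.txt.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them to match requirements.txt.
    candidates = ["flask", "flask_socketio", "pytesseract"]
    missing = missing_modules(candidates)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All checked modules are importable.")
```

Note that a package's import name can differ from the name shown by pip, so a miss here is a prompt to double-check rather than proof that the package is absent.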