PDF Data Masking Application

This application provides functionality to detect and remove sensitive information from PDF files. It supports multiple languages and can process multiple PDFs simultaneously. The application also generates detailed reports showing which words were deleted and their locations.

Features

- Sensitive Information Detection: Uses OCR and Named Entity Recognition to detect sensitive information such as names, phone numbers, and email addresses.
- Multilingual Support: Supports English, Simplified Chinese, Traditional Chinese, and Korean.
- Batch Processing: Can process multiple PDF files at once.
- Detailed Reports: Generates a detailed report after processing, showing which words were deleted and their locations.
- Real-time Progress Updates: Provides real-time progress updates for batch processing using Socket.IO.
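
As an illustration of the pattern-based part of detection, the sketch below finds email addresses and phone-like numbers in text that OCR has already extracted. The regular expressions and the `find_sensitive_spans` name are assumptions made for this example, not the application's actual rules. Detecting names would need a Named Entity Recognition model, which is not shown here.

```python
import re

# Illustrative patterns only; the application's real rules may differ.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_sensitive_spans(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.group(), match.start(), match.end()))
    return sorted(spans, key=lambda span: span[2])


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
    for span in find_sensitive_spans(sample):
        print(span)
```

The character offsets returned with each match are what a masking step would need in order to map a hit back to a location in the page text.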

Prerequisites

- Python 3.7+
- pip (Python package installer)
- Tesseract OCR: The OCR engine must be installed separately, since pip does not install it.

Ensure that Tesseract is added to your system's PATH.
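
One way to confirm the PATH setup from Python is to look up the executable with the standard library. This is a minimal sketch; the `find_tesseract` helper is a name invented for this example and is not part of the application.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH.")
    else:
        print(f"Tesseract found at {path}")
```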

Installation

Clone the repository:

```
git clone https://github.com/yourusername/pdf-data-masking.git
cd pdf-data-masking
```

Create a virtual environment:
```
python -m venv env
```
Activate the virtual environment:
- On Windows:
```
.\env\Scripts\activate
```
- On macOS/Linux:
```
source env/bin/activate
```
Install the required packages:
```
pip install -r requirements.txt
```

Configuration

Create a config.py file in the src directory with the following content:

```
INPUT_PDF_PATH = 'data/input/input.pdf'
OUTPUT_PDF_PATH = 'data/output/output.pdf'
```
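
The two settings are plain path strings. As a hedged sketch of how code might consume them, the helper below checks that the input file exists and creates the output directory if it is missing. `prepare_paths` is a hypothetical name, and the application may handle these paths differently.

```python
from pathlib import Path


def prepare_paths(input_pdf, output_pdf):
    """Validate the input path and create the output directory if needed."""
    input_path = Path(input_pdf)
    output_path = Path(output_pdf)
    if not input_path.is_file():
        raise FileNotFoundError(f"Input PDF not found: {input_path}")
    # Create data/output (or whatever the parent is) before anything is written.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return input_path, output_path


if __name__ == "__main__":
    # Same values as the config.py shown above.
    try:
        print(prepare_paths("data/input/input.pdf", "data/output/output.pdf"))
    except FileNotFoundError as error:
        print(error)
```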

Running the Application

Run the Flask application:
```
python src/app.py
```
Open your web browser and navigate to http://127.0.0.1:5000

Usage

Single File Upload

Select a PDF file to upload.
Choose the language of the PDF.
Optionally, enter custom sensitive words/patterns separated by commas.
Click "Upload".
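
To show how a comma-separated entry could be turned into search patterns, here is a minimal sketch. It assumes each entry is matched as literal, case-insensitive text; whether the application also accepts regular expressions is not stated here. Both function names are hypothetical.

```python
import re


def compile_custom_patterns(raw_input):
    """Split a comma-separated string into case-insensitive literal patterns."""
    terms = [term.strip() for term in raw_input.split(",")]
    # re.escape treats each entry as literal text, an assumption of this sketch.
    return [re.compile(re.escape(term), re.IGNORECASE) for term in terms if term]


def find_custom_matches(text, patterns):
    """Return every substring of text matched by any of the patterns."""
    return [match.group() for pattern in patterns for match in pattern.finditer(text)]


if __name__ == "__main__":
    patterns = compile_custom_patterns("Acme Corp, project-x")
    print(find_custom_matches("ACME CORP launched Project-X.", patterns))
```

Empty entries, such as those left by a trailing comma, are dropped before compilation.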

Multiple Files Upload

Select multiple PDF files to upload.
Choose the language of the PDFs.
Optionally, enter custom sensitive words/patterns separated by commas.
Click "Upload Multiple".
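
Batch processing with progress updates can be pictured as a worker pool that reports back after each file. The sketch below uses a thread pool, which is an assumption about how concurrency might be done. `mask_pdf` and `on_progress` are hypothetical callables, with `on_progress` standing in for whatever emits the Socket.IO event.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, mask_pdf, on_progress, max_workers=4):
    """Run mask_pdf on every path and report progress after each file.

    mask_pdf and on_progress are hypothetical callables supplied by the caller.
    """
    results = {}
    total = len(pdf_paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(mask_pdf, path): path for path in pdf_paths}
        # as_completed yields each future as soon as its file is finished.
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            results[path] = future.result()
            on_progress(done, total, path)
    return results


if __name__ == "__main__":
    def report(done, total, path):
        print(f"{done}/{total} finished: {path}")

    # str.upper is a stand-in for the real masking function.
    process_batch(["a.pdf", "b.pdf"], str.upper, report)
```

Files may finish in any order, so the callback receives a running count instead of an index into the original list.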

Reports

After processing, detailed reports showing the deleted words and their locations will be generated and saved in the reports directory.
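
The report format is not specified here, so the following is only a sketch of one possible shape: a JSON file per PDF listing each deleted word with its page and bounding box. The field names, the `_report.json` suffix, and the `write_report` helper are all assumptions.

```python
import json
from pathlib import Path


def write_report(report_dir, pdf_name, deletions):
    """Write one JSON report listing each deleted word and where it was found."""
    report_dir = Path(report_dir)
    report_dir.mkdir(parents=True, exist_ok=True)
    report_path = report_dir / f"{Path(pdf_name).stem}_report.json"
    payload = {
        "file": pdf_name,
        "deleted_count": len(deletions),
        "deletions": deletions,
    }
    # ensure_ascii=False keeps Chinese and Korean text readable in the file.
    report_path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return report_path


if __name__ == "__main__":
    example = [{"word": "jane.doe@example.com", "page": 1, "bbox": [72, 100, 210, 114]}]
    print(write_report("reports", "input.pdf", example))
```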

Troubleshooting

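Ensure Tesseract is installed and added to your system's PATH; running tesseract --version in a terminal should print a version number.
Verify that all required Python packages are installed by running pip list inside the activated virtual environment.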
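
The package check can also be scripted. This sketch uses `importlib.metadata`, which requires Python 3.8 or later (newer than the 3.7 minimum listed under Prerequisites). The `missing_packages` helper is hypothetical, and the list passed to it should be replaced with the names from requirements.txt.

```python
from importlib import metadata  # available in Python 3.8 and later


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "flask" is used because the application is a Flask app; extend as needed.
    print(missing_packages(["flask"]))
```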

Repository: its-manishks/pdf_data_masking

Folders and files

Latest commit

History

Repository files navigation

PDF Data Masking Application

Features

Prerequisites

Installation

Configuration

Running the Application

Usage

Single File Upload

Multiple Files Upload

Reports

Troubleshooting

Authors

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages