A Python automation toolkit for organizing and archiving PDF medical reports. It uses pdfplumber to extract patient data from content and filenames, sorting files into folders with a multi-tiered matching logic. The project features automated, parallel zipping for archiving, progress bars, and detailed logging for skipped files.

R3tr0gh057/Categorizer

Categorizer

A Python automation toolkit for organizing and archiving PDF report files into patient folders. Sorting and zipping are handled by separate Python scripts for clarity and modularity, and a shell script is provided for fast parallel zipping.

Features

  • Path Selection: updated-sorter.py and shell/zipper.sh prompt the user for the required directories; sorter.py and zipper.py read them from config.ini.
  • Automatic File Parsing (sorter.py): Extracts patient names and report dates from PDF filenames using a flexible pattern.
  • Advanced Sorting (updated-sorter.py):
    • Extracts patient name, body part/scan type, and age (from PDF) for robust folder matching.
    • Multi-tiered matching: name → age → body part, to resolve ambiguities.
    • Moves (not copies) files to the matched folder, cleaning up the source directory.
    • Generates a detailed skipped reports log (skipped_reports.txt) with grouped reasons.
  • Date Range Matching (sorter.py): Searches for the correct patient folder within a configurable date range to account for delays between scan and report dates.
  • Progress Bar: Displays a progress bar for file processing using tqdm (Python scripts).
  • Logging & Status Messages: Logs all actions and warnings to both a log file and the console. All major actions also print a status message (success, warning, error, info) to the terminal for real-time feedback.
  • Automatic Zipping & Archiving (zipper.py & shell/zipper.sh):
    • Zips each patient folder (inside month folders) and moves the resulting zip file to a user-specified directory. The original folders remain in place.
    • Checkpointing: Skips patient folders that have already been zipped (if a .zip file with the same name exists in the destination).
    • Parallel Zipping: Uses multiple CPU cores to zip several folders at once for speed (shell/zipper.sh runs 4 at a time by default).

Requirements

  • Python 3.6+ (for Python scripts)
  • tqdm (for progress bar in Python scripts)
  • pdfplumber (for extracting age from PDF in updated-sorter.py)
  • Bash, zip, and xargs (for shell/zipper.sh)

Install Python dependencies with:

pip install -r requirements.txt

Configuration (config.ini)

Before running the scripts, create and edit a config.ini file in the project root with the following structure:

[SORTER]
source_dir = D:\Path\To\PDFs
destination_dir = D:\Path\To\PatientFolders

[ZIPPER]
base_dir = D:\Path\To\PatientFolders
zipped_dir = D:\Path\To\ZippedOutput
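
For illustration, a file with this structure can be read using Python's standard configparser module. The sketch below is not taken from sorter.py or zipper.py; the function name load_paths is hypothetical, and only the section and key names come from the structure above.

```python
import configparser


def load_paths(config_path="config.ini"):
    """Return the four directory settings from config.ini as a plain dict.

    Hypothetical helper: the real scripts may load the file differently.
    """
    # interpolation=None keeps a literal '%' in a path from being treated
    # as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    # read() returns the list of files it parsed; an empty list means
    # the file was missing or unreadable.
    if not parser.read(config_path):
        raise FileNotFoundError(f"Could not read {config_path}")
    return {
        "source_dir": parser["SORTER"]["source_dir"],
        "destination_dir": parser["SORTER"]["destination_dir"],
        "base_dir": parser["ZIPPER"]["base_dir"],
        "zipped_dir": parser["ZIPPER"]["zipped_dir"],
    }
```

Windows paths need no escaping in the file: configparser stores backslashes as written.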

Usage

1. Automated Workflow (run_all.bat)

  • To run both the sorting and zipping steps in sequence, use the provided batch script:
    run_all.bat
  • Ensure your config.ini is set up as described above.
  • The script will run sorter.py and then zipper.py using the paths from config.ini.

2. Advanced Sorting/Categorizing Reports (updated-sorter.py)

  • This script provides advanced matching using patient name, age (from PDF), and body part/scan type (from filename).
  • Interactive: Prompts the user for the source and destination directories at runtime (does NOT use config.ini).
  • Moves files to the matched folder (does not copy).
  • Generates a detailed skipped reports log (skipped_reports.txt) with grouped reasons for each skipped file.
  • Run with:
    python data-fetching/Sorting/updated-sorter.py
  • At the end, check the terminal summary, categorizer.log, and skipped_reports.txt for details.
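
The name → age → body part tiers could be sketched roughly as follows. This is an illustration under assumptions, not the code in updated-sorter.py: the Folder record, the match_folder function, and the idea that each candidate folder already carries a name, age, and body part are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Folder:
    """Hypothetical record describing one patient folder."""
    path: str
    name: str
    age: Optional[int] = None
    body_part: Optional[str] = None


def match_folder(folders: List[Folder], name: str,
                 age: Optional[int] = None,
                 body_part: Optional[str] = None) -> Tuple[Optional[Folder], str]:
    """Return (folder, reason); folder is None when the report is skipped."""
    # Tier 1: patient name (case-insensitive).
    candidates = [f for f in folders if f.name.lower() == name.lower()]
    if not candidates:
        return None, "no folder matches patient name"
    # Tier 2: age, consulted only while more than one candidate remains.
    if len(candidates) > 1 and age is not None:
        by_age = [f for f in candidates if f.age == age]
        candidates = by_age or candidates  # keep the old set if nothing matched
    # Tier 3: body part / scan type, again only to break remaining ties.
    if len(candidates) > 1 and body_part is not None:
        by_part = [f for f in candidates
                   if (f.body_part or "").lower() == body_part.lower()]
        candidates = by_part or candidates
    if len(candidates) == 1:
        return candidates[0], "matched"
    return None, "ambiguous match"
```

Each later tier is used only to break ties left by the previous one, so a report with a unique name match is never rejected because the age could not be read from the PDF.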

3. Sorting/Categorizing Reports (sorter.py)

  • The script reads the source and destination directories from config.ini (see above).
  • Run with:
    python fullauto/sorter.py
  • No interactive input is required; all paths are taken from the config file.

4. Zipping Patient Folders (Python: zipper.py)

  • The script reads the base and zipped directories from config.ini (see above).
  • Run with:
    python fullauto/zipper.py
  • No interactive input is required; all paths are taken from the config file.

5. Zipping Patient Folders (Shell: shell/zipper.sh)

  • Prepare your folders:

    • Use the layout shown under Folder Structure for Zipping: patient folders inside month folders.
    • Requires Bash, zip, and xargs (available on most Unix-like systems).
  • Run the shell zipper:

    bash shell/zipper.sh
    • Enter the path to the base directory containing patient folders (should contain month folders).
    • Enter the path to the directory where zipped folders should be stored.
  • Result:

    • All patient folders (inside all month folders) will be zipped in parallel (4 at a time by default) and saved to the destination directory.
    • Already-zipped folders are skipped (checkpointing).
    • The script prints progress and a final summary to the terminal.

Folder Structure for Zipping

base_dir/
  202405/
    PatientA/
    PatientB/
  202406/
    PatientC/
    PatientD/
  • Only the patient folders (e.g., PatientA, PatientB, etc.) will be zipped.
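
As a rough sketch of how this layout can be walked and zipped in parallel with checkpointing, the example below uses only the standard library. It is not the code in zipper.py: the function names are hypothetical, and the choice of a thread pool with 4 workers is an assumption made for the example.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor


def find_patient_folders(base_dir):
    """Yield patient folders located one level below each month folder."""
    for month in sorted(os.listdir(base_dir)):
        month_path = os.path.join(base_dir, month)
        if not os.path.isdir(month_path):
            continue
        for patient in sorted(os.listdir(month_path)):
            patient_path = os.path.join(month_path, patient)
            if os.path.isdir(patient_path):
                yield patient_path


def zip_patient(patient_path, zipped_dir):
    """Zip one patient folder; skip it if the archive already exists."""
    name = os.path.basename(patient_path)
    target = os.path.join(zipped_dir, name + ".zip")
    if os.path.exists(target):  # checkpoint: already zipped
        return name, "skipped"
    parent = os.path.dirname(patient_path)
    tmp = target + ".part"  # write under a temporary name first
    with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(patient_path):
            for fname in files:
                full = os.path.join(root, fname)
                # Store paths relative to the month folder, e.g. PatientA/x.pdf
                zf.write(full, os.path.relpath(full, parent))
    os.replace(tmp, target)  # an interrupted run never leaves a final .zip
    return name, "zipped"


def zip_all(base_dir, zipped_dir, workers=4):
    """Zip every patient folder and return {patient_name: status}."""
    os.makedirs(zipped_dir, exist_ok=True)
    folders = list(find_patient_folders(base_dir))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: zip_patient(p, zipped_dir), folders))
```

Because the checkpoint compares only the folder name, two patient folders with the same name in different month folders would map to the same archive in this sketch.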

Filename Format (sorter.py and updated-sorter.py)

The sorter scripts expect PDF filenames to contain the patient name and, optionally, the body part/scan type; sorter.py also reads the report date from the filename. Examples:

ANJU NCCT HEAD 25_Jul25.pdf
RESHMA CT REPORT – THORAX (PLAIN AND CONTRAST).pdf
  • updated-sorter.py will extract the name and body part/scan type from the filename, and age from the PDF content.
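
One possible way to split such filenames into a name and a scan description is shown below. It is a heuristic written for this example only: the keyword list SCAN_KEYWORDS and the function parse_report_filename are hypothetical, and the actual pattern used by the sorter scripts may differ.

```python
from pathlib import Path

# Hypothetical keyword list; the real scripts may rely on different rules.
SCAN_KEYWORDS = ("NCCT", "CECT", "HRCT", "CT", "MRI", "USG")


def parse_report_filename(filename):
    """Split a report filename into (patient_name, scan_description).

    Everything before the first scan keyword is treated as the patient
    name; the remainder is returned untouched as the scan description.
    Returns (stem, None) when no keyword is found.
    """
    stem = Path(filename).stem  # drop the .pdf extension
    tokens = stem.split()
    for i, token in enumerate(tokens):
        # i > 0 so that the name is never empty
        if i > 0 and token.upper() in SCAN_KEYWORDS:
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return stem.strip(), None


for example in ("ANJU NCCT HEAD 25_Jul25.pdf",
                "RESHMA CT REPORT – THORAX (PLAIN AND CONTRAST).pdf"):
    print(parse_report_filename(example))
```

Trailing tokens such as 25_Jul25 are left in the description, since their exact meaning is not specified here.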

Skipped Reports Log

  • Both sorter scripts generate a log of skipped files, but updated-sorter.py creates a detailed, grouped report in skipped_reports.txt.
  • Reasons include: ambiguous matches, missing folders, file already exists, parsing errors, etc.
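
A grouped report of this kind can be produced by collecting (filename, reason) pairs and writing one block per reason. The sketch below assumes that input shape and a simple text layout; the function name write_skipped_report and the exact output format are hypothetical, not taken from updated-sorter.py.

```python
from collections import defaultdict


def write_skipped_report(skipped, path="skipped_reports.txt"):
    """Write skipped files grouped by reason and return the grouping.

    skipped: iterable of (filename, reason) pairs.
    """
    grouped = defaultdict(list)
    for filename, reason in skipped:
        grouped[reason].append(filename)
    with open(path, "w", encoding="utf-8") as fh:
        for reason in sorted(grouped):
            files = grouped[reason]
            # One heading per reason, with a count, then the files beneath it
            fh.write(f"{reason} ({len(files)})\n")
            for filename in sorted(files):
                fh.write(f"  {filename}\n")
            fh.write("\n")
    return dict(grouped)
```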

Additional Configuration

  • Date Range (sorter.py): The number of days to search back from the report date is set by DATE_SEARCH_RANGE_DAYS (default: 7 days).
  • Log Files:
    • Sorting: categorizer.log
    • Zipping (Python): zipper.log
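
To make the date range concrete, the sketch below lists the dates that a 7-day look-back would cover and the month folders they fall in. It assumes the range includes the report date itself and that month folders are named YYYYMM as in the folder structure above; the helper names are hypothetical, and only DATE_SEARCH_RANGE_DAYS comes from sorter.py.

```python
from datetime import date, timedelta

DATE_SEARCH_RANGE_DAYS = 7  # default stated above


def candidate_dates(report_date, days=DATE_SEARCH_RANGE_DAYS):
    """Dates to check, from the report date back to `days` days earlier."""
    return [report_date - timedelta(days=n) for n in range(days + 1)]


def candidate_month_folders(report_date, days=DATE_SEARCH_RANGE_DAYS):
    """Month folder names (YYYYMM) touched by the search window, newest first."""
    months = []
    for d in candidate_dates(report_date, days):
        label = d.strftime("%Y%m")
        if label not in months:
            months.append(label)
    return months


# A report dated 3 June 2024 reaches back into May, so both month
# folders would need to be searched.
print(candidate_month_folders(date(2024, 6, 3)))
```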

Troubleshooting

  • Ensure the source, destination, and zipped directories exist and are accessible.
  • The scripts will log warnings if they cannot parse a filename, find a matching folder, or zip a folder.
  • If you do not see zipped folders in your specified directory, check for errors in the terminal or log file.
  • If you rerun zipper.py or shell/zipper.sh, they will skip folders that have already been zipped.

License

See LICENSE.
