πŸ” Universal File Deduplicator

Python 3.8+ License: MIT

Find and manage duplicate files across any directory. Fast, accurate, and safe.

Created by Metaphy LLC


✨ Features

  • πŸš€ Fast - Multi-threaded hashing with smart pre-filtering
  • 🎯 Accurate - SHA-256 verification ensures true duplicates only
  • πŸ›‘οΈ Safe - Dry-run mode, interactive confirmation, move-to-trash option
  • πŸ“Š Flexible Output - Text report, JSON, or CSV export
  • πŸ”§ Customizable - Filter by size, extension, exclude directories
  • πŸ“¦ Zero Dependencies - Uses only Python standard library

πŸ“₯ Installation

Option 1: Clone from GitHub

git clone https://github.com/DonkRonk17/file-deduplicator.git
cd file-deduplicator

Option 2: Download directly

Download deduplicator.py and run it directly with Python 3.8+.

Requirements

  • Python 3.8 or higher (uses walrus operator and dataclasses)
  • No external dependencies required!

πŸš€ Quick Start

Basic Scan

# Scan current directory
python deduplicator.py .

# Scan a specific folder
python deduplicator.py /path/to/folder

# Scan with verbose output (shows all file paths)
python deduplicator.py /path/to/folder -v

Filter Files

# Only scan images
python deduplicator.py /photos --ext jpg png gif webp

# Only scan large files (>1MB)
python deduplicator.py /data --min-size 1048576

# Exclude specific directories
python deduplicator.py /project --exclude node_modules .git dist
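Internally, excluding directories during a recursive scan is typically done by pruning `os.walk`'s directory list in place. A minimal sketch of that technique (the helper name `collect_files` is illustrative, not the tool's actual API):

```python
import os

def collect_files(root, exclude=None):
    """Yield file paths under root, skipping any directory whose
    name appears in the exclude set (e.g. node_modules, .git)."""
    exclude = set(exclude or ())
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending
        # into excluded directories at all.
        dirnames[:] = [d for d in dirnames if d not in exclude]
        for name in filenames:
            yield os.path.join(dirpath, name)
```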

Export Results

# Export to JSON
python deduplicator.py /data --json duplicates.json

# Export to CSV (for Excel/spreadsheets)
python deduplicator.py /data --csv duplicates.csv
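If you want to post-process the JSON export in your own scripts, a loader might look like the sketch below. Note that the field names (`groups`, `hash`, `files`) are assumptions about the export schema, not documented here; inspect an actual export before relying on them.

```python
import json

def load_groups(json_path):
    """Return (short_hash, copy_count) per duplicate group.
    The keys 'groups', 'hash', and 'files' are hypothetical --
    verify them against a real export file before use."""
    with open(json_path) as f:
        data = json.load(f)
    return [(g["hash"][:12], len(g["files"])) for g in data.get("groups", [])]
```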

Remove Duplicates

# ALWAYS do a dry run first!
python deduplicator.py /data --delete --dry-run

# Delete duplicates (keeps oldest file by default)
python deduplicator.py /data --delete --keep oldest

# Delete with interactive confirmation
python deduplicator.py /data --delete -i

# Move duplicates to trash folder instead of deleting
python deduplicator.py /data --move ./duplicates_trash

πŸ“– Command Reference

usage: deduplicator.py [-h] [-v] [-r] [--min-size MIN_SIZE] [--max-size MAX_SIZE]
                       [--ext EXT [EXT ...]] [--exclude DIR [DIR ...]]
                       [--threads THREADS] [--json FILE] [--csv FILE]
                       [--delete] [--move FOLDER] [--keep {oldest,newest,first}]
                       [--dry-run] [--interactive] [--version]
                       directory

Arguments:
  directory             Directory to scan

Options:
  -h, --help            Show help message
  -v, --verbose         Show all file paths in report
  -r, --no-recursive    Do not scan subdirectories
  --min-size SIZE       Minimum file size in bytes (default: 1)
  --max-size SIZE       Maximum file size in bytes
  --ext EXT [EXT ...]   Only scan files with these extensions
  --exclude DIR [DIR ...]
                        Directory names to exclude
  --threads N           Number of threads for hashing (default: 4)

Output:
  --json FILE           Export results to JSON file
  --csv FILE            Export results to CSV file

Actions:
  --delete              Delete duplicate files
  --move FOLDER         Move duplicates to specified folder
  --keep {oldest,newest,first}
                        Which file to keep (default: oldest)
  --dry-run             Simulate actions without making changes
  -i, --interactive     Confirm each deletion

πŸ“Š Example Output

╔═══════════════════════════════════════════════════════════════════════════════╗
β•‘                     πŸ” UNIVERSAL FILE DEDUPLICATOR v1.0.0                     β•‘
β•‘                         Metaphy LLC - 2025                                     β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

πŸ” Scanning: /home/user/Documents
   Options: min_size=1, recursive=True

πŸ“ Phase 1: Collecting files...
   Found 15,432 files

⚑ Phase 2: Quick hash filtering...
   1,247 files need full hash verification

πŸ” Phase 3: Full hash verification...

πŸ“Š Phase 4: Building duplicate groups...

══════════════════════════════════════════════════════════════════════════
                    πŸ“‹ DUPLICATE FILE REPORT
══════════════════════════════════════════════════════════════════════════

πŸ”΄ Found 23 groups of duplicate files
   Total duplicates: 47
   Wasted space: 1.23 GB

----------------------------------------------------------------------

πŸ“ Group 1: 156.78 MB Γ— 3 files
   Hash: a1b2c3d4e5f60718...
   Wasted: 313.56 MB
   🟒 /home/user/Documents/backup/video.mp4
   πŸ”΄ /home/user/Documents/projects/video.mp4
   πŸ”΄ /home/user/Downloads/video.mp4

πŸ“ Group 2: 45.32 MB Γ— 2 files
   Hash: 9f8e7d6c5b4a3210...
   Wasted: 45.32 MB
   🟒 /home/user/Documents/photos/vacation.jpg
   πŸ”΄ /home/user/Documents/photos/vacation_copy.jpg

----------------------------------------------------------------------
πŸ“Š SCAN STATISTICS
----------------------------------------------------------------------
   Files scanned:    15,432
   Files hashed:     1,247
   Data scanned:     24.56 GB
   Duplicates found: 47
   Wasted space:     1.23 GB
   Scan time:        12.34 seconds
══════════════════════════════════════════════════════════════════════════

πŸ”’ Safety Features

1. Dry Run Mode

Always use --dry-run first to see what would be deleted:

python deduplicator.py /data --delete --dry-run

2. Interactive Mode

Confirm each file before deletion:

python deduplicator.py /data --delete -i

3. Move Instead of Delete

Move duplicates to a folder for manual review:

python deduplicator.py /data --move ./review_these

4. Keep Strategy

Choose which file to keep:

  • --keep oldest - Keep the file with oldest modification date (default)
  • --keep newest - Keep the most recently modified file
  • --keep first - Keep the first file found during scan

πŸ› οΈ How It Works

  1. Phase 1: Size Grouping - Files are grouped by size. Files with unique sizes can't be duplicates.

  2. Phase 2: Quick Hash - Files of the same size get a quick hash (first 4KB + last 4KB + size). This eliminates most non-duplicates quickly.

  3. Phase 3: Full Hash - Only files with matching quick hashes get full SHA-256 verification.

  4. Phase 4: Report - True duplicates are grouped and sorted by wasted space.

This multi-phase approach makes scanning fast even for large directories.
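The four phases can be sketched roughly as follows. This is a simplified, single-threaded illustration of the technique, not the tool's actual implementation:

```python
import hashlib
import os
from collections import defaultdict

def quick_hash(path, chunk=4096):
    """Cheap pre-filter: hash the size plus the first and last 4KB."""
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(chunk))
        if size > chunk:
            f.seek(-chunk, os.SEEK_END)
            h.update(f.read(chunk))
    return h.hexdigest()

def full_hash(path, chunk=1 << 20):
    """Full SHA-256 over the entire file, read in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    # Phase 1: group by size; files with unique sizes can't be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    # Phase 2: quick hash within each multi-file size group.
    by_quick = defaultdict(list)
    for group in by_size.values():
        if len(group) > 1:
            for p in group:
                by_quick[quick_hash(p)].append(p)
    # Phase 3: full SHA-256 only for files that survive the quick hash.
    by_full = defaultdict(list)
    for group in by_quick.values():
        if len(group) > 1:
            for p in group:
                by_full[full_hash(p)].append(p)
    # Phase 4: report only groups with more than one file.
    return [g for g in by_full.values() if len(g) > 1]
```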


πŸ’‘ Tips

Cleaning Up Downloads

# Find duplicate downloads
python deduplicator.py ~/Downloads --min-size 1048576 -v

# Move duplicates for review
python deduplicator.py ~/Downloads --move ~/Downloads/duplicates

Photo Library Cleanup

# Find duplicate photos
python deduplicator.py ~/Photos --ext jpg jpeg png heic --json photo_dupes.json

# Review the JSON, then delete
python deduplicator.py ~/Photos --ext jpg jpeg png heic --delete --dry-run

Developer Project Cleanup

# Exclude build artifacts
python deduplicator.py ~/projects --exclude node_modules .git __pycache__ dist build

πŸ“„ License

MIT License - see LICENSE for details.



🀝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.


πŸ™ Credits

Created by Randell Logan Smith and Team Brain at Metaphy LLC

Part of the HMSS (Heavenly Morning Star System) ecosystem.

