Find and manage duplicate files across any directory. Fast, accurate, and safe.
Created by Metaphy LLC
- 🚀 Fast - Multi-threaded hashing with smart pre-filtering
- 🎯 Accurate - SHA-256 verification ensures true duplicates only
- 🛡️ Safe - Dry-run mode, interactive confirmation, move-to-trash option
- 📊 Flexible Output - Text report, JSON, or CSV export
- 🔧 Customizable - Filter by size, extension, exclude directories
- 📦 Zero Dependencies - Uses only the Python standard library
git clone https://github.com/DonkRonk17/file-deduplicator.git
cd file-deduplicator

Alternatively, download deduplicator.py and run it directly with Python 3.8+.
- Python 3.8 or higher (uses walrus operator and dataclasses)
- No external dependencies required!
# Scan current directory
python deduplicator.py .
# Scan a specific folder
python deduplicator.py /path/to/folder
# Scan with verbose output (shows all file paths)
python deduplicator.py /path/to/folder -v

# Only scan images
python deduplicator.py /photos --ext jpg png gif webp
# Only scan large files (>1MB)
python deduplicator.py /data --min-size 1048576
# Exclude specific directories
python deduplicator.py /project --exclude node_modules .git dist

# Export to JSON
python deduplicator.py /data --json duplicates.json
# Export to CSV (for Excel/spreadsheets)
python deduplicator.py /data --csv duplicates.csv

# ALWAYS do a dry run first!
python deduplicator.py /data --delete --dry-run
# Delete duplicates (keeps oldest file by default)
python deduplicator.py /data --delete --keep oldest
# Delete with interactive confirmation
python deduplicator.py /data --delete -i
# Move duplicates to trash folder instead of deleting
python deduplicator.py /data --move ./duplicates_trash

usage: deduplicator.py [-h] [-v] [-r] [--min-size MIN_SIZE] [--max-size MAX_SIZE]
[--ext EXT [EXT ...]] [--exclude DIR [DIR ...]]
[--threads THREADS] [--json FILE] [--csv FILE]
[--delete] [--move FOLDER] [--keep {oldest,newest,first}]
[--dry-run] [--interactive] [--version]
directory
Arguments:
directory Directory to scan
Options:
-h, --help Show help message
-v, --verbose Show all file paths in report
-r, --no-recursive Do not scan subdirectories
--min-size SIZE Minimum file size in bytes (default: 1)
--max-size SIZE Maximum file size in bytes
--ext EXT [EXT ...] Only scan files with these extensions
--exclude DIR [DIR ...] Directory names to exclude
--threads N Number of threads for hashing (default: 4)
Output:
--json FILE Export results to JSON file
--csv FILE Export results to CSV file
Actions:
--delete Delete duplicate files
--move FOLDER Move duplicates to specified folder
--keep {oldest,newest,first}
Which file to keep (default: oldest)
--dry-run Simulate actions without making changes
-i, --interactive Confirm each deletion
=================================================================
          🔍 UNIVERSAL FILE DEDUPLICATOR v1.0.0
                   Metaphy LLC - 2025
=================================================================
🔍 Scanning: /home/user/Documents
   Options: min_size=1, recursive=True
📁 Phase 1: Collecting files...
   Found 15,432 files
⚡ Phase 2: Quick hash filtering...
   1,247 files need full hash verification
🔐 Phase 3: Full hash verification...
📊 Phase 4: Building duplicate groups...
==========================================================================
                       DUPLICATE FILE REPORT
==========================================================================
🔴 Found 23 groups of duplicate files
Total duplicates: 47
Wasted space: 1.23 GB
----------------------------------------------------------------------
📁 Group 1: 156.78 MB × 3 files
   Hash: a1b2c3d4e5f60718...
   Wasted: 313.56 MB
   🟢 /home/user/Documents/backup/video.mp4
   🔴 /home/user/Documents/projects/video.mp4
   🔴 /home/user/Downloads/video.mp4

📁 Group 2: 45.32 MB × 2 files
   Hash: 9f8e7d6c5b4a3210...
   Wasted: 45.32 MB
   🟢 /home/user/Documents/photos/vacation.jpg
   🔴 /home/user/Documents/photos/vacation_copy.jpg
----------------------------------------------------------------------
📊 SCAN STATISTICS
----------------------------------------------------------------------
Files scanned: 15,432
Files hashed: 1,247
Data scanned: 24.56 GB
Duplicates found: 47
Wasted space: 1.23 GB
Scan time: 12.34 seconds
==========================================================================
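In the sample report above, a group's wasted space is the file size multiplied by the number of redundant copies (one copy is always kept). A quick sanity check of the Group 1 figures, as a sketch:

```python
def wasted_space(file_size_bytes: int, group_count: int) -> int:
    # One copy in each duplicate group is kept; the rest are redundant.
    return file_size_bytes * (group_count - 1)

# Group 1 from the sample report: 156.78 MB x 3 files
size_bytes = int(156.78 * 1024**2)
print(f"{wasted_space(size_bytes, 3) / 1024**2:.2f} MB")  # prints "313.56 MB"
```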
Always use --dry-run first to see what would be deleted:
python deduplicator.py /data --delete --dry-run

Confirm each file before deletion:

python deduplicator.py /data --delete -i

Move duplicates to a folder for manual review:

python deduplicator.py /data --move ./review_these

Choose which file to keep:

- --keep oldest - Keep the file with the oldest modification date (default)
- --keep newest - Keep the most recently modified file
- --keep first - Keep the first file found during the scan
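The keep policies amount to a small selection function over each duplicate group. A minimal sketch (function and variable names are illustrative, not necessarily the tool's actual internals):

```python
import os

def choose_keeper(paths: list, policy: str = "oldest") -> str:
    # Pick the file to keep from a group of duplicates; the remaining
    # files become candidates for deletion or moving.
    if policy == "oldest":
        return min(paths, key=os.path.getmtime)
    if policy == "newest":
        return max(paths, key=os.path.getmtime)
    return paths[0]  # "first": keep the first file found during the scan
```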
- Phase 1: Size Grouping - Files are grouped by size; files with unique sizes can't be duplicates.
- Phase 2: Quick Hash - Same-size files get a quick hash (first 4KB + last 4KB + size), which eliminates most non-duplicates cheaply.
- Phase 3: Full Hash - Only files with matching quick hashes get full SHA-256 verification.
- Phase 4: Report - True duplicates are grouped and sorted by wasted space.
This multi-phase approach makes scanning fast even for large directories.
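The four phases can be sketched in plain Python using only the standard library. This is a simplified illustration of the technique; function names and buffer sizes here are assumptions, not the tool's actual code:

```python
import hashlib
import os
from collections import defaultdict

def quick_hash(path: str, chunk: int = 4096) -> str:
    # Phase 2 filter: hash first 4KB + last 4KB + size.
    # Cheap to compute, and different for most same-size non-duplicates.
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(chunk))
        if size > chunk:
            f.seek(-min(chunk, size - chunk), os.SEEK_END)
            h.update(f.read(chunk))
    h.update(str(size).encode())
    return h.hexdigest()

def full_hash(path: str) -> str:
    # Phase 3: full SHA-256 over the whole file confirms true duplicates.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths: list) -> list:
    # Phase 1: group by size; unique sizes can't be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    # Phase 2: quick hash only within same-size groups.
    by_quick = defaultdict(list)
    for group in by_size.values():
        if len(group) > 1:
            for p in group:
                by_quick[quick_hash(p)].append(p)
    # Phase 3: full hash only where quick hashes collide.
    by_full = defaultdict(list)
    for group in by_quick.values():
        if len(group) > 1:
            for p in group:
                by_full[full_hash(p)].append(p)
    # Phase 4: groups with more than one file are true duplicates.
    return [g for g in by_full.values() if len(g) > 1]
```

Each phase only does expensive work on the survivors of the previous one, which is why most files never need a full read.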
# Find duplicate downloads
python deduplicator.py ~/Downloads --min-size 1048576 -v
# Move duplicates for review
python deduplicator.py ~/Downloads --move ~/Downloads/duplicates

# Find duplicate photos
python deduplicator.py ~/Photos --ext jpg jpeg png heic --json photo_dupes.json
# Review the JSON, then delete
python deduplicator.py ~/Photos --ext jpg jpeg png heic --delete --dry-run

# Exclude build artifacts
python deduplicator.py ~/projects --exclude node_modules .git __pycache__ dist build

MIT License - see LICENSE for details.
Contributions welcome! Please feel free to submit a Pull Request.
- Author: Logan Smith
- Company: Metaphy LLC
- Email: logan@metaphysicsandcomputing.com
Created by Randell Logan Smith and Team Brain at Metaphy LLC
Part of the HMSS (Heavenly Morning Star System) ecosystem.