Update README.md
ava1ar committed Sep 18, 2014
1 parent 797389b commit b9678fa
Showing 1 changed file with 14 additions and 1 deletion.
15 changes: 14 additions & 1 deletion README.md
@@ -1,7 +1,7 @@
DupsFinder
==========

Search for duplicate files in the specified folder recursively. Files are first compared by size. For files of identical size, a hash of the first 1024 bytes is calculated and compared; if those hashes are equal, hashes of the complete files are calculated and compared again. Files with identical hashes are listed as duplicates.
Search for duplicate files in the specified folder recursively.

Written in Java without additional dependencies. Requires Java 7+ and Maven to build. SHA-1 is used as the hashing algorithm.

@@ -26,3 +26,16 @@ Output format
-------------

{SHA-1 sum}:{number of duplicates}:{file size in bytes}:"{file full path}"
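
For illustration, a line in this format could be produced as follows (all values here are hypothetical; the hash shown is the SHA-1 of an empty file):

```java
// Hypothetical values illustrating the output line layout described above.
public class OutputFormat {
    public static void main(String[] args) {
        String line = String.format("%s:%d:%d:\"%s\"",
                "da39a3ee5e6b4b0d3255bfef95601890afd80709", // SHA-1 sum
                2,                                          // number of duplicates
                0,                                          // file size in bytes
                "/tmp/empty.txt");                          // file full path
        System.out.println(line);
    }
}
```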

Implementation details
----------------------

The main idea behind the application is a sequence of grouping operations: files are grouped by a file property, and files with a unique property value are dropped, so the files remaining in each group share that property value. File size, the hash of the first 1024 bytes, and the hash of the complete file are used as the grouping properties. Each grouping operation discards unique files and passes the rest to the next one, so at the end only groups of identical files remain.
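
The grouping idea can be sketched as a small generic helper (a simplified illustration, not the project's actual code; it uses newer Java conveniences for brevity):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch: group items by a property, then drop single-item (unique) groups,
// leaving only groups whose members share the property value.
public class GroupingSketch {
    static <T, K> Collection<List<T>> groupAndDropUnique(Collection<T> items,
                                                         Function<T, K> property) {
        Map<K, List<T>> groups = items.stream().collect(Collectors.groupingBy(property));
        groups.values().removeIf(g -> g.size() < 2); // unique value => no duplicate possible
        return groups.values();
    }

    public static void main(String[] args) {
        // Grouping strings by length: "a" has a unique length and is dropped.
        System.out.println(groupAndDropUnique(List.of("aa", "bb", "a"), String::length));
    }
}
```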

During the first step, the program creates a collection of sets of files with equal file size. All sets containing only a single item (files with a unique size) are dropped; the remaining sets are passed to the second step as input.
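
A minimal sketch of this step, assuming the list of paths has already been collected by a recursive walk (the method name is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the first step: group paths by file size and drop files
// with a unique size, since they cannot have a duplicate.
public class SizeGrouping {
    static Collection<List<Path>> groupBySize(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }
        bySize.values().removeIf(g -> g.size() < 2); // unique size => drop
        return bySize.values();
    }
}
```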

During the second step, the program creates a collection of sets of files with an equal hash of the first 1024 bytes. For performance, hash calculation and grouping are performed in multiple threads. All sets containing only a single item are dropped; the remaining sets are passed to the third step as input.
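
The partial hash could look like this (a sketch assuming SHA-1 and the README's 1024-byte block size; the method name is hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the second-step hash: SHA-1 over at most the first 1024 bytes.
public class PartialHash {
    static String sha1OfHead(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[1024];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.read(buf); // reads up to 1024 bytes; -1 for an empty file
            if (read > 0) md.update(buf, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```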

During the third step, the program creates a collection of sets of files with an equal hash of the complete file. For performance, hash calculation and grouping are performed in multiple threads, and the hash is not recalculated for files of 1024 bytes or less: the partial hash calculated in the second step is reused for such files instead. All sets containing only a single item are dropped; the files remaining in each group are identical.
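
The small-file optimization can be sketched as follows (method names are hypothetical; `headSha1` stands for the partial hash carried over from the second step):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the third step: hash the whole file, except when the file is
// 1024 bytes or less, in which case the second-step partial hash already
// covers the entire content and can be reused as-is.
public class FullHash {
    static String sha1(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) > 0; ) md.update(buf, 0, n);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    static String fullSha1(Path file, String headSha1)
            throws IOException, NoSuchAlgorithmException {
        if (Files.size(file) <= 1024) return headSha1; // partial hash == full hash
        try (InputStream in = Files.newInputStream(file)) {
            return sha1(in);
        }
    }
}
```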

After the third step, the collection of sets of identical files is used to print the output.
