NCDSearch is a grep-like tool to find similar source code fragments in files. In software development, developers may copy and modify source code fragments to reuse existing functionalities in code. This tool is developed to detect such code fragments that are not identified by a simple search tool like grep.
For example, if a line if (this.distance < another.distance) return true;
(in source code of the tool) is given as a query, the tool reports similar lines such as
if (this.distance > another.distance) return false;
and if (thislen > anotherlen) return true;
.
You can find two bug fix examples in a company in our paper published in ICSME 2018 Industry Track.
Takashi Ishio, Naoto Maeda, Kensuke Shibuya, Katsuro Inoue: Cloned Buggy Code Detection in Practice Using Normalized Compression Distance. In proceedings of ICSME 2018, pp.591-594, November 2018.
The tool supports famous similarity metrics such as Normalized Compression Distance and Normalized Levenshtein Distance. The default search strategy is Lempel-Ziv Jaccard Distance (proposed in https://arxiv.org/abs/1708.03346). It is significantly faster than existing NCD, while it keeps similar output to NCD using the Deflate (zip) algorithm for code clone detection. The details are described in our technical paper.
Takashi Ishio, Naoto Maeda, Kensuke Shibuya, Kenho Iwamoto, Katsuro Inoue, NCDSearch: Sliding Window-Based Code Clone Search Using Lempel-Ziv Jaccard Distance. IEICE Transactions on Information and Systems, vol.E105-D, No.5, May 2022.
The project uses Maven with Maven Assembly Plugin.
The following command builds a runnable jar ncdsearch.jar
:
mvn package
The project includes Eclipse project file and pom.xml
for dependencies.
The main class is ncdsearch.SearchMain
.
The repository includes Lexer files generated by ANTLR4.
The grammar files are maintained in another repository (https://github.com/takashi-ishio/sarf-lexer)
Online usage example is available at:
- (English) https://colab.research.google.com/drive/1RRQS8jQWMIi2fdXmiTBcbwNB3FeZ74Tr
- (Japanese) https://colab.research.google.com/drive/1eL_92HCCWfVNPRPfc5msi7H_-v5jsx-G
The following table is a list of major options.
Option | Description |
---|---|
-e [query] |
Specify a query. All arguments after the option are regarded as a query. |
-q [filename] |
Read a query from a specified file. - represents a standard input. |
-sline [line-number] |
Specify a start line number of a query in the file of -q option. |
-eline [line-number] |
Specify an end line number of a query in the file of -q option. |
-lang [language] |
Specify a programming language of a query. Required for -e option. |
-th [threshold] |
Specify a threshold of the distance. The default is 0.5. |
-thread [num] |
Specify the number of threads for concurrent search. |
-encoding [encoding] |
Specify a text encoding of source files. The default encoding is UTF-8. |
-l [filelist] |
Search files listed in the specified text file (one file name per line) |
-git [git-dir] |
Search files in the specified git repository instead of normal files. |
-v |
Show configuration and progress. |
-json |
Enable a JSON format report. |
-pos |
Report the detected source code locations in detail. |
-a [algorithm] |
Specify an algorithm to compute a distance. The default is lzjd . |
-link [style] |
If one of eclipse , vscode , and fileurl is given, file names are printed using a clickable format on particular environments. The default is none (just a file name). |
-testconfig |
This option does not execute a search but print the current configuration. |
You can input code fragments using STDIN, a query file, or command line arguments.
java -jar ncdsearch.jar dir_or_file -lang java -q - < query
java -jar ncdsearch.jar dir_or_file -lang java -q query.txt
java -jar ncdsearch.jar dir_or_file -lang java -q query.txt -sline 10 -eline 20
java -jar ncdsearch.jar dir_or_file -lang java -e my code fragments
- The
-q
option specifies a file for a code fragment.-sline
and-eline
specifies lines of code in a file as a query. In the third example, lines 10 through 20 of the query.txt file are selected as a query.- If a file name is "-", the tool reads a query from the standard input.
- The
-e
option directly specifies a code fragment in command line arguments. Note that all the arguments after-e
are regarded as a query code snippet.
You can specify multiple directories or files to be searched.
java -jar ncdsearch.jar dir1 dir2 -lang java < query
The tool recognizes a programming language by either -lang
option or a file name of -q
option. The -lang
option overrides -q
option's file name.
It should be noted that the tool reports all similar code fragments by default.
If you would like to see only exact matches, you can use -a tld
option that uses Token-level Levenshtein Distance instead of the default strategy.
java -jar ncdsearch.jar dir_or_file -lang java -e public static void main(String[] args)
The tool reports a result in a CSV format by default. For example, an execution with a query:
java -jar ncdsearch.jar -lang java -e "if (this.distance > another.distance) return true;"
would report a line like this:
path/to/src/ncdsearch/Fragment.java,81,81,0.10714285714285714
Each line of an output represents a similar source code fragment detected by the tool.
- The first column is the file name including the code fragment.
- The second and third columns indicate the lines of the first and last tokens of the fragment.
- You may specify
-pos
option to extract char positions in the lines.
- You may specify
- The last column indicates the normalized compression distance between the query and the code fragment. Since it is a distance, more similar code fragments have smaller values.
According to the report, you may find a similar line of code in a file. For example:
81: if (this.distance < another.distance) return true;
The tool also supports a JSON format. Add -json
option to use the format.
The -pos
option with the json option reports source code tokens in addition to locations.
The tool assumes UTF-8 by default.
Please specify -encoding
option to choose a charset, e.g. -encoding UTF-16
.
A list of supported encodings is dependent on a platform.
A list for Oracle Java SE is available at: https://docs.oracle.com/javase/jp/8/docs/technotes/guides/intl/encoding.doc.html
The -lang
option specifies a programming language.
The tool accepts file extensions: java
(Java), c
(C/C++), cs
(C#), js
(JavaScript), cbl
(Cobol), vb
(Visual Basic 6) and txt
(plain text).
It also accepts ccfxprep
files that are generated by the CCFinderX preprocessor.
A programming language option activates a lexical analysis to extract tokens. It ignores white space and comments in the specified language. The plain text mode regards a single line as a single token but ignores indentation (leading white space and trailing white space of lines).
Each language option automatically searches source files using the following extensions.
- C/C++: .c, .cc, .cp, .cpp, .cx, .cxx, .c+, .c++, .h, .hh, .hxx, .h+, .h++, .hp, .hpp
- Java 8: .java
- JavaScript: .js
- C#: .cs
- Python: .py
- COBOL: .cbl
- Visual Basic 6 and .NET (experimental): .vb
- Plain Text: .txt, .html, .md
- CCFinderX: .ccfx, .ccfinderx, .ccfxprep
You can include additional files using -i (extension)
option.
For example, the following command searches .java files as plain text.
java -jar ncdsearch.jar dir_or_file -lang txt -i .java -e "// line comment"
You can use multiple -i
options in a command line to search additional files.
You also can use -l [filename]
option that specifies a text file including a list of file names.
For efficiency, the tool compares a query with sampled lines of code by default. It is fast, but may result in false negatives.
If your query is small enough, you should specify -full
option that checks all tokens so that you can get more results.
java -jar ncdsearch.jar dir_or_file -lang java -full -e identifier
If a result is different from your expectation, you can try -v
to see the configuration and progress of the search.
java -jar ncdsearch.jar dir_or_file -lang java -v < query
The program uses N working threads if -thread N
option is provided.
The multi-threading execution processes N files in parallel.
File locations in the output may be differently ordered in each execution.
Although N can be an arbitrary number (e.g. 2, 4, or 8), an effective value of N is dependent on available CPU resources. A larger amount of memory is also required to store N files in memory at once.
You can specify an algorithm using -a
option.
This tool supports the following algorithms.
lzjd
: Lempel-Ziv Jaccard Distance. The default strategy.clzjd
: Character-level Lempel-Ziv Jaccard Distance. It internally useschar
instead ofbyte
for comparison. This may be important to find similar multi-byte strings.tld
: Token-level Levenshtein Distance. A distance is measured by the number of add/delete/modify operations to transform a query string to a code fragment in source code.ntld
: Normalized Token-level Levenshtein Distance. A distance is a variant of Token-level Levenshtein Distance normalized by the length of a query string.nbld
: Normalized Byte-level Levenshtein Distance. A distance is measured by the number of different bytes, normalized by the length of a query string..blcs
: Normalized Byte-level Longest Common Subsequence. A distance is measured by the length of common byte subsequence normalized by the length of a query string.zip
: Normalized Compression Distance with Deflate algorithm that has been used in gzip.xz
,zstd
,bzip2
,snappy
,folca
: Normalized Compression Distance with the corresponding compression algorithm. These algorithms are provided just for experiments to see an impact of compression algorithms.
Using the -a zip
option, you can use a traditional Normalized Compression Distance (https://en.wikipedia.org/wiki/Normalized_compression_distance) defined as follows:
NCD(x, y) = (Z(xy)-min(Z(x), Z(y)) / max(Z(x), Z(y))
where Z(x), Z(y), and Z(xy) are data size obtained by a data compression algorithm (Deflate in case of the zip
option).
If two data x
and y
are similar, then NCD(x, y) results in a small value.
The -a ntld
option uses Normalized Levenshtein Distance on tokens, i.e. the ratio of added, removed, and modified tokens, as a distance metric.
The -a tld
option uses Levenshtein Distance on tokens without normalization.
It simply counts the number of added, removed, and modified tokens and reports a code fragment whose distance is at most a given threshold.
For example, -q "a < b" -th 1 -full
matches a > b
and a < c
.
The evaluation-dataset/cbcd-dataset.json
file includes 53 queries created from the information in J. Li and M. D. Ernst: CBCD: Cloned Buggy Code Detector.
As we found several errors in the paper, we manually checked the version history of three projects and revised the queries and answers.
Each object includes a query source code fragment and code clones that should be detected.
query
: A file name including the query content.queryloc
: The source code location of the query.path
shows a project name (postgres/git/linux) and a commit ID including the code.file
,sline
, andeline
represent the file path, start line number and end line number in the version.
type
: Code clone type.path
: A project name and a commit ID including code clones.answers
: A list of source code locations. The format is the same asqueryloc
.
The source code of this project is available under the MIT License. The full license description is included in the LICENSE file.
This project uses the following components.
- GNU Trove (https://bitbucket.org/trove4j/trove): LGPL License
- ANTLR4 (http://www.antlr.org/license.html): BSD License
- XZ (https://tukaani.org/xz/java.html): public domain
- ZStd (https://github.com/luben/zstd-jni): 3-clause BSD License
Grammar files for ANTLR4 come from https://github.com/antlr/grammars-v4/ except for COBOL. C/C++ grammars are slightly edited so that the tool can recognize macro directives as tokens (see the header comment). COBOL grammar is obtained from ProLeap COBOL Parser (https://github.com/uwol/proleap-cobol-parser) that is distributed under MIT License. VisualBasic6 grammar is obtained from ProLeap VB6 Parser (https://github.com/uwol/vb6parser) that is distributed under MIT License.