
Difference between abundance-based and distance-based greedy clustering (AGC vs. DGC)

Frédéric Mahé edited this page Jan 6, 2017 · 1 revision

AGC (abundance-based greedy clustering) assigns a new sequence to the most abundant centroid when several centroids exist within the given similarity threshold (e.g. 97%). DGC (distance-based greedy clustering) assigns a new sequence to the closest (most similar) centroid when several centroids exist within that threshold.
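The difference between the two rules can be illustrated with a minimal Python sketch. This is not VSEARCH code: the `Centroid` class, the `pick_centroid` function, the example values and the tie-breaking rules are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Centroid:
    # Hypothetical record for illustration; not a VSEARCH data structure.
    name: str
    abundance: int
    similarity: float  # similarity to the query sequence, from 0.0 to 1.0


def pick_centroid(candidates, threshold=0.97, method="DGC"):
    """Return the centroid the query joins, or None if no centroid qualifies."""
    # Only centroids within the similarity threshold are eligible.
    eligible = [c for c in candidates if c.similarity >= threshold]
    if not eligible:
        return None
    if method == "AGC":
        # Most abundant eligible centroid; ties broken by similarity (assumption).
        return max(eligible, key=lambda c: (c.abundance, c.similarity))
    if method == "DGC":
        # Most similar eligible centroid; ties broken by abundance (assumption).
        return max(eligible, key=lambda c: (c.similarity, c.abundance))
    raise ValueError(f"unknown method: {method}")


candidates = [
    Centroid("A", abundance=500, similarity=0.975),
    Centroid("B", abundance=20, similarity=0.990),
    Centroid("C", abundance=900, similarity=0.950),  # below the 97% threshold
]

print(pick_centroid(candidates, method="AGC").name)  # A
print(pick_centroid(candidates, method="DGC").name)  # B
```

Centroid C is the most abundant overall but is never chosen, because it falls outside the threshold under both rules.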

For more details about AGC vs. DGC please see this paper by Schloss.

DGC is the default in VSEARCH; AGC can be turned on by specifying the --sizeorder option. However, AGC only works when the --maxaccepts option is specified with an argument larger than 1. VSEARCH uses heuristics to find the approximately most similar sequences first and then considers a number of them in detail (as many as specified with --maxaccepts). Among those accepted sequences, the most abundant centroid is chosen if --sizeorder is turned on. Because the search is heuristic, the algorithm cannot guarantee that it makes the optimal choice.
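The following Python sketch shows why --sizeorder has no effect unless --maxaccepts is larger than 1. It is a simplified model under stated assumptions, not the actual VSEARCH implementation: the `assign` function and the hit list are hypothetical, and the heuristic search is reduced to a pre-ordered list of candidates.

```python
def assign(hits, threshold=0.97, maxaccepts=1, sizeorder=False):
    """Pick a centroid name from hits, or None if no hit is accepted.

    hits: list of (name, abundance, similarity) tuples, in the order a
    heuristic search might return them (approximately most similar first).
    This ordering is an assumption of the sketch.
    """
    accepted = []
    for name, abundance, similarity in hits:
        if similarity >= threshold:
            accepted.append((name, abundance, similarity))
            # Stop examining candidates once enough have been accepted.
            if len(accepted) == maxaccepts:
                break
    if not accepted:
        return None
    if sizeorder:
        # AGC: most abundant among the accepted centroids.
        best = max(accepted, key=lambda h: h[1])
    else:
        # DGC: most similar among the accepted centroids.
        best = max(accepted, key=lambda h: h[2])
    return best[0]


hits = [
    ("B", 20, 0.990),
    ("A", 500, 0.975),
    ("D", 5000, 0.972),
]

print(assign(hits, maxaccepts=1, sizeorder=True))   # B
print(assign(hits, maxaccepts=2, sizeorder=True))   # A
print(assign(hits, maxaccepts=3, sizeorder=True))   # D
print(assign(hits, maxaccepts=3, sizeorder=False))  # B
```

With a single accepted hit there is nothing to choose between, so the abundance-based rule returns the same centroid as the distance-based rule. The result under AGC also depends on how many candidates are examined, which is one way the heuristic can miss the optimal choice.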

The --sizeorder option only works with the clustering commands (--cluster_fast, --cluster_smallmem and --cluster_size), and no other command.