Skip to content

prefix dereplication

Frédéric Mahé edited this page Aug 30, 2016 · 1 revision

To make it efficient, prefix dereplication works in a way that may sound complicated to some person. The goal of that page is to clarify the algorithm.

The query sequence is truncated, until a match is found. It first tries to find a full length match (ordinary full-length dereplication), if it cannot it will truncate the query, one nucleotide at a time until it finds a match or until it is too short (i.e. at least the length of the shortest sequence in the dataset). Each time it will compare the query with all the (full-length) sequences in the database using hashing. If a truncated query sequence matches one of the database sequences it means that the database sequence is a proper prefix of the query; there are no mismatches.

It is important to notice the following:

  • all sequences in the database are full-length sequences (no prefixes)
  • the query sequence is truncated, one nucleotide at a time, but just for use as a query against the database
  • the full length query is always inserted into the database
  • any matching sequence in the database (equal to a prefix of the query) is removed from the database
  • the data structure with cluster information is updated based on the matching sequences, it keeps track of the full-length or prefix replicates of each database sequence
  • the database is initially empty
  • the input sequences are sorted by length, used as queries, starting with the shortest

Here is a short example with the succession of logical operations:

input = ACGT and ACGTA (2 sequences)

- hash table is empty,
- continue with the next input sequence,
- ACGT is defined as the minimum_length,
- ACGT is looked for in the hash table,
- no hit,
- is len(ACGT) > minimum_length ? no,
- add ACGT to the hash table,
- continue with the next input sequence,
- ACGTA is looked for in the hash table,
- no hit,
- is len(ACGTA) > minimum_length ? yes,
- truncate ACGTA by one nucleotide,
- ACGT is looked for in the hash table,
- hit,
- add ACGTA to the hash table,
- remove ACGT from the hash table

Proceeding as such guarantees that a sequence is always merged with the closest possible longer sequence.