For this project I created an assembly module with a few of python scripts.
I used de Bruijn graph approach to create contigs from reads. Reads should be provided in a fasta file, and the assembly script will output results also in a fasta file.
The module was tested for python version 3.10
Example usage:
$ ./assembly input.fasta output.fasta
I tried a lot of different methods for prunning the erroneous branches and paths from the graph, and that was probably the hardest step of the project. The approach that finally seems to work is however hardly elegant and could be improved on dramatically.
For error correction the script searches for neighbors of rare (and thus probably erroneous) k-mers in space of strings within given Hamming distance of given k-mer. This Hamming distance can be set optionally while running assembly script using '-m' parameter (2 by default). However this search is based on brute-force generation of all possible strings within given Hamming distance and therefore is slow for Hamming distances higher than 2. Threshold for a number of k-mer occurrences in k-mer spectrum of data can also be set, using '-t' option (2 by default). However k-mers, occurring only once in k-mer spectrum are always dropped if they cannot be corrected.
DeBruijn
class allows for creating a graph from list of reads, hashmap representing k-mer spectrum,
or it can be also initialized from ready weighted adjacency list representation.
Assembly module first parses data from fasta file (eagerly and not lazily so it's very memory consuming), and then corrects the data using information from k-mer spectrum. Corrected data is then used to build de Bruijn graph. Then the graph is prunned (short tips and low-coverage bubbles are removed). After that list of strongly connected components of a graph is created. Each conting is then generated by greedy superstring algorithm, applied to k-mers of each component.
Generated contings are finally written to fasta file.
For more information about optional arguments:
$ ./assembly -h