Skip to content

A CLI genome assembler, based on de Bruijn graph method, written in Python

License

Notifications You must be signed in to change notification settings

jakubguzek/genome-assembler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Assembly module

For this project I created an assembly module with a few of python scripts.

I used de Bruijn graph approach to create contigs from reads. Reads should be provided in a fasta file, and the assembly script will output results also in a fasta file.

The module was tested for python version 3.10

Example usage:

$ ./assembly input.fasta output.fasta

I tried a lot of different methods for prunning the erroneous branches and paths from the graph, and that was probably the hardest step of the project. The approach that finally seems to work is however hardly elegant and could be improved on dramatically.

For error correction the script searches for neighbors of rare (and thus probably erroneous) k-mers in space of strings within given Hamming distance of given k-mer. This Hamming distance can be set optionally while running assembly script using '-m' parameter (2 by default). However this search is based on brute-force generation of all possible strings within given Hamming distance and therefore is slow for Hamming distances higher than 2. Threshold for a number of k-mer occurrences in k-mer spectrum of data can also be set, using '-t' option (2 by default). However k-mers, occurring only once in k-mer spectrum are always dropped if they cannot be corrected.

DeBruijn class allows for creating a graph from list of reads, hashmap representing k-mer spectrum, or it can be also initialized from ready weighted adjacency list representation.

Assembly module first parses data from fasta file (eagerly and not lazily so it's very memory consuming), and then corrects the data using information from k-mer spectrum. Corrected data is then used to build de Bruijn graph. Then the graph is prunned (short tips and low-coverage bubbles are removed). After that list of strongly connected components of a graph is created. Each conting is then generated by greedy superstring algorithm, applied to k-mers of each component.

Generated contings are finally written to fasta file.

For more information about optional arguments:

$ ./assembly -h

About

A CLI genome assembler, based on de Bruijn graph method, written in Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages