ClaMSA is a tool that uses machine learning to classify sequences that are related to each other via a tree. It takes as input a tree and a multiple sequence alignment (MSA) and outputs the probabilities that the MSA belongs to each of the given classes. It is currently trained and tested to classify sequences of codons (= triplets of DNA characters) as coding (1) or non-coding (0). It builds on TensorFlow and a custom layer for Continuous-Time Markov Chains (CTMC) and trains a set of rate matrices for a classification task.
The image above shows two toy example input MSAs. Synonymous codons, which code for the same amino acid, have the same color.
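To make the notion of synonymous codons concrete, here is a minimal sketch that splits alignment rows into codons and checks whether two codons encode the same amino acid under the standard genetic code. It is not part of ClaMSA: the helper names `split_codons` and `synonymous` and the two-row toy alignment are hypothetical and chosen for illustration only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(row):
    """Split one alignment row into consecutive triplets, dropping a trailing partial codon."""
    usable = len(row) - len(row) % 3
    return [row[i:i + 3] for i in range(0, usable, 3)]


def synonymous(codon_a, codon_b):
    """Return True if both codons are in the table and encode the same amino acid."""
    aa_a = CODON_TABLE.get(codon_a.upper())
    aa_b = CODON_TABLE.get(codon_b.upper())
    return aa_a is not None and aa_a == aa_b


# Hypothetical two-row toy alignment (not taken from the example files).
toy_msa = ["ATGGCTCTG", "ATGGCCTTG"]

# Each column holds the codons of all rows at one alignment position.
columns = list(zip(*(split_codons(row) for row in toy_msa)))
for first, second in columns:
    print(first, second, "synonymous" if synonymous(first, second) else "different")
```

Codons containing gap characters such as `---` are not in the table, so `synonymous` returns False for them in this sketch.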
Python modules
- tensorflow >= 2.0
- biopython
- regex
- newick
- tqdm
- pandas
- protobuf3-to-dict
- matplotlib
- seaborn
Install requirements with
pip3 install tensorflow biopython regex newick tqdm pandas protobuf3-to-dict matplotlib seaborn
Download ClaMSA with
git clone --recurse-submodules https://github.com/Gaius-Augustus/clamsa.git
The commands
cd clamsa
./clamsa.py predict fasta examples/msa.lst --clades examples/example_tree.nwk --use_codons
output the table
path clamsa
examples/msa1.fa 0.9539
examples/msa2.fa 0.1667
Here, the two toy example alignments msa1 and msa2 pictured above are predicted to be coding with probabilities 0.9539 and 0.1667, respectively.
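If the predictions are to be processed further, the table can be read with a few lines of Python. The following is a sketch only: it assumes the output is a whitespace-separated table with the header shown above, the function name `parse_clamsa_table` is hypothetical, and the 0.5 cutoff is an illustrative assumption rather than a recommendation from ClaMSA.

```python
def parse_clamsa_table(text):
    """Parse a whitespace-separated table with a 'path clamsa' header into (path, probability) pairs."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    if header != ["path", "clamsa"]:
        raise ValueError(f"unexpected header: {header}")
    return [(path, float(prob)) for path, prob in rows]


# The table printed by the example command above.
table = """path clamsa
examples/msa1.fa 0.9539
examples/msa2.fa 0.1667
"""

THRESHOLD = 0.5  # assumption for illustration; choose a cutoff that suits your application

predictions = parse_clamsa_table(table)
labels = {path: int(prob >= THRESHOLD) for path, prob in predictions}
for path, prob in predictions:
    print(path, prob, "coding" if labels[path] else "non-coding")
```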
See the usage of prediction for an explanation of the command line structure.
See test/predict.sh for more explanations and a realistic application.
For codon MSA classification we recommend that you construct a tree in the following way:
- Construct a set of codon MSAs just as you would do for prediction. You only need positive examples, i.e. alignments of actual coding sequences. One option to compile such a set is AUGUSTUS-CGP.
- Construct a tree with MrBayes using a codon model as described in the supplementary material of the paper cited below.
Other trees may work, but good performance should only be expected if the tree is scaled to 1 expected codon mutation per time unit.
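The scaling requirement can be illustrated with the usual normalization of a CTMC rate matrix: if Q has stationary distribution pi, the expected number of mutations per time unit is -sum_i pi_i * Q_ii, and dividing Q by this value makes a branch of length t correspond to t expected mutations. The sketch below shows this on a hypothetical 2-state toy matrix (a codon model would have one state per codon). It describes the general convention and is an assumption about what the scaling means here, not ClaMSA's internal code; the function names are illustrative.

```python
def expected_rate(Q, pi):
    """Expected number of state changes per time unit at stationarity: -sum_i pi_i * Q[i][i]."""
    return -sum(p * Q[i][i] for i, p in enumerate(pi))


def normalize_rate_matrix(Q, pi):
    """Rescale Q so that one time unit corresponds to one expected state change."""
    mu = expected_rate(Q, pi)
    if mu <= 0:
        raise ValueError("rate matrix has no positive expected rate")
    return [[q / mu for q in row] for row in Q]


# Hypothetical toy rate matrix; each row sums to zero.
Q = [[-2.0, 2.0],
     [1.0, -1.0]]
# Its stationary distribution (satisfies pi * Q = 0).
pi = [1.0 / 3.0, 2.0 / 3.0]

Q_scaled = normalize_rate_matrix(Q, pi)
print(expected_rate(Q, pi))         # rate before scaling
print(expected_rate(Q_scaled, pi))  # rate after scaling
```

Equivalently, a tree estimated under an unnormalized matrix with expected rate mu can be brought to the same scale by multiplying all of its branch lengths by mu.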
Obtain
- codon alignment training data from a fly, vertebrate and yeast clade in tfrecords format and
- codon alignment test data from vertebrates in fasta format with
cd data
./download_fly_vert_yeast_train.sh
./download_vert_test.sh
ClaMSA can be trained for a classification task on a training set of labeled MSAs.
See test/train.sh for more explanations and the command line that ClaMSA was trained with.
- clamsa predict
- clamsa train
- clamsa convert (MSA conversion)
Most of ClaMSA was written by Darvin Mertsch.
Please cite:
End-to-end Learning of Evolutionary Models to Find Coding Regions in Genome Alignments, Darvin Mertsch and Mario Stanke, Bioinformatics, btac028, published 21 Jan 2022