| 1 | -# LinearTurboFold |
| 1 | +# LinearTurboFold |
| 2 | + |
| 3 | +This repository contains the C++ source code for LinearTurboFold, an end-to-end linear-time algorithm for structural alignment and conserved structure prediction of RNA homologs. It is the first joint-fold-and-align algorithm to scale to full-length SARS-CoV-2 genomes without imposing any constraints on base-pairing distance. |
| 4 | + |
| 5 | +[LinearTurboFold: Fast Folding and Alignment for RNA Homologs with Applications to Coronavirus](https://www.biorxiv.org/content/10.1101/2020.11.23.393488v2) |
| 6 | + |
| 7 | +Sizhen Li, He Zhang, Liang Zhang, Kaibo Liu, Boxiang Liu, David Mathews*, Liang Huang* |
| 8 | + |
| 9 | +\* corresponding author |
| 10 | + |
| 11 | +# Dependencies |
| 12 | +gcc 4.8.5 or above; <br> |
| 13 | +Python 2.7 |
| 14 | + |
| 15 | +# Compile |
| 16 | +``` |
| 17 | +make |
| 18 | +``` |
| 19 | + |
| 20 | +# Run |
| 21 | +LinearTurboFold can be run with: |
| 22 | +``` |
| 23 | +./linearturbofold -i input.fasta -o output_dir [OPTIONS] |
| 24 | +``` |
| 25 | +The input file should be in the FASTA format. Please see [input.fasta](input.fasta) as an example. <br> |
| 26 | +The program writes a multiple sequence alignment and the predicted secondary structures to the output directory. |
| 27 | + |
| 28 | +### OPTIONS |
| 29 | +`--it` |
| 30 | +The number of iterations (default 3). <br> |
| 31 | +`--b1` |
| 32 | +The beam size for LinearAlignment (default 100, set 0 for infinite beam). <br> |
| 33 | +`--b2` |
| 34 | +The beam size for LinearPartition (default 100, set 0 for infinite beam). <br> |
| 35 | +`--pf` |
| 36 | +Save partition functions for all the sequences after the last iteration (default False). <br> |
| 37 | +`--bpp` |
| 38 | +Save base pair probabilities for all the sequences after the last iteration (default False). <br> |
| 39 | +`-v` |
| 40 | +Print out alignment, folding and runtime information (default False). <br> |
| 41 | +`--th` |
| 42 | +Set the ThreshKnot threshold (default 0.3). <br> |
| 43 | +`--tkit` |
| 44 | +Set ThreshKnot iterations (default 1). <br> |
| 45 | +`--tkhl` |
| 46 | +Set ThreshKnot minimum helix length (default 3). <br> |
| 47 | + |
| 48 | +### Example |
| 49 | +``` |
| 50 | +./linearturbofold -i input.fasta -o rets/ --pf --bpp |
| 51 | +100% [==================================================] |
| 52 | +3 iterations Done! |
| 53 | +Outputing partition functions to files ... |
| 54 | +Outputing base pair probabilities to files ... |
| 55 | +Outputing multiple sequence alignment to rets/output.aln... |
| 56 | +Outputing structures to files ... |
| 57 | +``` |
| 58 | + |
| 59 | +# Evaluation Dataset |
| 60 | +We used the [RNAStralign](https://rna.urmc.rochester.edu/publications.html) dataset, which has known alignments and structures, to evaluate LinearTurboFold and the benchmark methods. |
| 61 | + |
| 62 | +# SARS-CoV-2 Dataset and Results |
| 63 | +The 25 SARS-CoV-2 and SARS-related genomes analyzed in the paper are listed in [samples25.fasta](data/sars-cov-2_data/samples25.fasta). <br> |
| 64 | +For further study by experts, |
| 65 | +we provide the whole multiple sequence alignment and predicted structures for all genomes from LinearTurboFold in [sars-cov-2_and_sars-related_25_genomes_msa_structures.txt](sars-cov-2_rets/sars-cov-2_and_sars-related_25_genomes_msa_structures.txt). <br> |
| 66 | +Each genome occupies three lines: the sequence name, the aligned sequence, and the aligned structure. |
| 67 | + |
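The three-lines-per-genome layout described above can be read with a few lines of Python. This is only a minimal sketch and not part of the repository: the names `AlignedGenome` and `parse_msa_structures` are hypothetical, it targets Python 3, and it assumes that blank lines can be skipped and that each aligned sequence has the same length as its aligned structure. The toy records at the bottom are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class AlignedGenome(NamedTuple):
    """One record of the results file (hypothetical helper type)."""
    name: str        # sequence name line, kept exactly as written
    sequence: str    # aligned sequence, gap characters included
    structure: str   # aligned structure for the same columns


def parse_msa_structures(lines: Iterable[str]) -> List[AlignedGenome]:
    """Group non-blank lines into (name, sequence, structure) triples."""
    # Assumption: blank lines carry no information and can be dropped.
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per genome, got %d lines" % len(rows))

    records = []
    for start in range(0, len(rows), 3):
        name, sequence, structure = rows[start:start + 3]
        # Assumption: aligned sequence and structure cover the same columns.
        if len(sequence) != len(structure):
            raise ValueError("length mismatch in record %r" % name)
        records.append(AlignedGenome(name, sequence, structure))
    return records


if __name__ == "__main__":
    # Made-up toy input, not taken from the real results file.
    demo = ["genome_A", "AC-GU", "(.-.)", "genome_B", "ACCGU", "(...)"]
    for record in parse_msa_structures(demo):
        print(record.name, len(record.sequence))
```

To read the real file, pass an open file handle instead of the toy list, for example `parse_msa_structures(open(path))`, where `path` points at the results file linked above.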