MANIAC stands for MMseqs2-based Average Nucleotide Identity Accurate Calculator. It is a bioinformatics pipeline, written using Snakemake, for rapid and accurate calculation of average nucleotide identity (ANI) and alignment fraction (AF) between viral genomes. The goal of MANIAC is to provide a user-friendly and efficient tool for researchers in genomics, bioinformatics, and virology. MANIAC has been developed and optimised for bacteriophages but in principle can be used on any microbial genomes.
- High throughput: MANIAC can efficiently process large datasets (thousands) of viral genomes.
- Accurate: Uses MMseqs2 to ensure accurate calculation of average nucleotide identity (ANI) and alignment fraction (AF).
- Comprehensive: Provides analysis at both nucleotide and amino-acid level.
- User-friendly: Easy-to-use Snakemake workflow.
- Reproducible: Conda-based installation support ensures reproducibility.
The standard and quickest way of calculating ANI is based on the approach proposed by Goris et al. for bacterial genomes [1]. Specifically, each query is chopped into short fragments of a pre-defined length (by default 1020 nt). Then, each fragment is aligned with the subject and the best hit is retained, but only if the query coverage is at least 70% and the sequence identity is at least 30% across the entire query length. ANI is then taken as the mean percentage identity of all aligned fragments, and query AF is calculated as the ratio of the aligned query length (i.e., the summed length of all aligned fragments) to the full query length.
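The arithmetic of the fragment-based calculation can be sketched in a few lines of Python. This is only an illustration of the description above, not MANIAC's actual code: the `FragmentHit` record and the `fragment_ani` function are hypothetical names, and identity and coverage are assumed to be given as fractions between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class FragmentHit:
    """Best hit of one query fragment against the subject (hypothetical record)."""
    fragment_len: int  # length of the query fragment in nt
    identity: float    # sequence identity as a fraction (assumed 0..1)
    coverage: float    # query coverage as a fraction (assumed 0..1)


def fragment_ani(hits, query_len, min_identity=0.3, min_coverage=0.7):
    """Return (ANI, query AF) computed from the best hits of the query fragments."""
    # Keep only fragments whose best hit passes both thresholds.
    kept = [h for h in hits
            if h.identity >= min_identity and h.coverage >= min_coverage]
    if not kept:
        return None, 0.0
    # ANI: mean identity of the aligned fragments.
    ani = sum(h.identity for h in kept) / len(kept)
    # AF: summed length of the aligned fragments over the full query length.
    af = sum(h.fragment_len for h in kept) / query_len
    return ani, af


hits = [
    FragmentHit(1020, 0.95, 0.98),
    FragmentHit(1020, 0.85, 0.90),
    FragmentHit(1020, 0.99, 0.40),  # dropped: coverage below 0.7
]
ani, af = fragment_ani(hits, query_len=4080)
print(f"ANI={ani:.3f}, AF={af:.3f}")
```

In this made-up example, two of the four 1020 nt fragments of a 4080 nt query pass the filters, so the query AF is one half.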
In addition to the standard, fragment-based ANI calculation, MANIAC can carry out the calculation using the bidirectional best hits (BBH) approach if the user provides coding sequences (CDSs) for the input genomes, as either nucleotide or amino-acid sequences. The calculation is then carried out analogously to the fragment mode, with the following differences:
- CDSs are used instead of fragments.
- To calculate ANI and AF, only CDSs that are each other's best hits are considered, in both query and subject.
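The bidirectional-best-hit selection described above can be illustrated with a short Python sketch. It is an assumption-laden toy, not MANIAC's implementation: the function names are hypothetical, hits are assumed to be `(query, target, identity, score)` tuples, and "best" is assumed to mean the highest alignment score.

```python
def best_hits(hits):
    """Map each query CDS to its highest-scoring target CDS.

    `hits` is an iterable of (query, target, identity, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    best = {}
    for query, target, identity, score in hits:
        if query not in best or score > best[query][2]:
            best[query] = (target, identity, score)
    return best


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (cds_a, cds_b, identity) for CDS pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = []
    for cds_a, (cds_b, identity, _score) in best_ab.items():
        # Keep the pair only if the best hit of cds_b points back to cds_a.
        if cds_b in best_ba and best_ba[cds_b][0] == cds_a:
            pairs.append((cds_a, cds_b, identity))
    return pairs


# Made-up hits between genomes A and B.
a_vs_b = [
    ("A_CDS1", "B_CDS1", 0.92, 300.0),
    ("A_CDS1", "B_CDS2", 0.60, 120.0),
    ("A_CDS2", "B_CDS2", 0.75, 200.0),
    ("A_CDS3", "B_CDS2", 0.70, 180.0),  # not reciprocal: B_CDS2 prefers A_CDS2
]
b_vs_a = [
    ("B_CDS1", "A_CDS1", 0.92, 300.0),
    ("B_CDS2", "A_CDS2", 0.75, 200.0),
]
bbh = bidirectional_best_hits(a_vs_b, b_vs_a)
ani = sum(identity for _, _, identity in bbh) / len(bbh)
print(bbh, ani)
```

Only the reciprocal pairs contribute to the mean identity; `A_CDS3` is ignored because its best hit is not reciprocated.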
First, clone the GitHub repository:
git clone https://github.com/bioinf-mcb/MANIAC
conda create -n maniac -c conda-forge mamba python=3.9
conda activate maniac
mamba install -c conda-forge -c bioconda bash snakemake pandas biopython=1.79 mmseqs2 r-base r-essentials r-arrow datamash
cd MANIAC
snakemake --cores 8 --quiet --snakefile MANIAC --configfile test/configs/easy-fragment-based.yml
snakemake --cores 8 --quiet --snakefile MANIAC --configfile test/configs/easy-cds-aa.yml
snakemake --cores 8 --quiet --snakefile MANIAC --configfile test/configs/easy-cds-nt.yml
- python=3.9
- bash=5.2.21
- r-base=4.4.1
- r-essentials=4.4
- r-arrow=17.0.0
- snakemake=8.5
- pandas=2.2
- biopython=1.79
- mmseqs2=15.6
- datamash=1.8
This section will guide you on how to prepare your input files, create a yaml configuration file, and run the MANIAC software. We'll also cover the types of output files you can expect from MANIAC.
MANIAC requires one of two types of input files:
- Full genome files (for the fragment calculation),
- Nucleotide or amino-acid coding-sequences (for the BBH calculation).
Each file should be in FASTA format. The header convention for CDS input is the genome name, followed by a `_CDS` string, followed by a unique suffix. For example, if a genome named XYZ_phageVp123 has three coding sequences, the input file headers could be `>XYZ_phageVp123_CDS1`, `>XYZ_phageVp123_CDS2` and `>XYZ_phageVp123_CDS5`.
Examples of input files are located in `test/data`.
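To check that CDS headers follow this convention before running the pipeline, a small helper along these lines may be useful. It is a sketch under one assumption: that the genome name is everything before the last `_CDS` in the header. How MANIAC itself parses headers is not stated here, and `split_cds_header` is a hypothetical name.

```python
def split_cds_header(header):
    """Split a '>genome_CDSsuffix' FASTA header into (genome_name, suffix)."""
    # Drop the leading '>' and any description after the first whitespace.
    tokens = header.lstrip(">").split()
    name = tokens[0] if tokens else ""
    # Assumption: the genome name is everything before the last '_CDS'.
    genome, sep, suffix = name.rpartition("_CDS")
    if not sep or not genome or not suffix:
        raise ValueError(
            f"header does not follow the <genome>_CDS<suffix> convention: {header!r}"
        )
    return genome, suffix


for h in (">XYZ_phageVp123_CDS1", ">XYZ_phageVp123_CDS2", ">XYZ_phageVp123_CDS5"):
    print(split_cds_header(h))
```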
MANIAC uses a yaml configuration file to set the workflow parameters. Here's an example of what a simple configuration file might look like:
INPUT_FILE: "test/data/fragment-based.fasta"
OUTPUT_DIR: "test/output/FRAGMENT-BASED"
MODE: FRAGMENTS_NT
FAST: False
Here are details of various parameters.
- `INPUT_FILE`: full genome or CDS file
- `OUTPUT_DIR`: directory where the output should be written
- `MODE`: `FRAGMENTS_NT` requires full genomes as an input, while `CDS_NT` and `CDS_AA` use BBH to calculate ANI and require the input to be CDS (nucleotide or protein respectively) [`FRAGMENTS_NT` | `CDS_NT` | `CDS_AA`]
- `FAST`: Enable Fast mode. Fast mode will overwrite some parameters to prioritize speed over accuracy (`KMER: 15`) [True/False]
- `COVERAGE`: minimal query coverage used for filtering (default: `0.7`)
- `IDENTITY`: minimal query identity used for filtering (default: `0.3`)
- `FRAGMENT_SIZE`: length of the genome fragments to be used in search (default: `1020`)
- `HOMOLOGS`: BBH & homologous CDS definition
  - `IDENTITY`: (default: `0.3`)
  - `COVERAGE`: (default: `0.7`)
- `CONSERVED`: conservative CDS definition
  - `IDENTITY`: (default: `0.8`)
  - `COVERAGE`: (default: `0.5`)
- `DELETE_INTERMEDIATE_FILES`: [True/False] (default: `True`)
- `MEMORY_EFFICIENT`: mode used to run in a memory stringent manner. Only loads table columns that are important for the analysis and drops all columns that are not used for ANI calculation [True/False] (default: `True`)
- `MMSEQS_PARAMS`: any additional parameters to be passed to MMseqs2 search, default values calibrated with Pyani
  - `EVALUE`: (default: `1e-15`)
  - `SENSITIVITY`: (default: `7.5`)
  - `ZDROP`: (default: `150`)
  - `MAX_SEQS`: (default: `10000`)
  - `MAX_SEQ_LEN`: (default: `100000`)
  - `KMER`: (default: `100000`)
  - `SEED_SUB_MATRIX`: (default: `scoring/blastn-scoring.out`)
  - `SUB_MATRIX`: (default: `scoring/blastn-scoring.out`)
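Before launching a long run, it can help to sanity-check a configuration against the parameter descriptions above. The following Python sketch does this on an already-loaded dictionary. The helper `check_config` is hypothetical and not part of MANIAC; treating `INPUT_FILE`, `OUTPUT_DIR` and `MODE` as required, and the thresholds as fractions between 0 and 1, are assumptions based on the minimal example and the listed defaults.

```python
# The three MODE values listed in the parameter description.
ALLOWED_MODES = {"FRAGMENTS_NT", "CDS_NT", "CDS_AA"}


def check_config(config):
    """Return a list of problems found in a MANIAC-style config dict.

    Hypothetical helper: the set of required keys is an assumption.
    """
    problems = []
    for key in ("INPUT_FILE", "OUTPUT_DIR", "MODE"):
        if key not in config:
            problems.append(f"missing required key: {key}")
    mode = config.get("MODE")
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f"unknown MODE: {mode}")
    # Assumption: top-level thresholds are fractions, as the defaults suggest.
    for key in ("COVERAGE", "IDENTITY"):
        value = config.get(key)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be between 0 and 1")
    return problems


config = {
    "INPUT_FILE": "test/data/fragment-based.fasta",
    "OUTPUT_DIR": "test/output/FRAGMENT-BASED",
    "MODE": "FRAGMENTS_NT",
    "FAST": False,
}
print(check_config(config))
```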
For full genome and nucleotide CDS mode, the alignment scoring matrix should be provided. Matrices for the blastn and unit-scoring modes are provided in the repository. Please note that the sensitivity parameter will not matter for nucleotide-based calculations, only k-mer size will. If FAST is enabled, k-mer size will be forced to 15.
For amino-acid calculations, no scoring matrix has to be provided, but a more sensitive search is recommended (such as `-s 7.5` or higher). Please refer to the original MMseqs2 publication [2].
Example configuration files for the different calculation modes are located in `test/configs`. A minimum working example is provided, as well as different examples with more complete sets of parameters for advanced users. We strongly recommend against changing the MMseqs2 input parameters, as they have been optimised for the different calculation modes.
After your input files are ready and your configuration file is set, you can run MANIAC as follows:
snakemake --cores 8 --quiet --snakefile MANIAC --configfile your-path-to-configuration-file.yml
where `your-path-to-configuration-file.yml` is the full path to your configuration file. The configuration file determines whether MANIAC runs in the fragment mode or the BBH mode. The value of `--cores` should be adapted to the machine you are using to run MANIAC.
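If you launch MANIAC from a script, the command above can be assembled programmatically. The sketch below only builds the argument list using the flags shown in this README; `maniac_command` is a hypothetical helper, and defaulting `--cores` to the number of CPUs on the machine is an assumption, not a MANIAC recommendation.

```python
import os
import shlex


def maniac_command(configfile, cores=None, snakefile="MANIAC"):
    """Build the snakemake command line for a MANIAC run as an argument list."""
    if cores is None:
        # Assumption: fall back to all CPUs available on this machine.
        cores = os.cpu_count() or 1
    return [
        "snakemake", "--cores", str(cores), "--quiet",
        "--snakefile", snakefile, "--configfile", configfile,
    ]


cmd = maniac_command("test/configs/easy-fragment-based.yml", cores=8)
print(shlex.join(cmd))
# To run it from the MANIAC directory, pass the list to subprocess.run(cmd, check=True).
```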
MANIAC generates output files in the user-defined output directory. The `genome-alignment.csv` file contains the ANI results along with associated metrics. The file is a table with the fields detailed below:
Metric | Description |
---|---|
ANI | Average nucleotide identity between the query and reference sequences |
len_1 | The length of the query sequence |
len_2 | The length of the reference sequence |
ani_alnlen | The total length of aligned nucleotides between the query and reference sequences |
af_1 | Alignment fraction of the query sequence calculated by dividing the aligned length by the total length of the query sequence |
af_2 | Alignment fraction of the reference sequence calculated by dividing the aligned length by the total length of the reference sequence |
af_min | The minimum alignment fraction between the query and reference sequence calculated by dividing the aligned nucleotide length by the shorter sequence between the query and reference sequence |
af_max | The maximum alignment fraction between the query and reference sequence calculated by dividing the aligned nucleotide length by the longer sequence between the query and reference sequence |
af_mean | Mean alignment fraction between the query and reference sequences. It is calculated by averaging the alignment fractions of the query and reference sequences, weighted by their lengths. Because the results of MANIAC are asymmetrical, users can also calculate a mean from the two per-sequence alignment fractions, i.e., (af_1 + af_2)/2 |
af_jaccard | The Jaccard index of the alignment fraction, calculated as the ratio of the aligned length to the total length of the union of the query and reference sequences |
seq1_n_prots | Number of proteins or CDS in the query sequence |
seq2_n_prots | Number of proteins or CDS in the reference sequence |
min_prots | The minimum number of proteins or CDS between the query and reference sequences |
wGRR | wGRR is the weighted gene repertoire relatedness. It is calculated as the ratio of bi-directional best hits between the query and reference genomes weighted by the sequence identity of homologs (CDS or protein homologs for the CDS or protein mode respectively) |
wgANI | wgANI is the whole genome ANI. It is calculated by multiplying ANI by the mean AF |
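As an illustration of how these fields relate, the Python sketch below recomputes the simple pairwise mean (af_1 + af_2)/2 and multiplies it by ANI for each row. The rows are made up, only a subset of the columns is used, ANI is assumed to be stored as a fraction, and the output names `af_pair_mean` and `wgani_from_pair_mean` are hypothetical; the real file may differ in these respects.

```python
import csv
import io

# Made-up rows using a subset of the column names from the table above.
example_csv = """ANI,len_1,len_2,ani_alnlen,af_1,af_2
0.95,40000,50000,30000,0.75,0.60
0.80,10000,10000,5000,0.50,0.50
"""


def add_derived_metrics(row):
    """Return a copy of the row with two derived, illustrative metrics."""
    af_1 = float(row["af_1"])
    af_2 = float(row["af_2"])
    out = dict(row)
    # Unweighted mean of the two per-sequence alignment fractions.
    out["af_pair_mean"] = (af_1 + af_2) / 2
    # ANI scaled by that mean (ANI assumed to be a fraction here).
    out["wgani_from_pair_mean"] = float(row["ANI"]) * out["af_pair_mean"]
    return out


rows = [add_derived_metrics(r) for r in csv.DictReader(io.StringIO(example_csv))]
for r in rows:
    print(r["af_pair_mean"], r["wgani_from_pair_mean"])
```

To work on real output, replace `io.StringIO(example_csv)` with an open handle to `genome-alignment.csv` in your output directory.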
1. Goris, J. et al. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57, 81–91 (2007).
2. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
- Fragment-based calculation has duplicate entries (a-b & b-a)
- ANI for proteins (CDS) is AAI
- wGRR for ORFs is not sensu stricto wGRR
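If the duplicate a-b and b-a entries of the fragment-based output are inconvenient downstream, one possible way to collapse them is to group rows by the unordered genome pair and average the two directions. This is a sketch of one option, not a MANIAC feature: the `(seq1, seq2, ani)` tuple layout and the function name are hypothetical, and averaging is only one of several reasonable choices.

```python
def collapse_pairs(rows):
    """Average ANI over the two directions of each unordered genome pair.

    `rows` is an iterable of (seq1, seq2, ani) tuples (hypothetical layout).
    """
    grouped = {}
    for seq1, seq2, ani in rows:
        # Sorting the names makes a-b and b-a share the same key.
        key = tuple(sorted((seq1, seq2)))
        grouped.setdefault(key, []).append(ani)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Made-up values for three directed comparisons.
rows = [("a", "b", 0.90), ("b", "a", 0.92), ("a", "c", 0.70)]
collapsed = collapse_pairs(rows)
print(collapsed)
```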