The mvPPT website is at http://www.mvppt.club/.
Scores are also available on Google Drive.
mvPPT (Pathogenicity Prediction Tool for missense variants) is a comprehensive pathogenicity prediction tool.
Three training sets were built from different combinations of variants from ClinVar, HGMD, UniProt, and the Genome Aggregation Database (gnomAD).
Variants were annotated with ANNOVAR. The features fall into three groups:
- A) AFs, AAFs, and GFs of variants estimated from 125,748 exomes in gnomAD (version 2.1.1);
- B) genomic context of the variant, i.e., region/gene-based information from GeVIR, VIRLoF, oe mis upper, HIP, CCRs, InterPro domain, and the amino acid sequence before and after mutation;
- C) pathogenicity likelihood scores assessed by different component tools, including MutationAssessor, SIFT, PROVEAN, GERP++ RS, phyloP, phastCons, and SiPhy.
We annotated the datasets with ANNOVAR using dbNSFP (v4.1a, see URLs) to generate some of the required prediction scores from the component tools, including InterPro domain, MutationAssessor, phyloP, GERP, phastCons, PROVEAN, and SiPhy. Mutations located in InterPro domains were recorded as 1 and all others as 0. AFs, GFs, and AAFs of each variant in different populations were obtained from the gnomAD exomes database. If a variant was not present in the database, its AFs, AAFs, HomFs, and HetFs were set to 0 and its WtFs were set to 1. The GeVIR, VIRLoF, oe mis upper, HIP, and CCR scores were downloaded from their respective websites (see URLs). One-hot encoding was applied to the amino acid sequence, representing each amino acid as a binary vector of length 20 with a single non-zero value. All features were selected to provide complementary information, and they either did not require training or their training data are publicly available, so that those variants could be excluded from our data.
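The encoding rules in the paragraph above can be sketched as follows. This is a minimal illustration, not the code in `src/`: the amino acid ordering, the function names, the field keys, and the use of `.` as a missing-value marker are all assumptions made for the example.

```python
import numpy as np

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Values used when a variant is absent from gnomAD, as described above:
# frequency features become 0 and the wild-type genotype frequency becomes 1.
# The key names are hypothetical.
ABSENT_DEFAULTS = {"AF": 0.0, "AAF": 0.0, "HomF": 0.0, "HetF": 0.0, "WtF": 1.0}


def one_hot(aa):
    """Return a length-20 binary vector with a single 1 at the position of `aa`."""
    vec = np.zeros(len(AMINO_ACIDS), dtype=np.int8)
    vec[AA_INDEX[aa]] = 1
    return vec


def fill_frequencies(record):
    """Replace missing (None or absent) frequency fields with the defaults."""
    return {
        key: default if record.get(key) is None else record[key]
        for key, default in ABSENT_DEFAULTS.items()
    }


def interpro_flag(domain):
    """Return 1 if the variant has an InterPro domain annotation, else 0.

    Treating None, "" and "." as "no domain" is an assumption.
    """
    return int(domain not in (None, "", "."))
```

For example, `np.concatenate([one_hot("R"), one_hot("W")])` gives a 40-element vector for an arginine-to-tryptophan substitution, and `fill_frequencies({})` returns the defaults for a variant that is not in gnomAD.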
The MVP, REVEL, PrimateAI, FATHMM-XF, ClinPred, MetaSVM/MetaLR, PolyPhen2, and VEST4 scores were obtained from dbNSFP v4.1a. The M-CAP (version 1.4), MISTIC, CAPICE, ReVe, and CADD (version 1.6) scores were downloaded from their respective websites.
mvPPT was trained using the Python package LightGBM (version 2.3.1), and parameters were tuned by Bayesian optimization (version 1.2.0). The random state was set to 1 throughout the model training process.
The environment used to build mvPPT in our study:
- python 3.7.4
- sklearn 0.22.1
- numpy 1.17.3
- scipy 1.4.1
- pandas 0.25.3
- matplotlib 3.1.2
- lightGBM 2.3.1
- bayesian-optimization 1.1.0
- runtime environment (we recommend conda)
    conda create -n mvppt python==3.7
    conda activate mvppt
    conda install --file=requirements_conda.txt
- annotation
  - You need to apply for ANNOVAR and add its scripts to `annovar`.
  - We uploaded some databases to `annodb`. Because of the data size, you need to unzip the files in `annodb` before use.
  - You can download `ensGene` and `dbnsfp35a` with ANNOVAR.
  - We provide only a demo of `gnomad211exoms_allpop` and `p6b`; you have to obtain the data yourself from `gnomad_v211` and `dbnsfp41a`.
  - To rebuild and decompress `gnomad_AAF.txt.gz`:
      cat xaa xab > gnomad_AAF.txt.gz
      gunzip gnomad_AAF.txt.gz
  - Run the annotation:
      bash src/anno.sh filename.vcf
      python3 src/annoseq.py filenameGeneUniqAnno.txt filenameTotalGeneUniqAnnoEnsSeq.txt
- predict
    python3 src/predict.py filenameTotalGeneUniqAnnoEnsSeq.txt
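The annotation and prediction commands above can be chained from a small driver script. The sketch below only builds the command lines. It assumes that the intermediate file names are the VCF stem plus the suffixes shown in the example commands and that the scripts are run from the repository root; `build_commands` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def build_commands(vcf_path):
    """Return the three pipeline commands as argv lists for one VCF file."""
    stem = Path(vcf_path).stem  # "filename.vcf" -> "filename"
    # Assumed naming convention, taken from the example commands above.
    anno = f"{stem}GeneUniqAnno.txt"
    seq = f"{stem}TotalGeneUniqAnnoEnsSeq.txt"
    return [
        ["bash", "src/anno.sh", str(vcf_path)],
        ["python3", "src/annoseq.py", anno, seq],
        ["python3", "src/predict.py", seq],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_commands("filename.vcf"):
        print(" ".join(cmd))
```

To execute the steps rather than print them, each list can be passed to `subprocess.run(cmd, check=True)` once ANNOVAR and the annotation databases are in place.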