mvPPT

mvPPT (Pathogenicity Prediction Tool for missense variants) is a comprehensive tool for predicting the pathogenicity of missense variants.
The mvPPT website is at http://www.mvppt.club/; precomputed scores are also available on Google Drive.

Training data

Three training sets were built from different combinations of variants from ClinVar, HGMD, UniProt, and the Genome Aggregation Database (gnomAD).

Annotation

Variants were annotated with ANNOVAR. The features fall into three groups:

A) allele frequencies (AFs), amino acid frequencies (AAFs), and genotype frequencies (GFs) of variants, estimated from 125,748 exomes in gnomAD (version 2.1.1);
B) genomic context of the variant, i.e., region/gene-based information from GeVIR, VIRLoF, oe mis upper, HIP, CCRs, InterPro domain, and the amino acid sequence before and after mutation;
C) pathogenicity likelihood scores from different component tools, including MutationAssessor, SIFT, PROVEAN, GERP++ RS, phyloP, phastCons, and SiPhy.

We annotated the datasets with ANNOVAR using dbNSFP (v4.1a, see URLs) to generate some of the required annotations and prediction scores from different component tools, including InterPro domain, MutationAssessor, phyloP, GERP, phastCons, PROVEAN, and SiPhy. Variants located in InterPro domains were recorded as 1 and all others as 0. AFs, GFs, and AAFs of each variant in different populations were obtained from the gnomAD exomes database. If a variant was not present in the database, its AFs, AAFs, HomFs, and HetFs were set to 0 and its WtFs were set to 1. The GeVIR, VIRLoF, oe mis upper, HIP, and CCR scores were downloaded from their respective websites (see URLs). The amino acid sequence was one-hot encoded, representing each amino acid as a binary vector of length 20 with a single non-zero value. All features were selected to provide complementary information, and they either did not require training or had publicly available training data, so that it could be excluded from our data.
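
The encodings described above can be illustrated with a minimal sketch. This is not the code used by mvPPT: the residue ordering, the field names (`AF`, `AAF`, `HomF`, `HetF`, `WtF`), the function names, and the use of `.` as a missing-value marker are all assumptions made for illustration.

```python
# Illustrative sketch only; names and orderings are assumptions, not mvPPT's code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is assumed


def one_hot(residue):
    """Return a length-20 binary vector with a single 1 marking `residue`."""
    if len(residue) != 1 or residue not in AMINO_ACIDS:
        raise ValueError("not a standard amino acid: %r" % (residue,))
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


def interpro_flag(domain_annotation):
    """Return 1 if the variant has an InterPro domain annotation, else 0.

    Treating None, "" and "." as "no domain" is an assumption about the input.
    """
    return 0 if domain_annotation in (None, "", ".") else 1


# Values assigned when a variant is absent from gnomAD:
# frequencies become 0 and the wild-type frequency becomes 1.
ABSENT_DEFAULTS = {"AF": 0.0, "AAF": 0.0, "HomF": 0.0, "HetF": 0.0, "WtF": 1.0}


def fill_absent(record):
    """Return a copy of `record` with missing frequency fields filled in."""
    filled = dict(record)
    for key, default in ABSENT_DEFAULTS.items():
        if filled.get(key) is None:
            filled[key] = default
    return filled
```

For example, `one_hot("A")` yields a vector whose first element is 1 under the assumed ordering, and `fill_absent({"AF": None})` returns a record with `AF` set to 0 and `WtF` set to 1.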

The MVP, REVEL, PrimateAI, FATHMM-XF, ClinPred, MetaSVM/MetaLR, PolyPhen2, and VEST4 scores were obtained from dbNSFP v4.1a. The M-CAP (version 1.4), MISTIC, CAPICE, ReVe, and CADD (version 1.6) scores were downloaded from their respective websites.

Training

mvPPT was trained using the Python package LightGBM (version 2.3.1), and its parameters were tuned by Bayesian optimization (version 1.2.0). The random state was set to 1 throughout the model training process.
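
A tuning loop of this kind might look like the sketch below. It is a hedged illustration, not the authors' training script: the search space in `PBOUNDS`, the 5-fold cross-validated AUC objective, the number of boosting rounds, and the function names are all assumptions. Only the use of LightGBM, the `bayesian-optimization` package, and a random state of 1 comes from the text.

```python
SEED = 1  # the text fixes the random state to 1 throughout training

# Hypothetical search space: the tuned parameters and their ranges are not
# stated in this README.
PBOUNDS = {
    "num_leaves": (16, 256),
    "learning_rate": (0.01, 0.3),
    "feature_fraction": (0.5, 1.0),
}


def to_lgb_params(num_leaves, learning_rate, feature_fraction):
    """Convert the optimizer's continuous proposals into LightGBM parameters."""
    return {
        "objective": "binary",
        "metric": "auc",
        "num_leaves": int(round(num_leaves)),  # must be an integer
        "learning_rate": float(learning_rate),
        "feature_fraction": float(feature_fraction),
        "seed": SEED,
        "verbose": -1,
    }


def tune(X, y, init_points=5, n_iter=25):
    """Maximise cross-validated AUC over PBOUNDS and return the best parameters."""
    # Imported lazily so the helpers above can be used without these packages.
    import lightgbm as lgb
    from bayes_opt import BayesianOptimization

    def cv_auc(**proposal):
        # Build a fresh Dataset per evaluation, since parameters bind to it.
        train_set = lgb.Dataset(X, label=y)
        history = lgb.cv(
            to_lgb_params(**proposal),
            train_set,
            num_boost_round=200,
            nfold=5,
            stratified=True,
            seed=SEED,
        )
        # The key is "auc-mean" in LightGBM 2.x; later releases add a prefix.
        key = next(k for k in history if k.endswith("auc-mean"))
        return max(history[key])

    optimizer = BayesianOptimization(f=cv_auc, pbounds=PBOUNDS, random_state=SEED)
    optimizer.maximize(init_points=init_points, n_iter=n_iter)
    return to_lgb_params(**optimizer.max["params"])
```

Passing the same seed to the parameter dictionary, to `lgb.cv`, and to the optimizer is one way to keep a run repeatable, which is presumably the purpose of fixing the random state.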

Environments

The environment used to build mvPPT in our study:

python 3.7.4
sklearn 0.22.1
numpy 1.17.3
scipy 1.4.1
pandas 0.25.3
matplotlib 3.1.2
lightGBM 2.3.1
bayesian-optimization 1.1.0

How to predict the score by yourself

runtime environment (we recommend conda)

```
conda create -n mvppt python==3.7
conda activate mvppt
conda install --file=requirements_conda.txt
```

annotation
- You need to apply for ANNOVAR and add the scripts to annovar.
- Because of the data size, we uploaded only some of the databases to annodb.
- You need to unzip the files in annodb.
- You can download ensGene and dbnsfp35a with ANNOVAR.
- We provide only demo versions of gnomad211exoms_allpop and p6b; you have to obtain the full files yourself from gnomad_v211 and dbnsfp41a.
- For gnomad_AAF.txt.gz, concatenate the split parts and then decompress the result:
```
cat xaa xab > gnomad_AAF.txt.gz
gunzip gnomad_AAF.txt.gz
```
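
Where `cat` and `gunzip` are not available, the same reassembly can be sketched in Python using only the standard library. The function name `reassemble` and its arguments are hypothetical. Unlike `gunzip`, this sketch leaves the joined `.gz` archive in place.

```python
import gzip
import shutil
from pathlib import Path


def reassemble(parts, archive, output):
    """Concatenate split `parts` into `archive`, then decompress it to `output`."""
    # Equivalent of `cat xaa xab > gnomad_AAF.txt.gz`: byte-wise concatenation.
    with open(archive, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)
    # Equivalent of `gunzip`, except that the archive itself is kept.
    with gzip.open(archive, "rb") as compressed, open(output, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return Path(output)
```

A call such as `reassemble(["xaa", "xab"], "gnomad_AAF.txt.gz", "gnomad_AAF.txt")` mirrors the two shell commands; the parts must be listed in their original order.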

```
bash src/anno.sh filename.vcf
python3 src/annoseq.py filenameGeneUniqAnno.txt filenameTotalGeneUniqAnnoEnsSeq.txt
```

predict

```
python3 src/predict.py filenameTotalGeneUniqAnnoEnsSeq.txt
```
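
The annotation and prediction commands above can be chained from Python, as in the following sketch. The helper names `pipeline_commands` and `run_pipeline` are hypothetical, and the sketch assumes that intermediate files are written to the working directory and named after the input VCF following the pattern shown in the commands (`filename.vcf` giving `filenameGeneUniqAnno.txt`). The scripts' actual output behavior is not documented here, so check it before relying on this.

```python
import subprocess
from pathlib import Path


def pipeline_commands(vcf):
    """Return the three commands for `vcf` as argument lists, in run order."""
    stem = Path(vcf).name
    if stem.endswith(".vcf"):
        stem = stem[: -len(".vcf")]
    # Assumed naming pattern, inferred from the example commands.
    annotated = f"{stem}GeneUniqAnno.txt"
    with_seq = f"{stem}TotalGeneUniqAnnoEnsSeq.txt"
    return [
        ["bash", "src/anno.sh", str(vcf)],
        ["python3", "src/annoseq.py", annotated, with_seq],
        ["python3", "src/predict.py", with_seq],
    ]


def run_pipeline(vcf):
    """Run annotation, sequence annotation, and prediction; stop on first failure."""
    for command in pipeline_commands(vcf):
        subprocess.run(command, check=True)
```

Run `run_pipeline("filename.vcf")` from the repository root with the conda environment activated. `check=True` raises an error as soon as one step fails, so later steps never run on missing inputs.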
