Annotation From Protein Level

Wanding Zhou edited this page Feb 2, 2016 · 17 revisions

Protein level inputs are handled by the panno subcommand.

Protein sites

To use a UniProt ID as the protein name, first download the UniProt ID map:

transvar config --download_idmap

A UniProt ID can then be used in place of the gene name by passing the --uniprot option to TransVar. For example,

$ transvar panno --ccds -i 'Q5VUM1:47' --uniprot
Q5VUM1:47	CCDS4972 (protein_coding)	C6ORF57	+
   chr6:g.71289191_71289193/c.139_141/p.47S	cds_in_exon_2
   protein_sequence=S;cDNA_sequence=TCC;gDNA_sequence=TCC;source=CCDS

TransVar uses the keyword extension ref, as in Q5VUM1:p.47refS, to distinguish a reference site from the synonymous mutation Q5VUM1:p.47S. The former notation specifies that the reference protein sequence is S, while the latter specifies that the target protein sequence is S.
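
The distinction between the two notations can be illustrated with a small parser. This is a minimal sketch, not TransVar's own parser: the function name parse_protein_site and the regular expression are assumptions made for illustration, and they cover only the single-site forms shown above.

```python
import re

# Hypothetical pattern covering only 'ID:p.47refS', 'ID:p.47S' and 'ID:47'.
# It is an illustrative assumption, not TransVar's actual input grammar.
SITE_RE = re.compile(
    r"^(?P<protein>[^:]+):(?:p\.)?(?P<pos>\d+)(?:(?P<ref>ref)?(?P<aa>[A-Z*]))?$"
)

def parse_protein_site(text):
    """Return (protein, position, reference_aa, target_aa) for a simple site."""
    m = SITE_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized site notation: {text!r}")
    aa = m.group("aa")
    is_ref = m.group("ref") is not None
    return (
        m.group("protein"),
        int(m.group("pos")),
        aa if is_ref else None,      # 'ref' keyword: residue is the reference
        aa if not is_ref else None,  # no keyword: residue is the target
    )

print(parse_protein_site("Q5VUM1:p.47refS"))  # S is the reference residue
print(parse_protein_site("Q5VUM1:p.47S"))     # S is the target residue
```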

Protein motif

For example, one can find the genomic location of a DRY motif in protein P28222 by issuing the following command,

$ transvar panno -i 'P28222:p.146_148refDRY' --uniprot --ccds
P28222:p.146_148refDRY	CCDS4986 (protein_coding)	HTR1B	-
   chr6:g.78172677_78172685/c.436_444/p.D146_Y148	cds_in_exon_1
   protein_sequence=DRY;cDNA_sequence=GACCGCTAC;gDNA_sequence=GTAGCGGTC;source=C
   CDS
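
In the output above, HTR1B lies on the minus strand, so the reported gDNA_sequence is the reverse complement of the cDNA_sequence. The following Python sketch checks that relationship for the values shown; the helper name reverse_complement is hypothetical and handles only the unambiguous bases A, C, G and T.

```python
# Complement table for unambiguous DNA bases only.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string made of A/C/G/T."""
    return seq.translate(_COMPLEMENT)[::-1]

cdna = "GACCGCTAC"  # cDNA_sequence from the output above (codes for DRY)
gdna = "GTAGCGGTC"  # gDNA_sequence from the output above
print(reverse_complement(cdna) == gdna)
```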

One can also use the wildcard x (lowercase) in the motif to stand for any amino acid.

$ transvar panno -i 'HTR1B:p.365_369refNPxxY' --ccds
HTR1B:p.365_369refNPxxY	CCDS4986 (protein_coding)	HTR1B	-
   chr6:g.78172014_78172028/c.1093_1107/p.N365_Y369	cds_in_exon_1
   protein_sequence=NPIIY;cDNA_sequence=AAC..TAT;gDNA_sequence=ATA..GTT;source=C
   CDS
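
The wildcard behaviour can be sketched as a position-by-position comparison. This is an illustrative assumption about the matching rule, not TransVar's implementation, and motif_matches is a hypothetical name.

```python
def motif_matches(motif, residues):
    """True if residues fit the motif; a lowercase 'x' matches any residue."""
    if len(motif) != len(residues):
        return False
    return all(m == "x" or m == r for m, r in zip(motif, residues))

# NPIIY is the protein_sequence reported for HTR1B:p.365_369refNPxxY above.
print(motif_matches("NPxxY", "NPIIY"))
```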

Protein range

$ transvar panno --ccds -i 'ABCB11:p.200_400'

outputs

ABCB11:p.200_400	CCDS46444 (protein_coding)	ABCB11	-
   chr2:g.169833195_169851872/c.598_1200/p.T200_K400	cds_in_exons_[6,7,8,9,10,11]
   protein_sequence=TRF..DRK;cDNA_sequence=ACA..AAA;gDNA_sequence=TTT..TGT;sourc
   e=CCDS
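
The cDNA coordinates in the outputs above follow from the protein coordinates by codon arithmetic: residue n occupies coding positions 3(n-1)+1 through 3n. A minimal Python sketch of that mapping follows; the function name is hypothetical, and the sketch ignores transcript-specific details such as exon boundaries, which affect only the genomic coordinates.

```python
def protein_range_to_cdna(start_aa, end_aa):
    """Map a 1-based, inclusive residue range to 1-based, inclusive CDS coordinates."""
    if start_aa < 1 or end_aa < start_aa:
        raise ValueError("expected 1 <= start_aa <= end_aa")
    return 3 * (start_aa - 1) + 1, 3 * end_aa

print(protein_range_to_cdna(200, 400))  # (598, 1200), cf. c.598_1200 above
print(protein_range_to_cdna(146, 148))  # (436, 444), cf. c.436_444 above
```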

Single amino acid substitution

TransVar accepts mutations in the form PIK3CA:p.E545K, or with the reference or alternative amino acid omitted, e.g., PIK3CA:p.545K or PIK3CA:p.E545. Both inputs and outputs use the native HGVS format. The reference amino acid is used to narrow the search scope of candidate transcripts. The alternative amino acid is used to infer the nucleotide change that results in that amino acid.

$ transvar panno -i PIK3CA:p.E545K --ensembl

outputs

PIK3CA:p.E545K	ENST00000263967 (protein_coding)	PIK3CA	+
   chr3:g.178936091G>A/c.1633G>A/p.E545K	cds_in_exon_10
   reference_codon=GAG;candidate_codons=AAG,AAA;candidate_mnv_variants=chr3:g.17
   8936091_178936093delGAGinsAAA;dbsnp=rs104886003(chr3:178936091G>A);missense;a
   liases=ENSP00000263967;source=Ensembl
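
For downstream processing, the last field of the output can be split into key/value pairs and bare flags such as missense. The sketch below assumes that the line wrapping shown above is only a display artifact and that the field arrives as a single semicolon-separated string; parse_info is a hypothetical helper, not part of TransVar.

```python
def parse_info(info):
    """Split a semicolon-separated info string into key/value pairs and bare flags."""
    fields, flags = {}, []
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value   # e.g. reference_codon=GAG
        else:
            flags.append(item)    # e.g. missense
    return fields, flags

# Info field from the PIK3CA:p.E545K output above, re-joined into one string.
info = (
    "reference_codon=GAG;candidate_codons=AAG,AAA;"
    "candidate_mnv_variants=chr3:g.178936091_178936093delGAGinsAAA;"
    "dbsnp=rs104886003(chr3:178936091G>A);missense;"
    "aliases=ENSP00000263967;source=Ensembl"
)
fields, flags = parse_info(info)
print(fields["candidate_codons"].split(","), flags)
```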

One may encounter ambiguous cases in which multiple nucleotide substitutions can explain the amino acid change. For example,

$ transvar panno -i ACSL4:p.R133R --ccds
ACSL4:p.R133R	CCDS14548 (protein_coding)	ACSL4	-
   chrX:g.108926078G>T/c.399C>A/p.R133R	cds_in_exon_2
   reference_codon=CGC;candidate_codons=AGG,AGA,CGA,CGG,CGT;candidate_snv_varian
   ts=chrX:g.108926078G>C,chrX:g.108926078G>A;candidate_mnv_variants=chrX:g.1089
   26078_108926080delGCGinsCCT,chrX:g.108926078_108926080delGCGinsTCT;synonymous
   ;source=CCDS

In those cases, TransVar prioritizes the candidate base changes by minimizing the edit distance between the reference codon sequence and the target codon sequence. One of the optimal base changes is arbitrarily chosen as the default, and all the candidates are listed in the appended candidate_codons, candidate_snv_variants and candidate_mnv_variants entries.
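
The prioritization can be sketched in Python as follows. This is an illustration under stated assumptions rather than TransVar's code: it uses the standard genetic code, takes the Hamming distance between codons as the edit distance, and breaks ties alphabetically, whereas TransVar chooses arbitrarily among the optimal changes. The names CODON_TABLE and rank_candidate_codons are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}

def hamming(a, b):
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

def rank_candidate_codons(reference_codon, target_aa):
    """Codons for target_aa (excluding the reference), closest to the reference first."""
    candidates = [
        codon for codon, aa in CODON_TABLE.items()
        if aa == target_aa and codon != reference_codon
    ]
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    return sorted(candidates, key=lambda c: (hamming(reference_codon, c), c))

print(rank_candidate_codons("GAG", "K"))  # ['AAG', 'AAA'], as in PIK3CA:p.E545K
print(rank_candidate_codons("CGC", "R"))  # single-base changes come first
```

For ACSL4:p.R133R, the three single-base candidates (CGA, CGG, CGT) rank ahead of the two-base candidates (AGA, AGG), which is consistent with the default c.399C>A (CGC to CGA) reported above.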

Annotate with additional resources

For example, one can annotate SNPs with dbSNP IDs after downloading the dbSNP files. This can be done by

transvar config --download_dbsnp

TransVar automatically downloads the dbSNP file that corresponds to the current default reference version (as set in transvar.cfg). This also sets the corresponding entry in transvar.cfg. With the dbSNP file downloaded, TransVar automatically looks up the dbSNP ID when performing annotation.

$ transvar panno -i 'A1CF:p.A309A' --ccds
A1CF:p.A309A	CCDS7243 (protein_coding)	A1CF	-
   chr10:g.52576004T>G/c.927A>C/p.A309A	cds_in_exon_7
   reference_codon=GCA;candidate_codons=GCC,GCG,GCT;candidate_snv_variants=chr10
   :g.52576004T>C,chr10:g.52576004T>A;dbsnp=rs201831949(chr10:52576004T>G);synon
   ymous;source=CCDS

Note that in order to use dbSNP, one must either download the dbSNP database through transvar config --download_dbsnp, or configure the dbsnp slot in the configuration file via transvar config -k dbsnp -v [path to dbSNP VCF]. A manually configured dbSNP file must be tabix-indexed.