Annotation From Protein Level

Wanding Zhou edited this page Feb 2, 2016 · 17 revisions

Protein level inputs are handled by the panno subcommand.

Protein sites

To use UniProt IDs as protein names, first download the UniProt ID map with

transvar config --download_idmap

A protein ID can then be used in place of a gene name by passing the --uniprot option to TransVar. For example,

$ transvar panno --ccds -i 'Q5VUM1:47' --uniprot
Q5VUM1:47	CCDS4972 (protein_coding)	C6ORF57	+
   chr6:g.71289191_71289193/c.139_141/p.47S	cds_in_exon_2
   protein_sequence=S;cDNA_sequence=TCC;gDNA_sequence=TCC;source=CCDS

TransVar uses the keyword extension ref, as in Q5VUM1:p.47refS, to distinguish a reference-site query from the synonymous mutation Q5VUM1:p.47S. The former notation specifies that the reference protein sequence is S, while the latter specifies that the target protein sequence is S.

Protein motif

For example, one can find the genomic location of a DRY motif in protein P28222 by issuing the following command,

$ transvar panno -i 'P28222:p.146_148refDRY' --uniprot --ccds
P28222:p.146_148refDRY	CCDS4986 (protein_coding)	HTR1B	-
   chr6:g.78172677_78172685/c.436_444/p.D146_Y148	cds_in_exon_1
   protein_sequence=DRY;cDNA_sequence=GACCGCTAC;gDNA_sequence=GTAGCGGTC;source=C
   CDS
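HTR1B is annotated on the minus strand (the `-` column), which is why the reported gDNA_sequence is the reverse complement of the cDNA_sequence. The sketch below checks this relationship for the DRY motif output; `reverse_complement` is a hypothetical helper written for illustration and is not part of TransVar.

```python
# Hypothetical helper for illustration; not part of TransVar.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


cdna = "GACCGCTAC"  # cDNA_sequence from the P28222 output above
gdna = reverse_complement(cdna)
print(gdna)  # GTAGCGGTC, the gDNA_sequence reported for the minus strand
```

For a gene on the plus strand, such as C6ORF57 in the first example, the two sequences are identical.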

One can also use wildcard x (lowercase) in the motif.

$ transvar panno -i 'HTR1B:p.365_369refNPxxY' --ccds
HTR1B:p.365_369refNPxxY	CCDS4986 (protein_coding)	HTR1B	-
   chr6:g.78172014_78172028/c.1093_1107/p.N365_Y369	cds_in_exon_1
   protein_sequence=NPIIY;cDNA_sequence=AAC..TAT;gDNA_sequence=ATA..GTT;source=C
   CDS
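As a rough illustration of the wildcard semantics, the following sketch treats a lowercase x as "any single residue" and compares a motif against the protein_sequence value from the output. It assumes that this is how the wildcard behaves; the function names are hypothetical and TransVar's own matching code may differ.

```python
import re


def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a motif into a regex; lowercase 'x' matches any one residue."""
    parts = ["." if ch == "x" else re.escape(ch) for ch in motif]
    return re.compile("".join(parts))


def motif_matches(motif: str, protein_seq: str) -> bool:
    """Return True if the whole protein segment matches the motif."""
    return motif_to_regex(motif).fullmatch(protein_seq) is not None


print(motif_matches("NPxxY", "NPIIY"))  # True: matches the HTR1B output above
print(motif_matches("NPxxY", "NPIIF"))  # False: last residue differs
```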

Protein range

$ transvar panno --ccds -i 'ABCB11:p.200_400'

outputs

ABCB11:p.200_400	CCDS46444 (protein_coding)	ABCB11	-
   chr2:g.169833195_169851872/c.598_1200/p.T200_K400	cds_in_exons_[6,7,8,9,10,11]
   protein_sequence=TRF..DRK;cDNA_sequence=ACA..AAA;gDNA_sequence=TTT..TGT;sourc
   e=CCDS
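The c. coordinates in these outputs follow from the protein positions by simple codon arithmetic. The sketch below assumes an uninterrupted coding sequence in which codon 1 starts at c.1; `protein_range_to_cdna` is a hypothetical name used only for illustration.

```python
from typing import Tuple


def protein_range_to_cdna(aa_start: int, aa_end: int) -> Tuple[int, int]:
    """Map a 1-based amino acid range to a 1-based cDNA range.

    Assumes codon 1 occupies c.1 to c.3 with no frame offset.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_end")
    return 3 * (aa_start - 1) + 1, 3 * aa_end


print(protein_range_to_cdna(200, 400))  # (598, 1200), i.e. c.598_1200 for ABCB11
print(protein_range_to_cdna(146, 148))  # (436, 444), i.e. c.436_444 for the DRY motif
```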

Single amino acid substitution

TransVar accepts single amino acid substitutions in the form PIK3CA:p.E545K, or with the reference or alternative amino acid omitted, e.g., PIK3CA:p.545K or PIK3CA:p.E545. TransVar reads and writes native HGVS notation. The reference amino acid is used to narrow the search scope of candidate transcripts. The alternative amino acid is used to infer the nucleotide change that produces it.

$ transvar panno -i PIK3CA:p.E545K --ensembl

outputs

PIK3CA:p.E545K	ENST00000263967 (protein_coding)	PIK3CA	+
   chr3:g.178936091G>A/c.1633G>A/p.E545K	cds_in_exon_10
   reference_codon=GAG;candidate_codons=AAG,AAA;candidate_mnv_variants=chr3:g.17
   8936091_178936093delGAGinsAAA;dbsnp=rs104886003(chr3:178936091G>A);missense;a
   liases=ENSP00000263967;source=Ensembl

One may encounter ambiguous cases where multiple substitutions can explain the amino acid change. For example,

$ transvar panno -i ACSL4:p.R133R --ccds
ACSL4:p.R133R	CCDS14548 (protein_coding)	ACSL4	-
   chrX:g.108926078G>T/c.399C>A/p.R133R	cds_in_exon_2
   reference_codon=CGC;candidate_codons=AGG,AGA,CGA,CGG,CGT;candidate_snv_varian
   ts=chrX:g.108926078G>C,chrX:g.108926078G>A;candidate_mnv_variants=chrX:g.1089
   26078_108926080delGCGinsCCT,chrX:g.108926078_108926080delGCGinsTCT;synonymous
   ;source=CCDS

In those cases, TransVar prioritizes the candidate base changes by minimizing the edit distance between the reference codon sequence and the target codon sequence. One of the optimal base changes is arbitrarily chosen as the default, and all candidates are reported in the appended candidate fields (candidate_codons, candidate_snv_variants and candidate_mnv_variants in the output above).
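The prioritization can be sketched as follows. This is only an illustration under stated assumptions: it uses the Hamming distance as the edit distance (reasonable for equal-length codons), the function names are hypothetical, and TransVar's actual implementation and tie-breaking may differ.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length codons differ."""
    if len(a) != len(b):
        raise ValueError("codons must have the same length")
    return sum(x != y for x, y in zip(a, b))


def rank_codons(reference: str, candidates: List[str]) -> List[str]:
    """Order candidate codons by distance to the reference codon.

    sorted() is stable, so ties keep their input order.
    """
    return sorted(candidates, key=lambda codon: hamming(reference, codon))


# reference_codon and candidate_codons from the ACSL4:p.R133R output above
ranked = rank_codons("CGC", ["AGG", "AGA", "CGA", "CGG", "CGT"])
print(ranked)  # ['CGA', 'CGG', 'CGT', 'AGG', 'AGA']
```

The three single-base candidates (CGA, CGG, CGT) correspond to the default variant and the two candidate_snv_variants in the output; the two-base candidates appear as candidate_mnv_variants.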

Insertion

$ transvar panno --ccds -i 'AATK:p.P1331_A1332insTP'
AATK:p.P1331_A1332insTP	CCDS45807 (protein_coding)	AATK	-
   chr17:g.(79093267ins6)/c.(3997_3991ins6)/p.T1330_P1331dupTP	cds_in_exon_13
   left_align_protein=p.A1326_P1327insPT;unalign_protein=p.T1330_P1331dupTP;inse
   rtion_cDNA=ACACCT;insertion_gDNA=AGGTGT;imprecise;source=CCDS

Deletion

$ transvar panno --ccds -i 'AADACL4:p.W263_I267delWRDAI'
AADACL4:p.W263_I267delWRDAI	CCDS30590 (protein_coding)	AADACL4	+
   chr1:g.12726309_12726323del/c.787_801del/p.W263_I267delWRDAI	inside_[cds_in_exon_4]
   left_align_protein=p.W263_I267delWRDAI;unalign_protein=p.W263_I267delWRDAI;im
   precise;source=CCDS

Block substitution

$ transvar panno --ccds -i 'ABCC3:p.Y556_V557delinsRRR'
ABCC3:p.Y556_V557delinsRRR	CCDS32681 (protein_coding)	ABCC3	+
   chr17:g.48745254_48745259delinsAGGAGGAGG/c.1666_1671delinsAGGAGGAGG/p.Y556_V557delinsRRR	cds_in_exon_13
   imprecise;source=CCDS

Frame-shift variants

$ transvar panno --ccds -i 'A1BG:p.G132fs*2'
A1BG:p.G132fs*2	CCDS12976 (protein_coding)	A1BG	-
   chr19:g.58863868delC/c.395delG/p.G132fs*2	cds_in_exon_4
   left_align_cDNA=c.394delG;left_align_gDNA=g.58863867delC;candidates=g.5886387
   3delG/c.393delC/g.58863869delG/c.389delC;source=CCDS

TransVar can also take protein identifiers as input. For example,

$ transvar panno --refseq -i 'NP_006266.2:p.G240Afs*50'
NP_006266.2:p.G240Afs*50	NM_006275 (protein_coding)	SRSF6	+
   chr20:g.42089385delA/c.717delA/p.G240Afs*50	cds_in_exon_6
   left_align_cDNA=c.714delA;left_align_gDNA=g.42089382delA;candidates=g.4208938
   7delG/c.719delG/g.42089386delG/c.718delG;dbxref=GeneID:6431,HGNC:10788,HPRD:0
   9054,MIM:601944;aliases=NP_006266;source=RefSeq

The output gives the exact details of the mutation at the DNA level, properly right-aligned. The candidates field also lists other equally likely mutation identifiers. Each hit has the format [right-align-gDNA]/[right-align-cDNA]/[left-align-gDNA]/[left-align-cDNA], and hits are separated by commas.
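For downstream processing, the candidates value can be split into its four components per hit. The following is a minimal sketch, assuming the value has already been extracted from the info column and that the line wrapping shown in the outputs on this page has been removed; `parse_candidates` and the dictionary keys are hypothetical names, not part of TransVar.

```python
from typing import Dict, List

# Order of the slash-separated parts within each hit, as described above.
FIELDS = (
    "right_align_gDNA",
    "right_align_cDNA",
    "left_align_gDNA",
    "left_align_cDNA",
)


def parse_candidates(value: str) -> List[Dict[str, str]]:
    """Split a candidates value into one dict per comma-separated hit."""
    hits = []
    for hit in value.split(","):
        parts = hit.split("/")
        if len(parts) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} parts, got {hit!r}")
        hits.append(dict(zip(FIELDS, parts)))
    return hits


# candidates value from the NP_006266.2:p.G240Afs*50 output above
hits = parse_candidates("g.42089387delG/c.719delG/g.42089386delG/c.718delG")
print(hits[0]["right_align_cDNA"])  # c.719delG
print(hits[0]["left_align_gDNA"])   # g.42089386delG
```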

The same applies when the underlying mutation is an insertion. TransVar can infer insertion sequences under 3 base pairs long. For example,

$ transvar panno -i 'AASS:p.I355Mfs*10' --ccds
AASS:p.I355Mfs*10	CCDS5783 (protein_coding)	AASS	-
   chr7:g.121753753_121753754insCC/c.1064_1065insGG/p.I355Mfs*10	cds_in_exon_9
   left_align_cDNA=c.1064_1065insGG;left_align_gDNA=g.121753753_121753754insCC;c
   andidates=g.121753753_121753754insGC/c.1064_1065insGC/g.121753753_121753754in
   sGC/c.1064_1065insGC,g.121753753_121753754insTC/c.1064_1065insGA/g.121753753_
   121753754insTC/c.1064_1065insGA,g.121753754_121753755insCA/c.1064_1065insGT/g
   .121753753_121753754insAC/c.1063_1064insTG;source=CCDS