Annotation From Protein Level
Protein level inputs are handled by the panno subcommand.
To use a UniProt ID as the protein name, first download the UniProt ID map:
$ transvar config --download_idmap
One can then use a protein ID instead of a gene name by passing the --uniprot option to TransVar. For example,
$ transvar panno --ccds -i 'Q5VUM1:47' --uniprot
Q5VUM1:47 CCDS4972 (protein_coding) C6ORF57 +
chr6:g.71289191_71289193/c.139_141/p.47S cds_in_exon_2
protein_sequence=S;cDNA_sequence=TCC;gDNA_sequence=TCC;source=CCDS
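The last column of the output above is a semicolon-separated list of key=value entries, and other outputs on this page also carry bare flags such as missense or imprecise. A minimal Python sketch for turning that column into a dictionary is given here. The helper name parse_info is hypothetical and not part of TransVar, and the layout is assumed from the examples shown on this page rather than from a format specification.

```python
def parse_info(info):
    """Split a TransVar-style info column into a dict.

    Hypothetical helper, not part of TransVar. Assumes entries are separated
    by ';' and that an entry without '=' is a bare flag (stored as True).
    """
    fields = {}
    for entry in info.strip().split(";"):
        if not entry:
            continue  # tolerate a trailing ';'
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


info = "protein_sequence=S;cDNA_sequence=TCC;gDNA_sequence=TCC;source=CCDS"
print(parse_info(info)["cDNA_sequence"])  # TCC
```

Values that hold lists, such as a comma-separated dbxref entry, are left as plain strings so the caller can decide how to split them.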
TransVar uses a keyword extension ref, as in Q5VUM1:p.47refS, to differentiate a reference-sequence query from the synonymous mutation Q5VUM1:p.47S. The former notation specifies that the reference protein sequence is S, while the latter specifies that the target protein sequence is S.
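To make the distinction concrete, the following Python sketch classifies a site query by whether its trailing residues describe the reference or the target sequence. The regular expression and the function classify_site_query are hypothetical and inferred only from the example inputs on this page; they do not reproduce TransVar's own parser.

```python
import re

# Hypothetical pattern inferred from the examples on this page only:
# NAME:[p.]START[_END][ref][RESIDUES]
SITE_QUERY = re.compile(
    r"^(?P<name>[^:]+):(?:p\.)?(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<ref>ref)?(?P<residues>[A-Za-z*]*)$"
)


def classify_site_query(query):
    """Report whether trailing residues constrain the reference or the target."""
    match = SITE_QUERY.match(query)
    if match is None:
        raise ValueError(f"unrecognized site query: {query!r}")
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    residues = match["residues"] or None
    if residues is None:
        role = None          # position only, e.g. 'Q5VUM1:47'
    elif match["ref"]:
        role = "reference"   # 'p.47refS': the reference residue is S
    else:
        role = "target"      # 'p.47S': the mutated (target) residue is S
    return {"name": match["name"], "start": start, "end": end,
            "residues": residues, "role": role}


print(classify_site_query("Q5VUM1:p.47refS")["role"])  # reference
print(classify_site_query("Q5VUM1:p.47S")["role"])     # target
```

Inputs that name a reference residue before the position, such as PIK3CA:p.E545K, are deliberately outside the scope of this sketch and raise ValueError.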
For example, one can find the genomic location of a DRY motif in protein P28222 by issuing the following command,
$ transvar panno -i 'P28222:p.146_148refDRY' --uniprot --ccds
P28222:p.146_148refDRY CCDS4986 (protein_coding) HTR1B -
chr6:g.78172677_78172685/c.436_444/p.D146_Y148 cds_in_exon_1
protein_sequence=DRY;cDNA_sequence=GACCGCTAC;gDNA_sequence=GTAGCGGTC;source=CCDS
One can also use the wildcard x (lowercase) in the motif. For example,
$ transvar panno -i 'HTR1B:p.365_369refNPxxY' --ccds
HTR1B:p.365_369refNPxxY CCDS4986 (protein_coding) HTR1B -
chr6:g.78172014_78172028/c.1093_1107/p.N365_Y369 cds_in_exon_1
protein_sequence=NPIIY;cDNA_sequence=AAC..TAT;gDNA_sequence=ATA..GTT;source=CCDS
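The wildcard behaviour can be illustrated with a short Python sketch in which a lowercase x matches any residue, so that NPxxY matches the NPIIY reported above. The functions motif_matches and find_motif are hypothetical helpers, and the toy protein string is made up for illustration; TransVar's actual matching code may differ.

```python
def motif_matches(motif, sequence):
    """Return True if `sequence` matches `motif`; lowercase 'x' is a wildcard."""
    if len(motif) != len(sequence):
        return False
    return all(m == "x" or m == s for m, s in zip(motif, sequence))


def find_motif(motif, protein):
    """Yield 1-based (start, end) positions of every motif match in `protein`."""
    width = len(motif)
    for i in range(len(protein) - width + 1):
        if motif_matches(motif, protein[i:i + width]):
            yield i + 1, i + width


print(motif_matches("NPxxY", "NPIIY"))            # True
print(list(find_motif("NPxxY", "AANPIIYGG")))     # [(3, 7)], toy sequence
```

Only the lowercase letter is treated as a wildcard here, so an uppercase X in the motif would have to match a literal X in the sequence.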
$ transvar panno --ccds -i 'ABCB11:p.200_400'
outputs
ABCB11:p.200_400 CCDS46444 (protein_coding) ABCB11 -
chr2:g.169833195_169851872/c.598_1200/p.T200_K400 cds_in_exons_[6,7,8,9,10,11]
protein_sequence=TRF..DRK;cDNA_sequence=ACA..AAA;gDNA_sequence=TTT..TGT;source=CCDS
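The cDNA coordinates in the outputs above are consistent with simple codon arithmetic: residue n occupies coding bases 3(n-1)+1 through 3n. The Python sketch below reproduces that arithmetic. The function name protein_range_to_cdna is hypothetical, and the sketch assumes coding positions are numbered from the first base of the start codon. Mapping to genomic coordinates requires the transcript's exon structure and strand, and is not attempted here.

```python
def protein_range_to_cdna(start, end=None):
    """Map 1-based protein positions to the 1-based cDNA range of their codons."""
    if end is None:
        end = start  # a single residue
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return 3 * (start - 1) + 1, 3 * end


print(protein_range_to_cdna(200, 400))  # (598, 1200), matching c.598_1200
print(protein_range_to_cdna(47))        # (139, 141), matching c.139_141
```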
TransVar accepts mutation formats such as PIK3CA:p.E545K, as well as forms without the reference or alternative amino acid identity, e.g., PIK3CA:p.545K or PIK3CA:p.E545. TransVar takes native HGVS format inputs and outputs. The reference amino acid is used to narrow the search scope of candidate transcripts. The alternative amino acid is used to infer the nucleotide change that results in that amino acid.
$ transvar panno -i PIK3CA:p.E545K --ensembl
outputs
PIK3CA:p.E545K ENST00000263967 (protein_coding) PIK3CA +
chr3:g.178936091G>A/c.1633G>A/p.E545K cds_in_exon_10
reference_codon=GAG;candidate_codons=AAG,AAA;candidate_mnv_variants=chr3:g.178936091_178936093delGAGinsAAA;dbsnp=rs104886003(chr3:178936091G>A);missense;aliases=ENSP00000263967;source=Ensembl
One may encounter ambiguous cases where multiple substitutions can explain the amino acid change. For example,
$ transvar panno -i ACSL4:p.R133R --ccds
ACSL4:p.R133R CCDS14548 (protein_coding) ACSL4 -
chrX:g.108926078G>T/c.399C>A/p.R133R cds_in_exon_2
reference_codon=CGC;candidate_codons=AGG,AGA,CGA,CGG,CGT;candidate_snv_variants=chrX:g.108926078G>C,chrX:g.108926078G>A;candidate_mnv_variants=chrX:g.108926078_108926080delGCGinsCCT,chrX:g.108926078_108926080delGCGinsTCT;synonymous;source=CCDS
In those cases, TransVar prioritizes the candidate base changes by minimizing the edit distance between the reference codon sequence and the target codon sequence. One of the optimal base changes is arbitrarily chosen as the default, and all the candidates are included in the appended CddMuts entry; in the output above, they appear in the candidate_codons, candidate_snv_variants and candidate_mnv_variants fields.
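The prioritization idea can be sketched in Python as follows: enumerate every codon of the standard genetic code that encodes the target amino acid, then rank the codons by how many bases differ from the reference codon. This is an illustration under stated assumptions, not TransVar's implementation: the function rank_candidate_codons is hypothetical, edit distance is taken to be the Hamming distance between equal-length codons, and ties are broken alphabetically, whereas TransVar picks among the optimal changes arbitrarily.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, ...).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_candidate_codons(reference_codon, target_aa):
    """Return (codon, distance) pairs for `target_aa`, closest codon first."""
    candidates = [
        (codon, hamming(reference_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa and codon != reference_codon
    ]
    return sorted(candidates, key=lambda item: (item[1], item[0]))


print(rank_candidate_codons("GAG", "K"))
# [('AAG', 1), ('AAA', 2)]
print(rank_candidate_codons("CGC", "R"))
# [('CGA', 1), ('CGG', 1), ('CGT', 1), ('AGA', 2), ('AGG', 2)]
```

The codons are written on the coding strand. For a gene on the minus strand, such as ACSL4 above, the corresponding genomic change is the reverse complement, which is why c.399C>A is reported as g.108926078G>T.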
$ transvar panno --ccds -i 'AATK:p.P1331_A1332insTP'
AATK:p.P1331_A1332insTP CCDS45807 (protein_coding) AATK -
chr17:g.(79093267ins6)/c.(3997_3991ins6)/p.T1330_P1331dupTP cds_in_exon_13
left_align_protein=p.A1326_P1327insPT;unalign_protein=p.T1330_P1331dupTP;insertion_cDNA=ACACCT;insertion_gDNA=AGGTGT;imprecise;source=CCDS
$ transvar panno --ccds -i 'AADACL4:p.W263_I267delWRDAI'
AADACL4:p.W263_I267delWRDAI CCDS30590 (protein_coding) AADACL4 +
chr1:g.12726309_12726323del/c.787_801del/p.W263_I267delWRDAI inside_[cds_in_exon_4]
left_align_protein=p.W263_I267delWRDAI;unalign_protein=p.W263_I267delWRDAI;imprecise;source=CCDS
$ transvar panno --ccds -i 'ABCC3:p.Y556_V557delinsRRR'
ABCC3:p.Y556_V557delinsRRR CCDS32681 (protein_coding) ABCC3 +
chr17:g.48745254_48745259delinsAGGAGGAGG/c.1666_1671delinsAGGAGGAGG/p.Y556_V557delinsRRR cds_in_exon_13
imprecise;source=CCDS
$ transvar panno --ccds -i 'A1BG:p.G132fs*2'
A1BG:p.G132fs*2 CCDS12976 (protein_coding) A1BG -
chr19:g.58863868delC/c.395delG/p.G132fs*2 cds_in_exon_4
left_align_cDNA=c.394delG;left_align_gDNA=g.58863867delC;candidates=g.58863873delG/c.393delC/g.58863869delG/c.389delC;source=CCDS
TransVar can also take protein identifiers as input. For example,
$ transvar panno --refseq -i 'NP_006266.2:p.G240Afs*50'
NP_006266.2:p.G240Afs*50 NM_006275 (protein_coding) SRSF6 +
chr20:g.42089385delA/c.717delA/p.G240Afs*50 cds_in_exon_6
left_align_cDNA=c.714delA;left_align_gDNA=g.42089382delA;candidates=g.42089387delG/c.719delG/g.42089386delG/c.718delG;dbxref=GeneID:6431,HGNC:10788,HPRD:09054,MIM:601944;aliases=NP_006266;source=RefSeq
The output gives the exact details of the mutation at the DNA level, properly right-aligned. The candidates field also includes other equally likely mutation identifiers. Each hit in candidates has the format [right-align-gDNA]/[right-align-cDNA]/[left-align-gDNA]/[left-align-cDNA], and hits are separated by a comma (,).
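The candidates layout described above can be unpacked with a few lines of Python. The function parse_candidates and the dictionary key names are hypothetical and chosen here for readability; only the slash and comma separators come from the description above.

```python
def parse_candidates(value):
    """Split a `candidates` value into one dict per hit.

    Hypothetical helper: hits are separated by ',' and each hit holds four
    '/'-separated identifiers in the order documented above.
    """
    keys = ("right_align_gDNA", "right_align_cDNA",
            "left_align_gDNA", "left_align_cDNA")
    hits = []
    for hit in value.split(","):
        parts = hit.split("/")
        if len(parts) != len(keys):
            raise ValueError(f"expected 4 '/'-separated identifiers, got {hit!r}")
        hits.append(dict(zip(keys, parts)))
    return hits


# The candidates value from the NP_006266.2 example above.
value = "g.42089387delG/c.719delG/g.42089386delG/c.718delG"
for hit in parse_candidates(value):
    print(hit["right_align_gDNA"], hit["left_align_cDNA"])
# g.42089387delG c.718delG
```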
The same applies when the underlying mutation is an insertion. TransVar can infer insertion sequences fewer than 3 base pairs long. For example,
$ transvar panno -i 'AASS:p.I355Mfs*10' --ccds
AASS:p.I355Mfs*10 CCDS5783 (protein_coding) AASS -
chr7:g.121753753_121753754insCC/c.1064_1065insGG/p.I355Mfs*10 cds_in_exon_9
left_align_cDNA=c.1064_1065insGG;left_align_gDNA=g.121753753_121753754insCC;candidates=g.121753753_121753754insGC/c.1064_1065insGC/g.121753753_121753754insGC/c.1064_1065insGC,g.121753753_121753754insTC/c.1064_1065insGA/g.121753753_121753754insTC/c.1064_1065insGA,g.121753754_121753755insCA/c.1064_1065insGT/g.121753753_121753754insAC/c.1063_1064insTG;source=CCDS