Annotation From Protein Level
Protein level inputs are handled by the panno subcommand.
To use a UniProt ID as the protein name, first download the UniProt ID map with
$ transvar config --download_idmap
One can then use a protein ID instead of a gene name by passing the --uniprot option to TransVar. For example,
$ transvar panno --ccds -i 'Q5VUM1:47' --uniprot
Q5VUM1:47 CCDS4972 (protein_coding) C6ORF57 +
chr6:g.71289191_71289193/c.139_141/p.47S inside_[cds_in_exon_2]
protein_sequence=S;cDNA_sequence=TCC;gDNA_sequence=TCC;source=CCDS
TransVar uses the keyword extension ref, as in Q5VUM1:p.47refS, to distinguish a reference sequence query from the synonymous mutation Q5VUM1:p.47S. The former notation specifies that the reference protein sequence is S, while the latter specifies that the target protein sequence is S.
For example, one can find the genomic location of a DRY motif in protein P28222 by issuing the following command,
$ transvar panno -i 'P28222:p.146_148refDRY' --uniprot --ccds
P28222:p.146_148refDRY CCDS4986 (protein_coding) HTR1B -
chr6:g.78172677_78172685/c.436_444/p.D146_Y148 inside_[cds_in_exon_1]
protein_sequence=DRY;cDNA_sequence=GACCGCTAC;gDNA_sequence=GTAGCGGTC;source=CCDS
One can also use the wildcard x (lowercase) in the motif.
$ transvar panno -i 'HTR1B:p.365_369refNPxxY' --ccds
HTR1B:p.365_369refNPxxY CCDS4986 (protein_coding) HTR1B -
chr6:g.78172014_78172028/c.1093_1107/p.N365_Y369 inside_[cds_in_exon_1]
protein_sequence=NPIIY;cDNA_sequence=AAC..TAT;gDNA_sequence=ATA..GTT;source=CCDS
$ transvar panno --ccds -i 'ABCB11:p.200_400'
outputs
ABCB11:p.200_400 CCDS46444 (protein_coding) ABCB11 -
chr2:g.169833195_169851872/c.598_1200/p.T200_K400 inside_[cds_in_exons_[6,7,8,9,10,11]]
protein_sequence=TRF..DRK;cDNA_sequence=ACA..AAA;gDNA_sequence=TTT..TGT;source=CCDS
TransVar accepts mutation identifiers such as PIK3CA:p.E545K, as well as forms that omit the reference or alternative amino acid, e.g., PIK3CA:p.545K or PIK3CA:p.E545. TransVar takes native HGVS format for both inputs and outputs. The reference amino acid is used to narrow the search scope of candidate transcripts. The alternative amino acid is used to infer the nucleotide change that produces it.
$ transvar panno -i PIK3CA:p.E545K --ensembl
outputs
PIK3CA:p.E545K ENST00000263967 (protein_coding) PIK3CA +
chr3:g.178936091G>A/c.1633G>A/p.E545K inside_[cds_in_exon_10]
CSQN=Missense;reference_codon=GAG;candidate_codons=AAG,AAA;candidate_mnv_variants=chr3:g.178936091_178936093delGAGinsAAA;dbsnp=rs104886003(chr3:178936091G>A);aliases=ENSP00000263967;source=Ensembl
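The three accepted single-position forms (with or without the reference and alternative amino acid) can be illustrated with a small parser. This is only a sketch for pre-validating identifiers before passing them to TransVar; the regular expression, `ProteinQuery`, and `parse_protein_query` are hypothetical names and do not reflect TransVar's own parser. Ranges, indels, frameshifts, and the ref keyword are deliberately not covered.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for single-position inputs only:
#   NAME:p.E545K, NAME:p.545K, NAME:p.E545 (the "p." prefix is optional here).
_SINGLE_POSITION = re.compile(
    r"^(?P<name>[^:]+):(?:p\.)?(?P<ref>[A-Z*])?(?P<pos>\d+)(?P<alt>[A-Z*])?$"
)


class ProteinQuery(NamedTuple):
    name: str            # gene name or protein identifier before the colon
    ref: Optional[str]   # reference amino acid, if given
    pos: int             # 1-based protein position
    alt: Optional[str]   # alternative amino acid, if given


def parse_protein_query(text: str) -> ProteinQuery:
    """Split a single-position protein identifier into its parts."""
    match = _SINGLE_POSITION.match(text)
    if match is None:
        raise ValueError(f"not a single-position protein identifier: {text!r}")
    return ProteinQuery(
        match["name"], match["ref"], int(match["pos"]), match["alt"]
    )
```

For PIK3CA:p.545K the `ref` field comes back as `None`, and for PIK3CA:p.E545 the `alt` field does, which mirrors the two optional parts described above.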
One may encounter ambiguous cases where multiple substitutions can explain the amino acid change. For example,
$ transvar panno -i ACSL4:p.R133R --ccds
outputs
ACSL4:p.R133R CCDS14548 (protein_coding) ACSL4 -
chrX:g.108926078G>T/c.399C>A/p.R133R inside_[cds_in_exon_2]
CSQN=Synonymous;reference_codon=CGC;candidate_codons=AGG,AGA,CGA,CGG,CGT;candidate_snv_variants=chrX:g.108926078G>C,chrX:g.108926078G>A;candidate_mnv_variants=chrX:g.108926078_108926080delGCGinsCCT,chrX:g.108926078_108926080delGCGinsTCT;source=CCDS
In those cases, TransVar prioritizes all the candidate base changes by minimizing the edit distance between the reference codon sequence and the target codon sequence. One of the optimal base changes is arbitrarily chosen as the default, and all the candidates are included in the appended candidate fields (candidate_codons, candidate_snv_variants and candidate_mnv_variants in the output above).
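The prioritization described above can be sketched as follows. This is an illustration, not TransVar's code: it assumes the standard genetic code, uses the Hamming distance (number of differing bases) as the edit distance between two codons, excludes the reference codon itself, and breaks ties alphabetically. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_candidate_codons(reference_codon: str, target_aa: str) -> list:
    """Codons encoding target_aa, nearest to the reference codon first.

    Assumption: the reference codon is excluded, so a synonymous query
    still yields an actual base change.
    """
    candidates = [
        codon
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa and codon != reference_codon
    ]
    return sorted(candidates, key=lambda c: (hamming(reference_codon, c), c))
```

For the ACSL4:p.R133R example, `rank_candidate_codons("CGC", "R")` places the three single-base candidates (CGA, CGG, CGT) ahead of the two-base candidates (AGA, AGG), the same five codons listed in candidate_codons.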
TransVar instantiates an ambiguous amino acid code in the input (such as 'B' for "Asx", which stands for "Asp" or "Asn") to a more specific amino acid. Even if the reference amino acid is one of the amino acids covered by the ambiguous alternative code, TransVar assumes a mutation at the nucleotide level (it can still deduce synonymous mutations):
$ transvar panno -i 'APC:p.D326B' --ccds
APC:p.D326B CCDS4107 (protein_coding) APC +
chr5:g.112154705G>A/c.976G>A/p.D326N inside_[cds_in_exon_9]
CSQN=Missense;reference_codon=GAT;candidate_codons=AAC,AAT,GAC;candidate_snv_variants=chr5:g.112154707T>C;candidate_mnv_variants=chr5:g.112154705_112154707delGATinsAAC;source=CCDS
Here the input alternative amino acid is B (D or N). TransVar derives N as the default, although D is equally likely, as shown in the candidates.
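A minimal sketch of how an ambiguous code could be expanded and resolved is given below. It assumes the usual one-letter ambiguity codes (B for D or N, Z for E or Q), lists only the codons needed for those four amino acids, and ranks candidates by the number of differing bases with alphabetical tie-breaking. All names are hypothetical, and the tie-breaking rule is an assumption of this sketch rather than TransVar's documented behavior.

```python
# Ambiguous one-letter codes and the amino acids they stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}

# Codons of the standard genetic code for the amino acids covered above.
CODONS = {
    "D": ["GAC", "GAT"],
    "N": ["AAC", "AAT"],
    "E": ["GAA", "GAG"],
    "Q": ["CAA", "CAG"],
}


def instantiate(reference_codon: str, alt_code: str) -> list:
    """Return (distance, codon, amino_acid) tuples, nearest codon first.

    The reference codon is skipped so that a nucleotide change is always
    implied, even when the reference amino acid is covered by alt_code.
    """
    hits = []
    for aa in AMBIGUOUS.get(alt_code, alt_code):
        for codon in CODONS[aa]:
            if codon != reference_codon:
                distance = sum(x != y for x, y in zip(reference_codon, codon))
                hits.append((distance, codon, aa))
    return sorted(hits)
```

With the reference codon GAT from APC:p.D326B, `instantiate("GAT", "B")` yields AAT (N) and GAC (D) at distance 1 and AAC (N) at distance 2, which matches the three candidate codons in the output and shows why D and N are equally likely.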
$ transvar panno --ccds -i 'AATK:p.P1331_A1332insTP'
AATK:p.P1331_A1332insTP CCDS45807 (protein_coding) AATK -
chr17:g.(79093267ins6)/c.(3997_3991ins6)/p.T1330_P1331dupTP inside_[cds_in_exon_13]
CSQN=InFrameInsertion;left_align_protein=p.A1326_P1327insPT;unalign_protein=p.T1330_P1331dupTP;insertion_cDNA=ACACCT;insertion_gDNA=AGGTGT;imprecise;source=CCDS
$ transvar panno --ccds -i 'AADACL4:p.W263_I267delWRDAI'
AADACL4:p.W263_I267delWRDAI CCDS30590 (protein_coding) AADACL4 +
chr1:g.12726309_12726323del/c.787_801del/p.W263_I267delWRDAI inside_[cds_in_exon_4]
CSQN=InFrameDeletion;left_align_protein=p.W263_I267delWRDAI;unalign_protein=p.W263_I267delWRDAI;imprecise;source=CCDS
$ transvar panno --ccds -i 'ABCC3:p.Y556_V557delinsRRR'
ABCC3:p.Y556_V557delinsRRR CCDS32681 (protein_coding) ABCC3 +
chr17:g.48745254_48745259delinsAGGAGGAGG/c.1666_1671delinsAGGAGGAGG/p.Y556_V557delinsRRR inside_[cds_in_exon_13]
CSQN=MultiAAMissense;imprecise;source=CCDS
$ transvar panno --ccds -i 'A1BG:p.G132fs*2'
A1BG:p.G132fs*2 CCDS12976 (protein_coding) A1BG -
chr19:g.58863868delC/c.395delG/p.G132fs*2 inside_[cds_in_exon_4]
CSQN=Frameshift;left_align_cDNA=c.394delG;left_align_gDNA=g.58863867delC;candidates=g.58863873delG/c.393delC/g.58863869delG/c.389delC;source=CCDS
TransVar can also take protein identifiers as input. For example,
$ transvar panno --refseq -i 'NP_006266.2:p.G240Afs*50'
NP_006266.2:p.G240Afs*50 NM_006275 (protein_coding) SRSF6 +
chr20:g.42089385delA/c.717delA/p.G240Afs*50 inside_[cds_in_exon_6]
CSQN=Frameshift;left_align_cDNA=c.714delA;left_align_gDNA=g.42089382delA;candidates=g.42089387delG/c.719delG/g.42089386delG/c.718delG;dbxref=GeneID:6431,HGNC:10788,HPRD:09054,MIM:601944;aliases=NP_006266;source=RefSeq
The output gives the exact details of the mutation at the DNA level, properly right-aligned. The candidates field also lists other equally likely mutation identifiers. Each hit in candidates has the format [right-align-gDNA]/[right-align-cDNA]/[left-align-gDNA]/[left-align-cDNA], and hits are separated by a comma (,).
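The candidates layout described above is straightforward to split programmatically. The sketch below assumes the info column has been rejoined into a single semicolon-separated string; `CandidateHit`, `get_info`, and `parse_candidates` are hypothetical helper names, not part of TransVar.

```python
from typing import List, NamedTuple, Optional


class CandidateHit(NamedTuple):
    right_gdna: str  # right-aligned genomic identifier
    right_cdna: str  # right-aligned cDNA identifier
    left_gdna: str   # left-aligned genomic identifier
    left_cdna: str   # left-aligned cDNA identifier


def get_info(info: str, key: str) -> Optional[str]:
    """Return the value of key in a ';'-separated key=value string."""
    for item in info.split(";"):
        name, sep, value = item.partition("=")
        if sep and name == key:
            return value
    return None  # key absent, or a bare flag such as "imprecise"


def parse_candidates(value: str) -> List[CandidateHit]:
    """Split a candidates value into hits of four '/'-separated parts."""
    hits = []
    for chunk in value.split(","):
        parts = chunk.split("/")
        if len(parts) != 4:
            raise ValueError(f"expected 4 fields per hit, got: {chunk!r}")
        hits.append(CandidateHit(*parts))
    return hits


info = (
    "CSQN=Frameshift;left_align_cDNA=c.394delG;left_align_gDNA=g.58863867delC;"
    "candidates=g.58863873delG/c.393delC/g.58863869delG/c.389delC;source=CCDS"
)
hits = parse_candidates(get_info(info, "candidates"))
```

Here `hits` holds a single `CandidateHit` for the A1BG:p.G132fs*2 example; an output with several hits simply produces a longer list.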
The same applies when the underlying mutation is an insertion. TransVar can infer inserted sequences shorter than 3 base pairs. For example,
$ transvar panno -i 'AASS:p.I355Mfs*10' --ccds
AASS:p.I355Mfs*10 CCDS5783 (protein_coding) AASS -
chr7:g.121753753_121753754insCC/c.1064_1065insGG/p.I355Mfs*10 inside_[cds_in_exon_9]
CSQN=Frameshift;left_align_cDNA=c.1064_1065insGG;left_align_gDNA=g.121753753_121753754insCC;candidates=g.121753753_121753754insGC/c.1064_1065insGC/g.121753753_121753754insGC/c.1064_1065insGC,g.121753753_121753754insTC/c.1064_1065insGA/g.121753753_121753754insTC/c.1064_1065insGA,g.121753754_121753755insCA/c.1064_1065insGT/g.121753753_121753754insAC/c.1063_1064insTG;source=CCDS
When the alternative amino acid is a stop codon, the frameshift mutation is reported as a nonsense mutation:
$ transvar panno -i 'APC:p.I1557*fs*3' --ccds
returns a nonsense mutation
APC:p.I1557*fs*3 CCDS4107 (protein_coding) APC +
chr5:g.112175960_112175962delATTinsTAA/c.4669_4671delATTinsTAA/p.I1557* inside_[cds_in_exon_15]
CSQN=Nonsense;reference_codon=ATT;candidate_codons=TAA,TAG,TGA;candidate_mnv_variants=chr5:g.112175960_112175962delATTinsTAG,chr5:g.112175960_112175962delATTinsTGA;source=CCDS
TransVar provides an easy way to investigate whole transcripts by supplying the gene ID.
$ transvar panno -i 'Dnmt3a' --refseq
outputs basic information about each transcript encoding the protein,
Dnmt3a XM_005264176 (protein_coding) DNMT3A -
chr2:g.25451421_25537541/c.1_2739/p.M1_*913 whole_transcript
promoter=chr2:25537541_25538541;#exons=23;cds=chr2:25457148_25536853
Dnmt3a XM_005264175 (protein_coding) DNMT3A -
chr2:g.25451421_25537354/c.1_2739/p.M1_*913 whole_transcript
promoter=chr2:25537354_25538354;#exons=23;cds=chr2:25457148_25536853
Dnmt3a XM_005264177 (protein_coding) DNMT3A -
chr2:g.25451421_25475145/c.1_2070/p.M1_*690 whole_transcript
promoter=chr2:25475145_25476145;#exons=18;cds=chr2:25457148_25471091
Dnmt3a NM_175629 (protein_coding) DNMT3A -
chr2:g.25455830_25565459/c.1_2739/p.M1_*913 whole_transcript
promoter=chr2:25565459_25566459;#exons=23;cds=chr2:25457148_25536853
Dnmt3a NM_022552 (protein_coding) DNMT3A -
chr2:g.25455830_25564784/c.1_2739/p.M1_*913 whole_transcript
promoter=chr2:25564784_25565784;#exons=23;cds=chr2:25457148_25536853
Dnmt3a NM_153759 (protein_coding) DNMT3A -
chr2:g.25455830_25475184/c.1_2172/p.M1_*724 whole_transcript
promoter=chr2:25475184_25476184;#exons=19;cds=chr2:25457148_25475066
Dnmt3a NM_175630 (protein_coding) DNMT3A -
chr2:g.25504321_25565459/c.1_501/p.M1_*167 whole_transcript
promoter=chr2:25565459_25566459;#exons=4;cds=chr2:25505257_25536853
An identifier is regarded as an alternative if its underlying codon overlaps with the codon of the original identifier. Example: to search for alternative identifiers of CDKN2A.p.58 (without knowing the reference allele),
$ transvar codonsearch --ccds -i CDKN2A:p.58
origin_id alt_id chrm codon1
codon2 transcripts_choice
CDKN2A:p.58 CDKN2A.p.73 chr9 21971184-21971185-21971186
21971182-21971183-21971184 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.58 CDKN2A.p.72 chr9 21971184-21971185-21971186
21971185-21971186-21971187 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
The pair of transcript IDs listed corresponds to the transcripts on which the original and alternative identifiers are defined. Multiple pairs of transcript definitions are separated by a comma (,).
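The overlap criterion can be checked directly from the codon1 and codon2 columns, which list the three genomic positions of each codon joined by hyphens. The helpers below are a hypothetical sketch and not TransVar's implementation. Positions are compared as sets rather than as a start-to-end range, because the three bases of a codon need not be contiguous on the genome when the codon spans a splice junction.

```python
from typing import List, Tuple


def parse_codon(text: str) -> Tuple[int, ...]:
    """Turn '21971184-21971185-21971186' into a tuple of integer positions."""
    return tuple(int(part) for part in text.split("-"))


def shared_positions(codon1: str, codon2: str) -> List[int]:
    """Genomic positions present in both codons, in ascending order."""
    return sorted(set(parse_codon(codon1)) & set(parse_codon(codon2)))


def codons_overlap(codon1: str, codon2: str) -> bool:
    """True if the two codons share at least one genomic position."""
    return bool(shared_positions(codon1, codon2))
```

For the first CDKN2A row above, the two codons share exactly one position, 21971184, so CDKN2A.p.73 qualifies as an alternative identifier of CDKN2A:p.58.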
Example: to search for alternative identifiers of DHODH:G152R (knowing the reference allele G; the alternative allele is ignored here),
$ transvar codonsearch -i DHODH:G152R --refseq
outputs
origin_id alt_id chrm codon1
codon2 transcripts_choice
DHODH:G152R DHODH.p.G16 chr16 72050942-72050943-72050944
72050942-72050943-72050944 NM_001361[RefSeq]/XM_005255828[RefSeq]
DHODH:G152R DHODH.p.G9 chr16 72050942-72050943-72050944
72050942-72050943-72050944 NM_001361[RefSeq]/XM_005255829[RefSeq]
DHODH:G152R DHODH.p.G124 chr16 72050942-72050943-72050944
72050942-72050943-72050944 NM_001361[RefSeq]/XM_005255827[RefSeq]
TransVar outputs the genomic positions of the codons based on the original transcript (4th column in the output) and the alternative transcript (5th column in the output). The potential transcript usages are also appended.
Example: to run transvar codonsearch to batch process a list of mutation identifiers,
$ transvar codonsearch -l example/input_table2 --ccds -m 1 -o 1
Example input table
origin_id alt_id chrm codon1
codon2 transcripts_choice
CDKN2A:p.61 CDKN2A.p.76 chr9 21971175-21971176-21971177
21971173-21971174-21971175 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.61 CDKN2A.p.75 chr9 21971175-21971176-21971177
21971176-21971177-21971178 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.69 CDKN2A.p.54 chr9 21971194-21971195-21971196
21971196-21971197-21971198 CCDS6511[CCDS]/CCDS6510[CCDS],CCDS6511[CCDS]/CCDS56565[CCDS]
CDKN2A:p.69 CDKN2A.p.55 chr9 21971194-21971195-21971196
21971193-21971194-21971195 CCDS6511[CCDS]/CCDS6510[CCDS],CCDS6511[CCDS]/CCDS56565[CCDS]
CDKN2A:p.69 CDKN2A.p.83 chr9 21971151-21971152-21971153
21971152-21971153-21971154 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.69 CDKN2A.p.84 chr9 21971151-21971152-21971153
21971149-21971150-21971151 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
ERBB2:p.755 ERBB2.p.785 chr17 37881024-37881025-37881026
37881024-37881025-37881026 CCDS45667[CCDS]/CCDS32642[CCDS]
ERBB2:p.755 ERBB2.p.725 chr17 37880219-37880220-37880221
37880219-37880220-37880221 CCDS32642[CCDS]/CCDS45667[CCDS]
outputs
origin_id alt_id chrm codon1
codon2 transcripts_choice
CDKN2A:p.61 CDKN2A.p.76 chr9 21971175-21971176-21971177
21971173-21971174-21971175 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.61 CDKN2A.p.75 chr9 21971175-21971176-21971177
21971176-21971177-21971178 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.69 CDKN2A.p.54 chr9 21971194-21971195-21971196
21971196-21971197-21971198 CCDS6511[CCDS]/CCDS6510[CCDS],CCDS6511[CCDS]/CCDS56565[CCDS]
CDKN2A:p.69 CDKN2A.p.55 chr9 21971194-21971195-21971196
21971193-21971194-21971195 CCDS6511[CCDS]/CCDS6510[CCDS],CCDS6511[CCDS]/CCDS56565[CCDS]
CDKN2A:p.69 CDKN2A.p.83 chr9 21971151-21971152-21971153
21971152-21971153-21971154 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
CDKN2A:p.69 CDKN2A.p.84 chr9 21971151-21971152-21971153
21971149-21971150-21971151 CCDS6510[CCDS]/CCDS6511[CCDS],CCDS56565[CCDS]/CCDS6511[CCDS]
ERBB2:p.755 ERBB2.p.785 chr17 37881024-37881025-37881026
37881024-37881025-37881026 CCDS45667[CCDS]/CCDS32642[CCDS]
ERBB2:p.755 ERBB2.p.725 chr17 37880219-37880220-37880221
37880219-37880220-37880221 CCDS32642[CCDS]/CCDS45667[CCDS]
The last column (transcripts_choice) indicates the potential transcript usage for the alternative identifier. Each transcript usage is denoted by [original transcript]/[alternative transcript]. Different potential choices are separated by ','.
Example: to check whether MET.p.1010 and MET.p.992 may refer to the same mutation due to different transcript usage,
$ transvar codonsearch --refseq -i MET:p.1010
gives
origin_id alt_id chrm codon1
codon2 transcripts_choice
MET:p.1010 MET.p.562 chr7 116411989-116411990-116411991
116411989-116411990-116411991 NM_001127500[RefSeq]/XM_005250354[RefSeq]
MET:p.1010 MET.p.1029 chr7 116411989-116411990-116411991
116411989-116411990-116411991 NM_001127500[RefSeq]/XM_005250353[RefSeq]
MET:p.1010 MET.p.973 chr7 116411932-116411933-116411934
116411932-116411933-116411934 XM_005250353[RefSeq]/NM_000245[RefSeq]
MET:p.1010 MET.p.580 chr7 116412043-116414935-116414936
116412043-116414935-116414936 NM_000245[RefSeq]/XM_005250354[RefSeq]
MET:p.1010 MET.p.991 chr7 116411932-116411933-116411934
116411932-116411933-116411934 XM_005250353[RefSeq]/NM_001127500[RefSeq]
MET:p.1010 MET.p.543 chr7 116411932-116411933-116411934
116411932-116411933-116411934 XM_005250353[RefSeq]/XM_005250354[RefSeq]
MET:p.1010 MET.p.1028 chr7 116412043-116414935-116414936
116412043-116414935-116414936 NM_000245[RefSeq]/NM_001127500[RefSeq]
MET:p.1010 MET.p.992 chr7 116411989-116411990-116411991
116411989-116411990-116411991 NM_001127500[RefSeq]/NM_000245[RefSeq]
MET:p.1010 MET.p.1047 chr7 116412043-116414935-116414936
116412043-116414935-116414936 NM_000245[RefSeq]/XM_005250353[RefSeq]
Since MET.p.992 is in the list, the two identifiers might be due to the same genomic mutation.
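The same check can be automated over codonsearch output. The sketch below assumes each record has been rejoined into one whitespace-separated line with the six columns shown above; `parse_row` and `transcript_pairs_for` are hypothetical helper names.

```python
from typing import Dict, Iterable, List, Tuple


def parse_row(line: str) -> Dict[str, object]:
    """Split one codonsearch record into its six named columns."""
    origin_id, alt_id, chrm, codon1, codon2, choice = line.split()
    # Each choice is "<original transcript>/<alternative transcript>".
    pairs = [tuple(pair.split("/")) for pair in choice.split(",")]
    return {
        "origin_id": origin_id,
        "alt_id": alt_id,
        "chrm": chrm,
        "codon1": codon1,
        "codon2": codon2,
        "transcripts": pairs,
    }


def transcript_pairs_for(lines: Iterable[str], alt_id: str) -> List[Tuple[str, ...]]:
    """Transcript pairs under which alt_id is an alternative; [] if absent."""
    for line in lines:
        row = parse_row(line)
        if row["alt_id"] == alt_id:
            return row["transcripts"]
    return []


rows = [
    "MET:p.1010 MET.p.973 chr7 116411932-116411933-116411934 "
    "116411932-116411933-116411934 XM_005250353[RefSeq]/NM_000245[RefSeq]",
    "MET:p.1010 MET.p.992 chr7 116411989-116411990-116411991 "
    "116411989-116411990-116411991 NM_001127500[RefSeq]/NM_000245[RefSeq]",
]
pairs = transcript_pairs_for(rows, "MET.p.992")
```

A non-empty `pairs` means the two identifiers may describe the same genomic mutation, and the pair itself names the transcripts that would explain the difference in numbering.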