-
Notifications
You must be signed in to change notification settings - Fork 33
FAQ
For all mutation types, one can batch process a list of mutation identifiers with optional transcript id to constraint the search. Take SNV for example,
transvar panno -l example/input_table -g 1 -m 5 -t 2 --ensembl -o 2,3,4
As suggested by the command, TransVar takes as input the 1st column as gene and 4th column as identifier. The 2nd column will be used as the transcript id from Ensembl to constrain the alternative identifier search. The 2nd, 3rd and 5th columns are chosen to be output as a validation of TransVar's performance.
Input:
ADAMTSL3 ENST00000286744 15:84442328 c.243G>A p.W81* Nonsense
ADAMTSL3 ENST00000286744 15:84442326 c.241T>C p.W81R Missense
ADAMTSL4 ENST00000369038 1:150530513 c.2270G>A p.G757D Missense
ADCY2 ENST00000338316 5:7802364 c.2662G>A p.V888I Missense
ADCY2 ENST00000338316 5:7802365 c.2663T>C p.V888A Missense
Output:
ENST00000286744|15:84442328|c.243G>A ENST00000286744 (protein_coding) ADAMTSL3 +
chr15:g.84442327G>A/c.242G>A/p.W81* cds_in_exon_4
reference_codon=TGG;candidate_codons=TAA,TAG,TGA;candidate_snv_variants=chr15
:g.84442328G>A;candidate_mnv_variants=chr15:g.84442327_84442328delGGinsAA;mis
sense;aliases=ENSP00000286744;source=Ensembl
ENST00000286744|15:84442326|c.241T>C ENST00000286744 (protein_coding) ADAMTSL3 +
chr15:g.84442326T>A/c.241T>A/p.W81R cds_in_exon_4
reference_codon=TGG;candidate_codons=AGG,AGA,CGA,CGC,CGG,CGT;candidate_snv_va
riants=chr15:g.84442326T>C;candidate_mnv_variants=chr15:g.84442326_84442328de
lTGGinsAGA,chr15:g.84442326_84442328delTGGinsCGA,chr15:g.84442326_84442328del
TGGinsCGC,chr15:g.84442326_84442328delTGGinsCGT;missense;aliases=ENSP00000286
744;source=Ensembl
ENST00000369038|1:150530513|c.2270G>A ENST00000369038 (protein_coding) ADAMTSL4 +
chr1:g.150530513G>A/c.2270G>A/p.G757D cds_in_exon_12
reference_codon=GGT;candidate_codons=GAC,GAT;candidate_mnv_variants=chr1:g.15
0530513_150530514delGTinsAC;missense;aliases=ENSP00000358034;source=Ensembl
ENST00000338316|5:7802364|c.2662G>A ENST00000338316 (protein_coding) ADCY2 +
chr5:g.7802364G>A/c.2662G>A/p.V888I cds_in_exon_21
reference_codon=GTC;candidate_codons=ATC,ATA,ATT;candidate_mnv_variants=chr5:
g.7802364_7802366delGTCinsATA,chr5:g.7802364_7802366delGTCinsATT;missense;ali
ases=ENSP00000342952;source=Ensembl
ENST00000338316|5:7802365|c.2663T>C ENST00000338316 (protein_coding) ADCY2 +
chr5:g.7802365T>C/c.2663T>C/p.V888A cds_in_exon_21
reference_codon=GTC;candidate_codons=GCA,GCC,GCG,GCT;candidate_mnv_variants=c
hr5:g.7802365_7802366delTCinsCA,chr5:g.7802365_7802366delTCinsCG,chr5:g.78023
65_7802366delTCinsCT;missense;aliases=ENSP00000342952;source=Ensembl
TransVar can take VCF as input when annotating from genomic level.
transvar ganno --vcf ALL.wgs.phase1_release_v3.20101123.snps_indel_sv.sites.vcf.gz --ccds
# or
transvar ganno --vcf demo.1kg.vcf --ccds
TransVar performs local alignment to allow long haplotype to be decomposed into multiple mutations.
$ transvar ganno --ccds -i 'chr20:g.645097_645111delinsGTGCGATACCCAGGAG' --haplotype
leads to 2 snv and one insertion
chr20:g.645097_645111delinsGTGCGATACCCAGGAG CCDS13006 (protein_coding) SCRT2 -
chr20:g.645098G>T/c.141C>A/p.A47A inside_[cds_in_exon_2]
CSQN=Synonymous;codon_pos=645098-645099-645100;ref_codon_seq=GCC;source=CCDS
chr20:g.645097_645111delinsGTGCGATACCCAGGAG CCDS13006 (protein_coding) SCRT2 -
chr20:g.645101_645102insA/c.137_138insT/p.A47Rfs*350 inside_[cds_in_exon_2]
CSQN=Frameshift;left_align_gDNA=g.645101_645102insA;unalign_gDNA=g.645101_645
102insA;left_align_cDNA=c.137_138insT;unalign_cDNA=c.137_138insT;source=CCDS
chr20:g.645097_645111delinsGTGCGATACCCAGGAG CCDS13006 (protein_coding) SCRT2 -
chr20:g.645107T>A/c.134-2A>T/. inside_[intron_between_exon_1_and_2]
CSQN=SpliceAcceptorSNV;C2=SpliceAcceptorOfExon1_At_chr20:645106;source=CCDS
TransVar automatically infer whether the input is a 3-letter code or 1-letter code.
The output is default to 1-letter code. But can be switched to 3-letter code through the --aa3
option.
For example,
$ transvar panno --ccds -i 'PIK3CA:p.Glu545Lys' --aa3
PIK3CA:p.Glu545Lys CCDS43171 (protein_coding) PIK3CA +
chr3:g.178936091G>A/c.1633G>A/p.Glu545Lys inside_[cds_in_exon_9]
CSQN=Missense;reference_codon=GAG;candidate_codons=AAG,AAA;candidate_mnv_vari
ants=chr3:g.178936091_178936093delGAGinsAAA;dbsnp=rs104886003(chr3:178936091G
>A);source=CCDS
The option --aacontext 5
output +/- 5bp protein sequence context.
$ transvar ganno -i 'chr17:7577124' --ccds --aacontext 5
chr17:7577124 CCDS11118 (protein_coding) TP53 -
chr17:g.7577124C>/c.814G>/p.V272 inside_[cds_in_exon_7]
is_gene_body;aacontext=RNSFE[V]RVCAC;codon_pos=7577122-7577123-7577124;source
=CCDS
chr17:7577124 CCDS45605 (protein_coding) TP53 -
chr17:g.7577124C>/c.814G>/p.V272 inside_[cds_in_exon_7]
is_gene_body;aacontext=RNSFE[V]RVCAC;codon_pos=7577122-7577123-7577124;source
=CCDS
chr17:7577124 CCDS45606 (protein_coding) TP53 -
chr17:g.7577124C>/c.814G>/p.V272 inside_[cds_in_exon_7]
is_gene_body;aacontext=RNSFE[V]RVCAC;codon_pos=7577122-7577123-7577124;source
=CCDS
shows the protein sequence context in the aacontext tag.
Use --oneline
option. This separates the outputs from each transcript by '|||'.
Most likely you forgot to specify a transcipt definition such as --ccds
or --ensembl
. Sometimes there are non-canonical names for genes, this can be fixed through the --alias
option and specify an alias table. TransVar comes with alias table from UCSC knownGene.
Yes, but only in input. For example, c.508_509CC>TT
$ transvar canno --ccds -i 'A1CF:c.508_509CC>TT'
A1CF:c.508_509CC>TT CCDS7241 (protein_coding) A1CF -
chr10:g.52595929_52595930delinsAA/c.508_509delinsTT/p.P170L inside_[cds_in_exon_4]
CSQN=Missense;codon_cDNA=508-509-510;source=CCDS
A1CF:c.508_509CC>TT CCDS7242 (protein_coding) A1CF -
chr10:g.52595929_52595930delinsAA/c.508_509delinsTT/p.P170L inside_[cds_in_exon_4]
CSQN=Missense;codon_cDNA=508-509-510;source=CCDS
Yes, the 'g.', 'c.' and 'p.' are optional in the input. For example, 12:109702119insACC
is equally acceptable as chr12:g.109702119_109702120insACC
. TransVar also accepts '>' in denoting MNV. E.g., c.113G>TACTAGC
can be used in place of c.113delGinsTACTAGC
. This is common in some database such as COSMIC.
When I annotate a variant for protein identifier, why would I end up getting results in another variant type?
TransVar follows in full the HGVS nomenclature while annotating protein level mutation identifiers. For example, a out-of-phase, in frame insertion, ACIN1:c.1930_1931insATTCAC
will be annotated with p.S643_R644insHS
rather than R644delinsHSR
. Protein level mutation will be generated as if no nucleotide mutation information exists.