missing and inconsistent protein annotation usage #35

Open
git-jemiller opened this issue Jul 5, 2019 · 1 comment
git-jemiller commented Jul 5, 2019

I'm trying to annotate a protein with its genomic coordinates using transvar and for most proteins it works fine, but sometimes nothing is returned except for the header of the output. How should I interpret this result? Or am I doing something wrong?

transvar panno --ensembl --idmap uniprot -i 'W5XKT8'
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

Also, why do some proteins need their isoform to get any output and others do not?

Here's an example:


#returns output
transvar panno -i 'Q6N069-1' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q6N069-1	ENST00000379406 (protein_coding)	NAA16	+	chr13:g.41885341_41951166/c.1_2592/p.M1_I864	whole_transcript	promoter=chr13:41884341_41885341;#exons=20;cds=chr13:41885665_41949735

#no output
transvar panno -i 'Q6N069' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info

#returns output without providing isoform number
transvar panno -i 'Q9H1K6' --uniprot --ensembl
input	transcript	gene	strand	coordinates(gDNA/cDNA/protein)	region	info
Q9H1K6	ENST00000267984 (protein_coding)	MESDC1	+	chr15:g.81293295_81296342/c.1_1086/p.M1_N362	whole_transcript	promoter=chr15:81292295_81293295;#exons=1;cds=chr15:81294613_81295698

Thanks!

Owner

zwdzwd commented Aug 17, 2019

Hi,

Sorry for the late response. TransVar uses the ID mapping published by UniProt. More specifically, it comes from this file:

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping.dat.gz

Therefore, if your identifier isn't linked to any transcript ID in this file, TransVar won't be able to locate a transcript definition. That's what happened with W5XKT8 and Q6N069. The transcript ID from the ID mapping file also has to match a transcript ID in the transcript definition being used. You could also use a customized ID mapping if you know how to map UniProt IDs to transcript IDs (Ensembl, RefSeq, etc.). This is done with

transvar index --idmap <idmapping file> -o <output_idx>

The ID mapping file has two columns: the first is the UniProt ID, the second is the transcript ID.
Once that is done, you could use something like

transvar panno --idmap <output_idx>

as usual.
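
For anyone who wants to try the custom mapping route, here is a rough sketch of how the two-column file could be derived from the UniProt mapping file itself. It rests on several assumptions that are not confirmed in this thread: that `HUMAN_9606_idmapping.dat` is tab-separated with three fields (accession, ID type, mapped ID), that Ensembl transcripts carry the ID type `Ensembl_TRS`, that `transvar index --idmap` accepts a tab-separated file, and that version suffixes such as `.8` may need to be dropped to match the transcript definition. The function names are hypothetical.

```python
def build_idmap_pairs(idmapping_lines, id_type="Ensembl_TRS", strip_version=False):
    """Collect (uniprot_id, transcript_id) pairs from idmapping.dat-style lines.

    Assumes each line has three tab-separated fields:
    accession, ID type, mapped ID. Other lines are skipped.
    """
    pairs = []
    for line in idmapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip anything that does not look like a mapping row
        accession, kind, target = fields
        if kind != id_type:
            continue  # keep only the requested ID type
        if strip_version:
            # Assumption: the transcript definition uses unversioned IDs.
            target = target.split(".", 1)[0]
        pairs.append((accession, target))
    return pairs


def write_idmap(pairs, handle):
    """Write one 'uniprot_id<TAB>transcript_id' row per pair (tab is an assumption)."""
    for accession, transcript in pairs:
        handle.write(f"{accession}\t{transcript}\n")
```

The decompressed mapping file could be read with `gzip.open(path, "rt")` and passed straight in as `idmapping_lines`; the written file would then be the `<idmapping file>` argument above.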

Let me know if you know a better way to map these IDs. Thanks!
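
To see why a bare accession returns only the header while its isoform form works, one option is to list which keys in the mapping file actually carry transcript links. This is only a diagnostic sketch under the same assumptions about the file layout (three tab-separated fields, `Ensembl_TRS` as the transcript ID type); it also assumes isoform-specific rows are keyed as `ACCESSION-N`, which would be consistent with Q6N069-1 resolving and Q6N069 not. The function name is hypothetical.

```python
from collections import defaultdict


def transcript_links(idmapping_lines, accession, id_type="Ensembl_TRS"):
    """Map each key matching `accession` (bare or isoform) to its transcript IDs."""
    hits = defaultdict(list)
    for line in idmapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # not a mapping row
        key, kind, target = fields
        if kind != id_type:
            continue
        # Match the bare accession and any isoform form such as ACCESSION-1.
        if key == accession or key.startswith(accession + "-"):
            hits[key].append(target)
    return dict(hits)
```

An empty result would mean the identifier is not linked to any transcript in the file; a result with only `-N` keys would mean the isoform form has to be passed to `-i`.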
