Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem in EnsemblDB for gene_ids with lowercase characters and no gene name #59

Open
manulera opened this issue Aug 4, 2023 · 1 comment

Comments

@manulera
Copy link

manulera commented Aug 4, 2023

Hello, first let me say I think transvar is great, and we are planning to use it in PomBase, the model organism database for fission yeast.

I think there is a bug / inconsistency that arises like this. Let's say we have a feature in a GTF file (real example from pombe genome), that has lowercase in the gene_id and does not have a gene name:

II	PomBase	gene	177469	180628	.	-	.	gene_id "SPBC1198.04c"; gene_biotype "protein_coding";

I can create a database and retreive the gene:

db = AnnoDB(args, read_config())
print('in',list(db.get_gene('SPBC1198.04c')))

However, because of this:

transvar/transvar/anno.py

Lines 193 to 195 in f7c17a8

q.tok = q.tok.upper()
genefound = False
for q.gene in db.get_gene(q.tok, args.strictversion):

Or this:

transvar/transvar/anno.py

Lines 153 to 155 in f7c17a8

q.tok = q.tok.upper()
genefound = False
for q.gene in db.get_gene(q.tok, args.strictversion):

It's impossible to use it for a protein annotation such as SPBC1198.04c:p.N3A, because the q.tok is made uppercase and it does not exist in db.

A possible fix for this particular case would be to do gene_id.upper() here, although I am not sure it would break something else.

if 'gene_name' in info:
g.name = info['gene_name'].upper()
else:
g.name = gene_id

Minimal example to show the problem is below, and the files to reproduce:

Then, to set up the transvar config:

# env vars
export TRANSVAR_CFG="$(pwd)/data/transvar.cfg"
export TRANSVAR_DOWNLOAD_DIR="$(pwd)/data/transvar_download"

transvar config -k reference -v "data/pombe_genome.fa" --refversion pombe_genome
transvar index --ensembl data/pombe_genome.gtf --reference data/pombe_genome.fa

Then run the code that gives the error (transvar_main_script below is bin/transvar from which I import the parser functions)

from transvar.annodb import AnnoDB
from transvar.config import read_config
import argparse
from transvar_main_script import parser_add_annotation, parser_add_mutation, parser_add_general
from transvar.anno import main_anno
from functools import partial

parser = argparse.ArgumentParser(description=__doc__)
subparsers = parser.add_subparsers()

p = subparsers.add_parser("panno", help='annotate protein element')
parser_add_annotation(p)
parser_add_mutation(p)
parser_add_general(p)
p.set_defaults(func=partial(main_anno, at='p'))

variant_type = 'panno'
variant_description = 'SPBC1198.04c:p.N3A'
args = parser.parse_args([variant_type, '-i', variant_description, '--ensembl', 'data/pombe_genome.gtf.transvardb', '--reference', 'data/pombe_genome.fa', '-v', '2'])

db = AnnoDB(args, read_config())

print('the gene is found in the db: ',list(db.get_gene('SPBC1198.04c')))

print('the gene is not found in the panno call')
args.func(args)
@manulera
Copy link
Author

manulera commented Aug 4, 2023

That solution is not great though, because then when printing the line, it will print the gene_id in uppercase, which is not ideal.

Instead, one could make get_gene try with and without uppercase?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant