Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lightweight Gene class for gffutils #21

Open
yarden opened this issue May 19, 2013 · 1 comment
Open

Lightweight Gene class for gffutils #21

yarden opened this issue May 19, 2013 · 1 comment

Comments

@yarden
Copy link
Contributor

yarden commented May 19, 2013

A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:

# recs is a set of GFF records
gene_obj = make_gene_obj(recs)
# Access the mRNAs, their exons
first_mRNA = gene_obj.mRNAs[0]
print "Exons of mRNA: ", first_mRNA
for exon_num, exon in enumerate(first_mRNA.exons):
  print "Exon %d is: " %(exon_num), exon 

The reason this is useful is because:

  1. You don't have to keep thinking of parent-child queries for canonical structures like a gene; you can think of things as objects, which is very intuitive. A gene has a list of mRNAs as one of its attributes, and each mRNA has a list of exons (or introns, or CDSS) as its attributes.
  2. The object representation would internally order the features in a predictable way, kind of like GFFWriter does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:
for exon in first_mRNA.exons:
  # access exon
  ...

then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to finish. I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order. Similarly for getting the mRNA records of the gene. It's helpful to know that they're sorted by some canonical order, like longest to shortest, so that gene_obj.mRNAs[0] is always the longest mRNA.

I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do: gene_obj.mRNAs[0].start, gene_obj.mRNAs[0].stop, etc. The mRNA object could even inherit from the GFF feature class.

The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:

for gene_obj in db.get_genes():
  # access gene
  ...

where get_genes() internally does something like:

def get_genes(self, ...):
   # Get records belonging to each gene
   for gene_recs in db.iter_by_parent_childs(featuretype='gene'):
     # make the gene's records into an object in canonical order
     # where mRNAs appear in attributes sorted by length
     # and where exons appear in attributes sorted by their
     # start coordinate
     gene_obj = records_to_gene(gene_recs)
     yield gene_obj

What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of get_genes() might be fast enough, but I'm not sure. Thanks, --Yarden

@daler
Copy link
Owner

daler commented May 19, 2013

This sounds like a good idea, and pretty simple to implement. Sort of a special-case scenario for when your GFF file has a known, gene-centric format (which is the format you and I work with most, but I'm trying to keep gffutils general).

I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order

Right, but this is actually on purpose -- since there's additional sorting overhead, it's disabled by default for speed. If you want them in order, then you can always use db.children(mRNA_id, order_by='start').

For this gene-specific case, I think it's fine to save all the children as lists in a gene model which will make manipulation as an object more intuitive. As you mentioned, most of the work is done by your GFFWriter class - just put the various mRNAs and exons in the right places in an object rather than printing their string representation.

One unresolved issue though is where to put the get_genes function. I don't think it should be a method on FeatureDB, since pretty much all the other methods return iterators of Feature objects and it could be confusing. Maybe a separate module? genemodels.py, maybe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants