You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:
# recs is a set of GFF records
gene_obj = make_gene_obj(recs)
# Access the mRNAs, their exons
first_mRNA = gene_obj.mRNAs[0]
print "Exons of mRNA: ", first_mRNA
for exon_num, exon in enumerate(first_mRNA.exons):
print "Exon %d is: " %(exon_num), exon
The reason this is useful is because:
You don't have to keep thinking of parent-child queries for canonical structures like a gene; you can think of things as objects, which is very intuitive. A gene has a list of mRNAs as one of its attributes, and each mRNA has a list of exons (or introns, or CDSS) as its attributes.
The object representation would internally order the features in a predictable way, kind of like GFFWriter does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:
for exon in first_mRNA.exons:
# access exon
...
then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to finish. I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order. Similarly for getting the mRNA records of the gene. It's helpful to know that they're sorted by some canonical order, like longest to shortest, so that gene_obj.mRNAs[0] is always the longest mRNA.
I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do: gene_obj.mRNAs[0].start, gene_obj.mRNAs[0].stop, etc. The mRNA object could even inherit from the GFF feature class.
The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:
for gene_obj in db.get_genes():
# access gene
...
where get_genes() internally does something like:
def get_genes(self, ...):
# Get records belonging to each gene
for gene_recs in db.iter_by_parent_childs(featuretype='gene'):
# make the gene's records into an object in canonical order
# where mRNAs appear in attributes sorted by length
# and where exons appear in attributes sorted by their
# start coordinate
gene_obj = records_to_gene(gene_recs)
yield gene_obj
What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of get_genes() might be fast enough, but I'm not sure. Thanks, --Yarden
The text was updated successfully, but these errors were encountered:
This sounds like a good idea, and pretty simple to implement. Sort of a special-case scenario for when your GFF file has a known, gene-centric format (which is the format you and I work with most, but I'm trying to keep gffutils general).
I believe that if you simply do db.children(mRNA_id) you're not guaranteed to get it in order
Right, but this is actually on purpose -- since there's additional sorting overhead, it's disabled by default for speed. If you want them in order, then you can always use db.children(mRNA_id, order_by='start').
For this gene-specific case, I think it's fine to save all the children as lists in a gene model which will make manipulation as an object more intuitive. As you mentioned, most of the work is done by your GFFWriter class - just put the various mRNAs and exons in the right places in an object rather than printing their string representation.
One unresolved issue though is where to put the get_genes function. I don't think it should be a method on FeatureDB, since pretty much all the other methods return iterators of Feature objects and it could be confusing. Maybe a separate module? genemodels.py, maybe?
A feature that I'd find very useful is parsing of gffutils GFF records into a (lightweight) Gene-like object. Something like:
The reason this is useful is because:
GFFWriter
does for serialization. For example, if you iterate over the exons belonging to an mRNA, as in:then it'd be helpful to know that you're guaranteed to get the exons sorted by their coordinate, so that you're iterating over the mRNA's exons from the start of the mRNA to finish. I believe that if you simply do
db.children(mRNA_id)
you're not guaranteed to get it in order. Similarly for getting the mRNA records of the gene. It's helpful to know that they're sorted by some canonical order, like longest to shortest, so thatgene_obj.mRNAs[0]
is always the longest mRNA.I think the ideal implementation would try to keep the objects as similar to GFF features as possible. For example, you should be able to do:
gene_obj.mRNAs[0].start
,gene_obj.mRNAs[0].stop
, etc. ThemRNA
object could even inherit from the GFF feature class.The way I'm thinking of accessing the objects would be through an iterator that returns gene objects, making them on the fly, something like:
where
get_genes()
internally does something like:What do you think? Is this something you think is worth putting into gffutils? If so, do you have ideas on what would be the most efficient way to do it? The naive sketch of
get_genes()
might be fast enough, but I'm not sure. Thanks, --YardenThe text was updated successfully, but these errors were encountered: