Command-line interface for gffutils #2
I've been thinking about two related GFF functionalities, which I believe reflect very common usage patterns for other users too, but unfortunately no tool I know of has these features. I'm interested in: (1) an efficient library for parsing GFF files under the hood, from Python, which ideally is dependency-free (see more on this below); and (2) a command-line utility for quickly "grep"-ing, slicing, and viewing GFF files.

A bit of background on what I'm trying to achieve and my current issues with GFF3: I generally think that GFF3 is not a good format, so tools like (1) and (2) are needed to make it workable, whether you're parsing GFF files in your code or just using it to store and browse annotations. A primary problem with GFF3 is that even though it's a hierarchical format, there isn't a single identifier running through each gene tree, and this, combined with the fact that the entries are unordered, makes it a pain to work with. Rather than doing multiple passes through the file to collect all the relevant records, I've usually defaulted to an inelegant and inefficient approach of loading all the genes into memory, which is suboptimal and obviously does not scale. (As an example, I include a script of mine below that uses the current library to add introns to a GFF. With the right GFF parsing library, this code could be much improved.)

Related to this, GFF is missing a random-access indexing format, so I wrote my own extremely primitive code (https://github.com/yarden/MISO/blob/fastmiso/misopy/index_gff.py) to at least make it possible to fetch an entire gene "unit" (all of the gene's records) by the ID= of the gene. Since the gene entries of GFF are independent, and since the problem I was working on is embarrassingly parallelizable (each gene can be processed independently on a cluster system), I needed a way to instantly retrieve a parsed gene unit in a way that wouldn't cause deadlock for multiple threads.
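The offset-index idea described above can be sketched in a few lines. This is a hypothetical illustration, not the actual index_gff.py implementation; the function name and the tiny in-memory GFF are made up:

```python
# Hypothetical sketch of a file-offset index for random access by gene ID,
# assuming each gene unit begins with a "gene" record carrying ID=.
import io
import re

def build_offset_index(handle):
    """Map each gene's ID= attribute to the byte offset of its first record."""
    index = {}
    offset = handle.tell()
    for line in iter(handle.readline, ""):
        if not line.startswith("#"):
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9 and fields[2] == "gene":
                m = re.search(r"ID=([^;]+)", fields[8])
                if m:
                    index[m.group(1)] = offset
        offset = handle.tell()
    return index

# Made-up three-record GFF, held in memory for the example.
gff = ("chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
       "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=mRNA1;Parent=gene1\n"
       "chr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=gene2\n")
handle = io.StringIO(gff)
idx = build_offset_index(handle)

# Random access: seek straight to gene2 without rescanning the file.
handle.seek(idx["gene2"])
print(handle.readline().split("\t")[8].strip())  # ID=gene2
```

A real index would be persisted to disk (and would collect all records of a gene unit, not just the first line), but the seek-by-ID pattern is the same.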
I'm generally unhappy with the code and would like to find a library that replaces my current clunky solution. A library that uses C (or Cython) but has a Python wrapper sounds like the best solution to me. I'm hoping that this library could be used in my project (https://github.com/yarden/MISO) instead of its current GFF parser. Related to the library, I'd like to find a tool that allows slicing/viewing/retrieval of GFF3 files. I really liked the way you framed the core tasks in your gffutils README, and I was imagining the same set of features, except in a form also accessible from the command line. If I combine your requirements and mine, I think the core features would be something like:

Feature 1: A "grep"-like feature for gene units:
Ideally, there would be a "limit" feature (related to the BCBio parser) that allows you to "clip" the hierarchy somewhere.
However, this can also be achieved by a feature analogous to grep -v that excludes entries:
Feature 2: Retrieve all exons common to each gene's mRNAs (as you described in your README).
Feature 3: This one is less essential in my view, but it'd be nice to retrieve entries by region coordinates:
In any case, combining Features 1 and 2, you can now easily do things per gene or per set of genes, which I think would be very useful.
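Feature 2 (exons common to all of a gene's mRNAs) reduces to a set intersection over each mRNA's exon coordinates. A minimal sketch with made-up data:

```python
# Illustrative sketch of Feature 2: find exons shared by all of a gene's
# mRNAs by intersecting their (start, end) coordinate sets. The exon data
# here is invented; a real tool would pull it from the parsed GFF.
exons_by_mrna = {
    "mRNA1": {(1, 100), (200, 300), (400, 500)},
    "mRNA2": {(1, 100), (400, 500)},
}
common = set.intersection(*exons_by_mrna.values())
print(sorted(common))  # [(1, 100), (400, 500)]
```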
Feature 4: Trivial but useful, a kind of "sanitize" or clean-up utility (like your current "clean_gff") that checks the GFF for incomplete entries and fixes things like start > end. It would allow convenient piping like this:
In summary, the ideal library would let one programmatically do everything needed to achieve Features 1–4. The one thing that (for my use case at least) would be indispensable in a library, though unrelated to the command line, is the random-access index for GFF. I'd love to be able to run distributed code using the library and get random access to very large GFF files this way. The way I do it now is again pretty inelegant, using index_gff.py. It cannot result in deadlock on a distributed system, as mentioned, but the disadvantage is that it creates a proliferation of files and probably additional I/O overhead. However, the indexing cost is minimal (~10 minutes per large GFF), and I think any user would be happy to pay this small startup cost if it later allows random access by gene and/or coordinate region.

Also, I definitely agree with your idea that universal GFF <-> GTF <-> ... conversions would be very useful. One approach is to pick the format that is internally best (I think that might be GTF, since it might be the most "explicit") and internally convert all inputs to that format. For example, if a GFF file is passed, it can be internally converted to GTF, represented that way, and then optionally serialized as GFF or GTF.

Best, --Yarden

P.S. @chapmanb's BCBio GFF parser (http://bcbio.wordpress.com/2009/03/08/initial-gff-parser-for-biopython/) looks really nice from what I can tell, but the dependency on Biopython makes it unusable for me as a library, since I'd like to package this with MISO and users are already complaining about the various Python dependencies. (In fact, most people have trouble installing scipy, and I'm working to remove that requirement; unfortunate, since scipy is valuable.) I also don't use Biopython for anything else at the moment, so I wouldn't want to tack it on as a dependency just for GFF parsing.
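The internal-conversion idea could look roughly like the following sketch of GFF3-to-GTF attribute translation. The function names and the choice to map ID= to gene_id are assumptions for illustration, not any library's actual API:

```python
# Illustrative sketch of converting GFF3 attribute strings to GTF-style
# attributes, as one step of a universal internal representation.
# Names and the ID -> gene_id mapping are hypothetical.
def gff3_attrs_to_dict(attr_string):
    """Parse 'ID=gene1;Name=foo' into {'ID': 'gene1', 'Name': 'foo'}."""
    pairs = (p.split("=", 1) for p in attr_string.strip(";").split(";") if p)
    return {k: v for k, v in pairs}

def dict_to_gtf_attrs(attrs, gene_key="ID"):
    """Serialize a dict as GTF-style 'gene_id "gene1"; ...' attributes."""
    parts = ['gene_id "%s";' % attrs[gene_key]]
    parts += ['%s "%s";' % (k.lower(), v)
              for k, v in attrs.items() if k != gene_key]
    return " ".join(parts)

d = gff3_attrs_to_dict("ID=gene1;Name=foo")
print(dict_to_gtf_attrs(d))  # gene_id "gene1"; name "foo";
```

A full converter would also have to reconcile the hierarchy (Parent= links vs. GTF's repeated gene_id/transcript_id), which is where most of the real work lies.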
One related question for you with respect to using it. Anyway, I've been playing with gffutils now and think it's really great. If you'd like code contributions, let me know what the best way to proceed is (I can fork it and submit patches that are of potential interest, etc.). Thanks, --Yarden
Yarden and Ryan: My GFF parser exposes Biopython objects, but the internals use a simple dictionary to represent GFF lines, so this could be abstracted away if the dependency is a problem but the code is of interest. As an aside, Biopython is a much more lightweight dependency than SciPy: it has no hard dependencies on other packages. From my point of view it would be great to have a single underlying Python parser and general representation, which could be supplemented with add-ons like indexing and Biopython object creation. Let me know how I can help,
Yarden & Brad: It's great to get you guys involved. I'll try to go through your comments sequentially. Yarden (API): The piping CLI API looks very cool. Most of the functionality you're looking for is already implemented, at least in un-piped workflows. The challenge, I think, is in determining what is piped to other commands. For example, this command:
would presumably create a db if one doesn't already exist and return the results to stdout as a subset of the original GFF. Upon piping, though,
the second command would (I think) have to convert stdout to another temporary db, and then operate on that. Though I suppose whether or not a full db is created will depend on the command. So while the piping business would be very cool, I think it will take some thought to get right. I may open another issue on this specific topic to keep things organized.

Yarden (multiple threads):
Yarden/Brad (dependencies):
Yarden/Brad (contributions): I'll email you each a document that may help orient you to the code and the strategies I ended up using, so that you're not doing a cold read of the source. In the end, I think a hybrid of Brad's parser and the sqlite db backend would make for a nice self-contained library.
Hi Brad and Ryan, I think it'll be great to work on this together, and as Brad said, I'm happy to contribute code/tests/examples, etc. I'll respond to some of Ryan's points below:

API:
The most intuitive behavior for this, in my view, would be printing plain GFF output to stdout. I agree that relying on a db complicates the piping I mentioned, though I think there are features that don't require a db at all, at (typically minor) speed expense. For example, I second Ryan's comment that a combination of Brad's parser and a sqlite db would be ideal. I like sqlite a lot for its portability, its lack of dependencies (it just needs Python), and the fact that people using other languages can parse/query the database (which is not true of pickle or shelve).

dependencies:
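To illustrate the portability point, here is a minimal sketch of a sqlite-backed feature store. The schema and data are hypothetical (not gffutils' actual schema); the key property is that any language with a sqlite driver could issue the same query:

```python
# Minimal sketch of storing GFF records in sqlite and querying by ID.
# Schema and rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE features
                (id TEXT PRIMARY KEY, seqid TEXT, featuretype TEXT,
                 start INT, end INT, attributes TEXT)""")
rows = [
    ("gene1", "chr1", "gene", 1, 100, "ID=gene1;Name=foo"),
    ("mRNA1", "chr1", "mRNA", 1, 100, "ID=mRNA1;Parent=gene1"),
]
conn.executemany("INSERT INTO features VALUES (?,?,?,?,?,?)", rows)

# The same SELECT works from R, Perl, C, etc. via their sqlite bindings,
# which is the portability argument made above.
cur = conn.execute("SELECT seqid, start, end FROM features WHERE id = ?",
                   ("gene1",))
print(cur.fetchone())  # ('chr1', 1, 100)
```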
As of 73d7df0 I added the script. It's set up to work like git/samtools/bedtools, etc., with subcommands. Creation:
Fetching (essentially using the db simply as an indexed GFF):
Up or down the hierarchy:
Full-text search on the attributes field; providing a featuretype dramatically speeds things up over grep:
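A sketch of why supplying a featuretype helps: with an index on the featuretype column, the LIKE scan over attributes only has to touch rows of that type rather than the whole table. The schema and data below are made up for illustration:

```python
# Hypothetical illustration of featuretype-restricted attribute search.
# An index on featuretype lets sqlite narrow the rows before the (slow,
# unindexed) LIKE match runs over the attributes column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT, featuretype TEXT, attributes TEXT)")
conn.execute("CREATE INDEX idx_featuretype ON features (featuretype)")
conn.executemany("INSERT INTO features VALUES (?,?,?)", [
    ("gene1", "gene", "ID=gene1;Name=foo"),
    ("exon1", "exon", "ID=exon1;Parent=mRNA1"),
    ("gene2", "gene", "ID=gene2;Name=bar"),
])

# Full-text-style search restricted to one featuretype:
cur = conn.execute(
    "SELECT id FROM features WHERE featuretype = ? AND attributes LIKE ?",
    ("gene", "%Name=bar%"))
print([r[0] for r in cur])  # ['gene2']
```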
Still working on the subcommands.
@chapmanb pointed me here. This possible, but quite ambitious, Biopython GSoC 2013 idea of mine has some overlap (though what I had in mind was much more general than just GFF, covering other annotation-rich sequence formats as well):
What sort of functionality should be exposed to the CLI?
For example, a universal GTF <-> GFF <-> refFlat conversion script would be useful.