Graph database? #35

Open
olgabot opened this issue May 10, 2014 · 5 comments

Comments

@olgabot

olgabot commented May 10, 2014

This is not an issue, more of a question. It takes some serious SQL-wrangling to get parent-child or grandparent-child information about gene-transcript-exon relationships. Have you thought about using a graph database for gffutils? One drawback: there doesn't seem to be a file-based, SQLite-like equivalent of Node.js or TitanDB, so you would have to run a server and open up a separate port.

@yarden
Contributor

yarden commented May 11, 2014

It's a pain partly because of the GFF specification. GFFs only encode trees, so full graph support is not needed, but the format is bad at supporting them. To somewhat get around this, I clean up ("sanitize") all my GFFs to have a field that runs through each gene hierarchy (see gff_sanitize and core code here: http://pythonhosted.org/gffutils/autodocs/gffutils.helpers.sanitize_gff_db.html). This makes the files grep-able and easy to query with SQL. It makes the GFF more GTF-like. This only makes sense for canonical gene -> mRNA -> exons hierarchies.

Also, there's some support for iterating over parent-child pairs for canonical hierarchies that might make what you're trying to do easier: iter_by_parent_childs in http://pythonhosted.org/gffutils/autodocs/gffutils.FeatureDB.html
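
As a rough illustration of the "field that runs through each gene hierarchy" idea, here is a minimal pure-Python sketch. It is not the gffutils sanitize code: the function names `parse_attributes` and `add_gene_id` and the `gene_id` tag are hypothetical, and it assumes a GFF3-style `ID=...;Parent=...` attribute column in which every feature's parents all belong to the same gene.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column such as "ID=x;Parent=y" into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


def first_parent(attrs):
    """Return the first listed parent ID, or None for a top-level feature."""
    parent = attrs.get("Parent")
    return parent.split(",")[0] if parent else None


def add_gene_id(gff_lines, tag="gene_id"):
    """Append `tag=<top-level ancestor ID>` to every feature line.

    Hypothetical helper: assumes the hierarchy is a tree when only the
    first parent of each feature is followed.
    """
    records = [line.rstrip("\n").split("\t") for line in gff_lines
               if line.strip() and not line.startswith("#")]

    # Map each feature ID to its (first) parent ID.
    parent_of = {}
    for rec in records:
        attrs = parse_attributes(rec[8])
        if "ID" in attrs:
            parent_of[attrs["ID"]] = first_parent(attrs)

    def root(feature_id):
        # Walk Parent links upward until a feature without a parent is reached.
        while parent_of.get(feature_id) is not None:
            feature_id = parent_of[feature_id]
        return feature_id

    out = []
    for rec in records:
        attrs = parse_attributes(rec[8])
        # Features such as exons often have no ID, so start from their parent.
        start = attrs.get("ID") or first_parent(attrs)
        if start is not None:
            rec[8] = "%s;%s=%s" % (rec[8].rstrip(";"), tag, root(start))
        out.append("\t".join(rec))
    return out
```

With the tag in place, every line of a gene can be pulled out with a plain text search for `gene_id=<id>`, which is the grep-able property described above.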

@daler
Owner

daler commented May 12, 2014

I hadn't heard of graph databases until you brought them up. After reading up on them a little, I'm pretty sure they would provide a substantial performance boost. But I wasn't able to find a file-based implementation either, Python or otherwise. Currently for me, managing a separate graph database and server is too much overhead compared to the almost transparent method of using a file-based database.

As Yarden alluded to, yes, the SQL can be awkward. But ideally, as many manipulations as possible would be hidden from the end user. In previous gffutils iterations, I had tried sqlalchemy to make the SQL a bit more straightforward, but I didn't consider the performance hit of the ORM overhead worth it. I had also tried loading the GFF into a graph structure (I think I had used networkx) and saving a pickle of it for persistence. But loading time turned out to be unacceptable, and the memory usage was another downside. So I went back to using good ol' hand-written queries with sqlite for performance.
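
For readers unfamiliar with what such hand-written queries look like, here is a minimal sketch using only the standard-library `sqlite3` module. The two-table layout (a `features` table plus a `relations` table with `parent`, `child` and `level` columns, where level 1 is a direct child and level 2 a grandchild) is a simplified assumption for illustration, not the exact gffutils schema, and `descendants` is a hypothetical helper.

```python
import sqlite3

# Toy in-memory database; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT);
    CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER);
""")
conn.executemany("INSERT INTO features VALUES (?, ?)", [
    ("g1", "gene"), ("t1", "mRNA"), ("t2", "mRNA"),
    ("e1", "exon"), ("e2", "exon"),
])
conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
    # level 1: direct parent-child pairs
    ("g1", "t1", 1), ("g1", "t2", 1),
    ("t1", "e1", 1), ("t1", "e2", 1), ("t2", "e2", 1),
    # level 2: grandparent-grandchild pairs, stored explicitly
    ("g1", "e1", 2), ("g1", "e2", 2),
])


def descendants(conn, parent_id, level, featuretype):
    """Return IDs of `featuretype` features `level` steps below `parent_id`."""
    rows = conn.execute("""
        SELECT DISTINCT f.id
        FROM relations AS r
        JOIN features AS f ON f.id = r.child
        WHERE r.parent = ? AND r.level = ? AND f.featuretype = ?
        ORDER BY f.id
    """, (parent_id, level, featuretype))
    return [row[0] for row in rows]


print(descendants(conn, "g1", 1, "mRNA"))  # transcripts of the gene
print(descendants(conn, "g1", 2, "exon"))  # exons of the gene (grandchildren)
```

Storing grandparent links as explicit rows trades some disk space for a single indexed lookup instead of a recursive traversal, which is one way a relational table can stand in for a graph when the data are shallow trees.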

Anyway, if I hit upon a use-case that's not already implemented, then I'll typically add a method to FeatureDB. If you have a specific task that's currently awkward/annoying to do in SQL, I'd be happy to add it as a method so others could benefit.

And if you ever find a file-based graph db, please let me know!

@olgabot
Author

olgabot commented Aug 14, 2015

Apparently there's a graph db implemented in Python as a layer over SQLite: https://github.com/eugene-eeo/graphlite

@daler
Owner

daler commented Aug 14, 2015

Thanks, nice find.

So to use this in gffutils it would take some playing around to figure out if 2 databases are needed or if graphlite can work with an existing db (my hunch is the latter based on the docs). Then any of the logic that touches the current relations table would be ported to use graphlite. Then we'd need benchmarks to figure out if there are performance gains that make the additional complexity worth it.

Have you run across cases where gffutils currently doesn't work well or that you think would benefit from a graph db?

@olgabot
Author

olgabot commented Aug 14, 2015

For me specifically, I operate mostly on exons, so getting an exon from a particular location and all its transcripts and CDSs is a lot of what I do. I'll try to come up with a particular example that you can benchmark against. I'm annotating splicing events for my paper right now so this is great timing :)
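
To make that use case concrete as a possible benchmark target, here is a minimal sketch of the lookup in plain `sqlite3`. The schema is a simplified assumption (a `features` table with coordinates and a `relations` table with `parent`, `child`, `level`), the data are made up, and `exon_context` is a hypothetical helper rather than an existing gffutils method.

```python
import sqlite3

# Toy in-memory database; schema and data are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT,
                           seqid TEXT, start INTEGER, "end" INTEGER);
    CREATE TABLE relations (parent TEXT, child TEXT, level INTEGER);
''')
conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?)", [
    ("t1", "mRNA", "chr1", 100, 900),
    ("t2", "mRNA", "chr1", 100, 900),
    ("e1", "exon", "chr1", 100, 200),
    ("e2", "exon", "chr1", 400, 500),
    ("c1", "CDS", "chr1", 150, 200),
    ("c2", "CDS", "chr1", 400, 450),
])
conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
    ("t1", "e1", 1), ("t1", "e2", 1), ("t2", "e2", 1),
    ("t1", "c1", 1), ("t1", "c2", 1), ("t2", "c2", 1),
])


def exon_context(conn, seqid, start, end):
    """Map each exon overlapping seqid:start-end to its transcripts and CDSs."""
    exons = conn.execute('''
        SELECT id FROM features
        WHERE featuretype = 'exon' AND seqid = ?
          AND start <= ? AND "end" >= ?
        ORDER BY id
    ''', (seqid, end, start)).fetchall()

    result = {}
    for (exon_id,) in exons:
        # Direct parents of the exon are its transcripts.
        transcripts = [row[0] for row in conn.execute(
            "SELECT parent FROM relations "
            "WHERE child = ? AND level = 1 ORDER BY parent", (exon_id,))]
        # Go up to each transcript, then back down to its CDS children.
        cds = [row[0] for row in conn.execute('''
            SELECT DISTINCT cds.id
            FROM relations AS up
            JOIN relations AS down
              ON down.parent = up.parent AND down.level = 1
            JOIN features AS cds
              ON cds.id = down.child AND cds.featuretype = 'CDS'
            WHERE up.child = ? AND up.level = 1
            ORDER BY cds.id
        ''', (exon_id,))]
        result[exon_id] = {"transcripts": transcripts, "cds": cds}
    return result


print(exon_context(conn, "chr1", 420, 430))
```

The up-then-down self-join on `relations` is the part a graph database would express as a two-hop traversal, so timing this query on a real annotation would be one way to compare the two approaches.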

