Inserting/shifting annotations #42

rmzelle · 2015-01-09T20:21:45Z

In my field of research, it is very common to remove and/or insert custom DNA sequences in microbial genomes. I would like to make the same changes in silico in the FASTA file of the reference genome, as well as in the accompanying GFF annotation file.

To make this easy, I would love the following features:

conversion of GenBank annotations to the GFF format, since we save our genomic modifications in GenBank format. For this I found http://biopython.org/wiki/GFF_Parsing#Converting_other_formats_to_GFF3 (the author mentions at https://github.com/chapmanb/bcbb/tree/master/gff that he hopes to eventually integrate this code into gffutils).
an easy way to introduce annotations from a second GFF file (describing the newly introduced sequence) at a specified location in the first GFF file (describing the genome). For example, I might want to replace a 4 kb gene in GFF file A by a 14 kb DNA fragment described in GFF file B. It would be great if gffutils had functions for: a) deleting annotations that fall in the deleted 4 kb region, b) deleting, splitting, or stretching annotations that partially overlap or span the deleted 4 kb region, c) shifting the coordinates of any annotations downstream of the 4 kb integration site by 10 kb (14 - 4 kb), and d) adding the annotations from GFF file B to GFF file A while updating their coordinates.

Does this fall in the scope of gffutils, or are my requests too esoteric? I guess the simplest feature request would be the ability to easily select and/or shift annotations that start beyond a certain coordinate. (I realize that I can already make such changes by iterating over each feature, but it would still be nice to have such a function)

daler · 2015-01-09T20:53:14Z

To me, this seems too specific. But I don't work with microbial genomes so it's possible I'm missing whole swaths of use-cases.

The GenBank -> GFF should be relatively straightforward. Integration with Bio.SeqRecord objects has been on the list for a while (issue #3 (!)); I just never got around to it.

The second request takes some thought and would be a bit of work to make it robust. This problem sort of reminds me of GATK's FastaAlternateReferenceMaker. I had tried to re-implement the idea at one point, but it gets tricky very quickly, especially with multiple overlapping modifications to make.

If you could get together the following (maybe as a Gist?):

Some examples. Say, an original file A, original file B, and updated file A with a couple different versions of each that show tricky modifications.
Some rough code that outlines the workflow your looking for

then I can work on getting it into gffutils proper.

rmzelle · 2015-01-10T21:11:05Z

@daler, thanks for considering my requests! Just as a bit more background: I work in the field of industrial microbiology. It's very common there to combine targeted genetic engineering (like gene deletions, overexpression of heterologous genes, etc.) with mutagenesis and selection. Once we select a strain with improved performance, we usually want to identify the genomic changes that led to the improved phenotype. It's starting to become routine to do this via whole-genome sequencing, which is now affordable (we can now get ca. 80x coverage Illumina MiSeq v3 sequencing for a baker's yeast strain for as little as $250). The required bioinformatics are still a bit daunting, though, especially for groups without dedicated bioinformaticians. Having a reference genome that incorporates all the targeted genetic modifications would make it much more straightforward to verify that the modifications were made correctly, and would allow us to see whether any mutations occurred in these engineered regions during strain selection.

I searched for tools that allow you to easily edit the FASTA and GFF files of a genome, but I didn't really find anything. gffutils comes close, though.

I created a Gist with an example of the workflow I'm looking for. I couldn't really come up with any tricky modifications. See https://gist.github.com/rmzelle/7eed15b12edd46bf7474#comment-1369934

daler · 2015-01-12T18:36:41Z

Thanks, the use-case makes a lot more sense now.

Your example is very clear, and as long as you only do single insertions or deletions as in your example, it's very straightforward.

I don't have time to work on this right now, but I will get to it as soon as I can.

rmzelle · 2015-01-12T18:43:09Z

Great, thanks! I might try to hack something together in the meantime, but I'd love to have this feature supported.

Our annotation file currently only contains gene features, so there is rarely any overlap between the deleted feature at the place of the insertion (the gene we replace) and the remaining neighboring features. Although I guess such cases should be addressed eventually to accommodate richer annotations and more diverse types of modifications.

see #42

daler · 2015-01-18T21:45:50Z

After thinking about this, I'd prefer to not have the specific use-case be built-in to gffutils. It would be better off living in a separate package that simply imports gffutils and has the functionality you're looking for.

I do want to make sure gfftutils has the general-purpose pieces in place to do what you need, so I've added FeatureDB.delete and the biopython_integration module (87c2a37 and f82fd2e) to be able to do things you describe in the gist.

The general workflow would be to write iterators for things you want to add and things you want to remove, and then use FeatureDB.delete and FeatureDB.update with them. Here are some example building blocks:

# -----------------------------------------------------
# Update a database with features from a GenBank file
#
import gffutils.biopython_integration as bpi
def gen():
    "yields gffutils.Feature objects from a genbank file"
    for rec in Bio.SeqIO.parse('ex.gb', 'genbank'):
        for seqfeature in rec.features:
            yield bpi.from_seqfeature(seqfeature)

db.update(gen())

# -----------------------------------------------------
# Delete features from within a region
#
db.delete(db.region('scaffold_13:271666-273390'))


# -----------------------------------------------------
# Shift the start/stop coords of features by the deletion size
#
def gen2(features, n):
    "yields features with coords shifted by n bp"
    for feature in features:
        feature.start += n
        feature.stop += n
        yield feature

# select features that come after the deletion region
features_to_adjust = db.region('scaffold_13:273390-99999999')

# adjust 'em
n = 271666 - 273390
adjusted_features = gen2(features_to_adjust, n)

# update db, replacing existing features with new adjusted ones
db.update(adjusted_features, merge_strategy='replace')

rmzelle · 2015-01-19T02:49:32Z

Thanks, that looks pretty straight-forward. (I'll have to read up on Python generators, though! http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python)

rmzelle · 2015-01-19T21:36:38Z

For FeatureDB.delete, would it make sense to add a toggle of whether features that overlap but not entirely fall within the deleted region should be removed as well? (e.g. completely_within=False)

Edit: nevermind, I can just do db.delete(db.region('scaffold_13:271666-273390', completely_within=False))

rmzelle · 2015-01-19T22:17:06Z

features_to_adjust = db.region('scaffold_13:273390-99999999')

The "99999999" isn't very robust. And if I use a value here that's too large, with a few extra 9s, I get:

gffutils\interface.py", line 500, in region
    c.execute(query, tuple(args))
OperationalError: too many SQL variables

Is there a way to select all the features that belong to a seqid without specifying a start and end? That way I could just use FeatureDB.features_of_type with order_by="end" and check feature.end of the first feature.

Alternatively, it would be nice to be able to use an open-ended region (e.g. "scaffold_13:273390-end").

rmzelle · 2015-01-20T19:29:33Z

FeatureDB.delete seems to work great, but I'm having issues with db.update().

When I try to update my database with new features using:

    db = gffutils.create_db(genomeGFF, dbfn=genomeGFFdb, force=True, merge_strategy="merge")
    dbInsert = gffutils.create_db(insertGFF, dbfn=insertGFFdb, force=True, merge_strategy="merge")    
    newFeatures = dbInsert.all_features()
    db.update(newFeatures, merge_strategy="replace")

I get

Traceback (most recent call last):
  File ".\update-reference-genome-1.1.py", line 238, in <module>
    updateGenome("genome.gff", "scaffold_4", 867901, 869033, "insert.gff")
  File ".\update-reference-genome-1.1.py", line 137, in updateGenome
    db.update(newFeatures, merge_strategy="replace")
  File "\WinPython-64bit-2.7.9.1\python-2.7.9.amd64\lib\site-packages\gffutils\interface.py", line 679, in update
    db._populate_from_lines(features)
  File "\WinPython-64bit-2.7.9.1\python-2.7.9.amd64\lib\site-packages\gffutils\create.py", line 555, in _populate_from_lines
    raise ValueError("No lines parsed -- was an empty file provided?")
ValueError: No lines parsed -- was an empty file provided?

Sorry to bother you again, but could you have a look whether this works? I'm using the latest version of gffutils from "master".

daler · 2015-01-20T20:41:52Z

As for your question about selecting regions from a chromosome, this seems reasonable and will be a good addition. I'll add a new issue for this.

As for the db.update() question, I can take a look if you can send a minimal, self-contained example that reproduces the problem.

daler · 2015-01-20T20:48:49Z

See #44 -- does this interface make sense for selecting all features from chromosome?

rmzelle · 2015-01-20T21:40:10Z

See https://gist.github.com/rmzelle/dbb504d2e64030023b2f for the self-contained example. When I run it with gffutils from commit 398b6f5 I get the same error as above.

daler · 2015-01-21T16:30:43Z

Got it, thanks. I'll take a look when I get a chance.

daler · 2015-01-22T15:55:47Z

As for the db.update question, see my comments on the gist -- the error you're getting is from providing a generator that's already been consumed.

rmzelle · 2015-01-23T17:25:31Z

@daler, looks like gists don't support notifications, so please see my comments at:

https://gist.github.com/rmzelle/dbb504d2e64030023b2f#comment-1378197 (original minimal example)
https://gist.github.com/rmzelle/22fc9c537ddf8f40de4b (new minimal example)

I managed to get around the issue in the second example (and I amended the gist), but I think that something doesn't work quite right for db.update() in gffutils, or otherwise I'm still confused about how to properly use it.

daler · 2015-01-26T17:19:02Z

OK, there was an issue where merge and replace strategies were being treated similarly (because I had naively assumed that really only attributes would be changing, rather than genomic coordinates as in this case).

This is now fixed, so you can now delete the db.delete(db.region(...)) line in your example gist https://gist.github.com/rmzelle/22fc9c537ddf8f40de4b and your example works.

rmzelle · 2015-01-26T17:26:04Z

Awesome, that seems to work perfectly. Thanks again for the quick support! Maybe time for 0.8.3? :)

daler · 2015-01-26T17:34:03Z

Sure -- maybe play around with them some more to see if they make sense and are working correctly for your? If they seem OK I can do a release.

rmzelle · 2015-01-26T17:35:06Z

I'll let you know if I notice anything off.

rmzelle · 2015-01-28T18:21:37Z

I didn't come across any issues.

daler · 2015-01-28T19:25:27Z

OK, thanks. Just released v0.8.3 on PyPI.

I'm closing this for now. If you run into any future issues, please open a separate issue -- it'll be easier for me to keep track of.

rmzelle · 2015-01-29T20:14:28Z

Are new releases supposed to show up immediately on PyPI? It still shows 0.8.2: https://pypi.python.org/pypi/gffutils/

daler · 2015-01-29T23:37:22Z

Ugh, I pushed the tag to github and uploaded the docs but didn't do the PyPI upload. Been a long week . . . it should be up now. Sorry about that.

daler added the feature request label Jan 16, 2015

daler added a commit that referenced this issue Jan 18, 2015

initial support for biopython SeqFeatures

f82fd2e

see #42

daler mentioned this issue Jan 20, 2015

more flexible FeatureDB.region #44

Closed

daler closed this as completed Jan 28, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inserting/shifting annotations #42

Inserting/shifting annotations #42

rmzelle commented Jan 9, 2015

daler commented Jan 9, 2015

rmzelle commented Jan 10, 2015

daler commented Jan 12, 2015

rmzelle commented Jan 12, 2015

daler commented Jan 18, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 20, 2015

daler commented Jan 20, 2015

daler commented Jan 20, 2015

rmzelle commented Jan 20, 2015

daler commented Jan 21, 2015

daler commented Jan 22, 2015

rmzelle commented Jan 23, 2015

daler commented Jan 26, 2015

rmzelle commented Jan 26, 2015

daler commented Jan 26, 2015

rmzelle commented Jan 26, 2015

rmzelle commented Jan 28, 2015

daler commented Jan 28, 2015

rmzelle commented Jan 29, 2015

daler commented Jan 29, 2015

Inserting/shifting annotations #42

Inserting/shifting annotations #42

Comments

rmzelle commented Jan 9, 2015

daler commented Jan 9, 2015

rmzelle commented Jan 10, 2015

daler commented Jan 12, 2015

rmzelle commented Jan 12, 2015

daler commented Jan 18, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 19, 2015

rmzelle commented Jan 20, 2015

daler commented Jan 20, 2015

daler commented Jan 20, 2015

rmzelle commented Jan 20, 2015

daler commented Jan 21, 2015

daler commented Jan 22, 2015

rmzelle commented Jan 23, 2015

daler commented Jan 26, 2015

rmzelle commented Jan 26, 2015

daler commented Jan 26, 2015

rmzelle commented Jan 26, 2015

rmzelle commented Jan 28, 2015

daler commented Jan 28, 2015

rmzelle commented Jan 29, 2015

daler commented Jan 29, 2015