Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inserting/shifting annotations #42

Closed
rmzelle opened this issue Jan 9, 2015 · 23 comments
Closed

Inserting/shifting annotations #42

rmzelle opened this issue Jan 9, 2015 · 23 comments

Comments

@rmzelle
Copy link
Contributor

rmzelle commented Jan 9, 2015

In my field of research, it is very common to remove and/or insert custom DNA sequences in microbial genomes. I would like to make the same changes in silico in the FASTA file of the reference genome, as well as in the accompanying GFF annotation file.

To make this easy, I would love the following features:

  • conversion of GenBank annotations to the GFF format, since we save our genomic modifications in GenBank format. For this I found http://biopython.org/wiki/GFF_Parsing#Converting_other_formats_to_GFF3 (the author mentions at https://github.com/chapmanb/bcbb/tree/master/gff that he hopes to eventually integrate this code into gffutils).
  • an easy way to introduce annotations from a second GFF file (describing the newly introduced sequence) at a specified location in the first GFF file (describing the genome). For example, I might want to replace a 4 kb gene in GFF file A by a 14 kb DNA fragment described in GFF file B. It would be great if gffutils had functions for: a) deleting annotations that fall in the deleted 4 kb region, b) deleting, splitting, or stretching annotations that partially overlap or span the deleted 4 kb region, c) shifting the coordinates of any annotations downstream of the 4 kb integration site by 10 kb (14 - 4 kb), and d) adding the annotations from GFF file B to GFF file A while updating their coordinates.

Does this fall in the scope of gffutils, or are my requests too esoteric? I guess the simplest feature request would be the ability to easily select and/or shift annotations that start beyond a certain coordinate. (I realize that I can already make such changes by iterating over each feature, but it would still be nice to have such a function)

@daler
Copy link
Owner

daler commented Jan 9, 2015

To me, this seems too specific. But I don't work with microbial genomes so it's possible I'm missing whole swaths of use-cases.

The GenBank -> GFF should be relatively straightforward. Integration with Bio.SeqRecord objects has been on the list for a while (issue #3 (!)); I just never got around to it.

The second request takes some thought and would be a bit of work to make it robust. This problem sort of reminds me of GATK's FastaAlternateReferenceMaker. I had tried to re-implement the idea at one point, but it gets tricky very quickly, especially with multiple overlapping modifications to make.

If you could get together the following (maybe as a Gist?):

  • Some examples. Say, an original file A, original file B, and updated file A with a couple different versions of each that show tricky modifications.
  • Some rough code that outlines the workflow your looking for

then I can work on getting it into gffutils proper.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 10, 2015

@daler, thanks for considering my requests! Just as a bit more background: I work in the field of industrial microbiology. It's very common there to combine targeted genetic engineering (like gene deletions, overexpression of heterologous genes, etc.) with mutagenesis and selection. Once we select a strain with improved performance, we usually want to identify the genomic changes that led to the improved phenotype. It's starting to become routine to do this via whole-genome sequencing, which is now affordable (we can now get ca. 80x coverage Illumina MiSeq v3 sequencing for a baker's yeast strain for as little as $250). The required bioinformatics are still a bit daunting, though, especially for groups without dedicated bioinformaticians. Having a reference genome that incorporates all the targeted genetic modifications would make it much more straightforward to verify that the modifications were made correctly, and would allow us to see whether any mutations occurred in these engineered regions during strain selection.

I searched for tools that allow you to easily edit the FASTA and GFF files of a genome, but I didn't really find anything. gffutils comes close, though.

I created a Gist with an example of the workflow I'm looking for. I couldn't really come up with any tricky modifications. See https://gist.github.com/rmzelle/7eed15b12edd46bf7474#comment-1369934

@daler
Copy link
Owner

daler commented Jan 12, 2015

Thanks, the use-case makes a lot more sense now.

Your example is very clear, and as long as you only do single insertions or deletions as in your example, it's very straightforward.

I don't have time to work on this right now, but I will get to it as soon as I can.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 12, 2015

Great, thanks! I might try to hack something together in the meantime, but I'd love to have this feature supported.

Our annotation file currently only contains gene features, so there is rarely any overlap between the deleted feature at the place of the insertion (the gene we replace) and the remaining neighboring features. Although I guess such cases should be addressed eventually to accommodate richer annotations and more diverse types of modifications.

daler added a commit that referenced this issue Jan 18, 2015
@daler
Copy link
Owner

daler commented Jan 18, 2015

After thinking about this, I'd prefer to not have the specific use-case be built-in to gffutils. It would be better off living in a separate package that simply imports gffutils and has the functionality you're looking for.

I do want to make sure gfftutils has the general-purpose pieces in place to do what you need, so I've added FeatureDB.delete and the biopython_integration module (87c2a37 and f82fd2e) to be able to do things you describe in the gist.

The general workflow would be to write iterators for things you want to add and things you want to remove, and then use FeatureDB.delete and FeatureDB.update with them. Here are some example building blocks:

# -----------------------------------------------------
# Update a database with features from a GenBank file
#
import gffutils.biopython_integration as bpi
def gen():
    "yields gffutils.Feature objects from a genbank file"
    for rec in Bio.SeqIO.parse('ex.gb', 'genbank'):
        for seqfeature in rec.features:
            yield bpi.from_seqfeature(seqfeature)

db.update(gen())

# -----------------------------------------------------
# Delete features from within a region
#
db.delete(db.region('scaffold_13:271666-273390'))


# -----------------------------------------------------
# Shift the start/stop coords of features by the deletion size
#
def gen2(features, n):
    "yields features with coords shifted by n bp"
    for feature in features:
        feature.start += n
        feature.stop += n
        yield feature

# select features that come after the deletion region
features_to_adjust = db.region('scaffold_13:273390-99999999')

# adjust 'em
n = 271666 - 273390
adjusted_features = gen2(features_to_adjust, n)

# update db, replacing existing features with new adjusted ones
db.update(adjusted_features, merge_strategy='replace')

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 19, 2015

Thanks, that looks pretty straight-forward. (I'll have to read up on Python generators, though! http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python)

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 19, 2015

For FeatureDB.delete, would it make sense to add a toggle of whether features that overlap but not entirely fall within the deleted region should be removed as well? (e.g. completely_within=False)

Edit: nevermind, I can just do db.delete(db.region('scaffold_13:271666-273390', completely_within=False))

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 19, 2015

features_to_adjust = db.region('scaffold_13:273390-99999999')

The "99999999" isn't very robust. And if I use a value here that's too large, with a few extra 9s, I get:

gffutils\interface.py", line 500, in region
    c.execute(query, tuple(args))
OperationalError: too many SQL variables

Is there a way to select all the features that belong to a seqid without specifying a start and end? That way I could just use FeatureDB.features_of_type with order_by="end" and check feature.end of the first feature.

Alternatively, it would be nice to be able to use an open-ended region (e.g. "scaffold_13:273390-end").

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 20, 2015

FeatureDB.delete seems to work great, but I'm having issues with db.update().

When I try to update my database with new features using:

    db = gffutils.create_db(genomeGFF, dbfn=genomeGFFdb, force=True, merge_strategy="merge")
    dbInsert = gffutils.create_db(insertGFF, dbfn=insertGFFdb, force=True, merge_strategy="merge")    
    newFeatures = dbInsert.all_features()
    db.update(newFeatures, merge_strategy="replace")

I get

Traceback (most recent call last):
  File ".\update-reference-genome-1.1.py", line 238, in <module>
    updateGenome("genome.gff", "scaffold_4", 867901, 869033, "insert.gff")
  File ".\update-reference-genome-1.1.py", line 137, in updateGenome
    db.update(newFeatures, merge_strategy="replace")
  File "\WinPython-64bit-2.7.9.1\python-2.7.9.amd64\lib\site-packages\gffutils\interface.py", line 679, in update
    db._populate_from_lines(features)
  File "\WinPython-64bit-2.7.9.1\python-2.7.9.amd64\lib\site-packages\gffutils\create.py", line 555, in _populate_from_lines
    raise ValueError("No lines parsed -- was an empty file provided?")
ValueError: No lines parsed -- was an empty file provided?

Sorry to bother you again, but could you have a look whether this works? I'm using the latest version of gffutils from "master".

@daler
Copy link
Owner

daler commented Jan 20, 2015

As for your question about selecting regions from a chromosome, this seems reasonable and will be a good addition. I'll add a new issue for this.

As for the db.update() question, I can take a look if you can send a minimal, self-contained example that reproduces the problem.

@daler
Copy link
Owner

daler commented Jan 20, 2015

See #44 -- does this interface make sense for selecting all features from chromosome?

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 20, 2015

See https://gist.github.com/rmzelle/dbb504d2e64030023b2f for the self-contained example. When I run it with gffutils from commit 398b6f5 I get the same error as above.

@daler
Copy link
Owner

daler commented Jan 21, 2015

Got it, thanks. I'll take a look when I get a chance.

@daler
Copy link
Owner

daler commented Jan 22, 2015

As for the db.update question, see my comments on the gist -- the error you're getting is from providing a generator that's already been consumed.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 23, 2015

@daler, looks like gists don't support notifications, so please see my comments at:

https://gist.github.com/rmzelle/dbb504d2e64030023b2f#comment-1378197 (original minimal example)
https://gist.github.com/rmzelle/22fc9c537ddf8f40de4b (new minimal example)

I managed to get around the issue in the second example (and I amended the gist), but I think that something doesn't work quite right for db.update() in gffutils, or otherwise I'm still confused about how to properly use it.

@daler
Copy link
Owner

daler commented Jan 26, 2015

OK, there was an issue where merge and replace strategies were being treated similarly (because I had naively assumed that really only attributes would be changing, rather than genomic coordinates as in this case).

This is now fixed, so you can now delete the db.delete(db.region(...)) line in your example gist https://gist.github.com/rmzelle/22fc9c537ddf8f40de4b and your example works.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 26, 2015

Awesome, that seems to work perfectly. Thanks again for the quick support! Maybe time for 0.8.3? :)

@daler
Copy link
Owner

daler commented Jan 26, 2015

Sure -- maybe play around with them some more to see if they make sense and are working correctly for your? If they seem OK I can do a release.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 26, 2015

I'll let you know if I notice anything off.

@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 28, 2015

I didn't come across any issues.

@daler
Copy link
Owner

daler commented Jan 28, 2015

OK, thanks. Just released v0.8.3 on PyPI.

I'm closing this for now. If you run into any future issues, please open a separate issue -- it'll be easier for me to keep track of.

@daler daler closed this as completed Jan 28, 2015
@rmzelle
Copy link
Contributor Author

rmzelle commented Jan 29, 2015

Are new releases supposed to show up immediately on PyPI? It still shows 0.8.2: https://pypi.python.org/pypi/gffutils/

@daler
Copy link
Owner

daler commented Jan 29, 2015

Ugh, I pushed the tag to github and uploaded the docs but didn't do the PyPI upload. Been a long week . . . it should be up now. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants