-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inserting/shifting annotations #42
Comments
To me, this seems too specific. But I don't work with microbial genomes so it's possible I'm missing whole swaths of use-cases. The GenBank -> GFF should be relatively straightforward. Integration with The second request takes some thought and would be a bit of work to make it robust. This problem sort of reminds me of GATK's FastaAlternateReferenceMaker. I had tried to re-implement the idea at one point, but it gets tricky very quickly, especially with multiple overlapping modifications to make. If you could get together the following (maybe as a Gist?):
then I can work on getting it into gffutils proper. |
@daler, thanks for considering my requests! Just as a bit more background: I work in the field of industrial microbiology. It's very common there to combine targeted genetic engineering (like gene deletions, overexpression of heterologous genes, etc.) with mutagenesis and selection. Once we select a strain with improved performance, we usually want to identify the genomic changes that led to the improved phenotype. It's starting to become routine to do this via whole-genome sequencing, which is now affordable (we can now get ca. 80x coverage Illumina MiSeq v3 sequencing for a baker's yeast strain for as little as $250). The required bioinformatics are still a bit daunting, though, especially for groups without dedicated bioinformaticians. Having a reference genome that incorporates all the targeted genetic modifications would make it much more straightforward to verify that the modifications were made correctly, and would allow us to see whether any mutations occurred in these engineered regions during strain selection. I searched for tools that allow you to easily edit the FASTA and GFF files of a genome, but I didn't really find anything. gffutils comes close, though. I created a Gist with an example of the workflow I'm looking for. I couldn't really come up with any tricky modifications. See https://gist.github.com/rmzelle/7eed15b12edd46bf7474#comment-1369934 |
Thanks, the use-case makes a lot more sense now. Your example is very clear, and as long as you only do single insertions or deletions as in your example, it's very straightforward. I don't have time to work on this right now, but I will get to it as soon as I can. |
Great, thanks! I might try to hack something together in the meantime, but I'd love to have this feature supported. Our annotation file currently only contains gene features, so there is rarely any overlap between the deleted feature at the place of the insertion (the gene we replace) and the remaining neighboring features. Although I guess such cases should be addressed eventually to accommodate richer annotations and more diverse types of modifications. |
After thinking about this, I'd prefer to not have the specific use-case be built-in to I do want to make sure The general workflow would be to write iterators for things you want to add and things you want to remove, and then use # -----------------------------------------------------
# Update a database with features from a GenBank file
#
import gffutils.biopython_integration as bpi
def gen():
"yields gffutils.Feature objects from a genbank file"
for rec in Bio.SeqIO.parse('ex.gb', 'genbank'):
for seqfeature in rec.features:
yield bpi.from_seqfeature(seqfeature)
db.update(gen())
# -----------------------------------------------------
# Delete features from within a region
#
db.delete(db.region('scaffold_13:271666-273390'))
# -----------------------------------------------------
# Shift the start/stop coords of features by the deletion size
#
def gen2(features, n):
"yields features with coords shifted by n bp"
for feature in features:
feature.start += n
feature.stop += n
yield feature
# select features that come after the deletion region
features_to_adjust = db.region('scaffold_13:273390-99999999')
# adjust 'em
n = 271666 - 273390
adjusted_features = gen2(features_to_adjust, n)
# update db, replacing existing features with new adjusted ones
db.update(adjusted_features, merge_strategy='replace') |
Thanks, that looks pretty straight-forward. (I'll have to read up on Python generators, though! http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python) |
For Edit: nevermind, I can just do |
The "99999999" isn't very robust. And if I use a value here that's too large, with a few extra 9s, I get:
Is there a way to select all the features that belong to a seqid without specifying a start and end? That way I could just use Alternatively, it would be nice to be able to use an open-ended region (e.g. "scaffold_13:273390-end"). |
FeatureDB.delete seems to work great, but I'm having issues with db.update(). When I try to update my database with new features using:
I get
Sorry to bother you again, but could you have a look whether this works? I'm using the latest version of gffutils from "master". |
As for your question about selecting regions from a chromosome, this seems reasonable and will be a good addition. I'll add a new issue for this. As for the |
See #44 -- does this interface make sense for selecting all features from chromosome? |
See https://gist.github.com/rmzelle/dbb504d2e64030023b2f for the self-contained example. When I run it with gffutils from commit 398b6f5 I get the same error as above. |
Got it, thanks. I'll take a look when I get a chance. |
As for the |
@daler, looks like gists don't support notifications, so please see my comments at: https://gist.github.com/rmzelle/dbb504d2e64030023b2f#comment-1378197 (original minimal example) I managed to get around the issue in the second example (and I amended the gist), but I think that something doesn't work quite right for db.update() in gffutils, or otherwise I'm still confused about how to properly use it. |
OK, there was an issue where merge and replace strategies were being treated similarly (because I had naively assumed that really only attributes would be changing, rather than genomic coordinates as in this case). This is now fixed, so you can now delete the |
Awesome, that seems to work perfectly. Thanks again for the quick support! Maybe time for 0.8.3? :) |
Sure -- maybe play around with them some more to see if they make sense and are working correctly for your? If they seem OK I can do a release. |
I'll let you know if I notice anything off. |
I didn't come across any issues. |
OK, thanks. Just released v0.8.3 on PyPI. I'm closing this for now. If you run into any future issues, please open a separate issue -- it'll be easier for me to keep track of. |
Are new releases supposed to show up immediately on PyPI? It still shows 0.8.2: https://pypi.python.org/pypi/gffutils/ |
Ugh, I pushed the tag to github and uploaded the docs but didn't do the PyPI upload. Been a long week . . . it should be up now. Sorry about that. |
In my field of research, it is very common to remove and/or insert custom DNA sequences in microbial genomes. I would like to make the same changes in silico in the FASTA file of the reference genome, as well as in the accompanying GFF annotation file.
To make this easy, I would love the following features:
Does this fall in the scope of gffutils, or are my requests too esoteric? I guess the simplest feature request would be the ability to easily select and/or shift annotations that start beyond a certain coordinate. (I realize that I can already make such changes by iterating over each feature, but it would still be nice to have such a function)
The text was updated successfully, but these errors were encountered: