
Inconsistent gene predictions for Pyrodigal and *fixed* Prodigal #21

Closed · oschwengers opened this issue Oct 23, 2022 · 18 comments
Label: bug (Something isn't working)

@oschwengers

Hi and thanks so much for working on this. A patched, accelerated and potentially multi-threaded Prodigal alternative is very much needed and your effort very much appreciated!

We considered Pyrodigal as a promising alternative to Prodigal within Bakta and Platon. To add an extra layer of confidence, we ran a gene prediction comparison on 49 bacterial genomes. The large majority of genes are predicted identically, but unfortunately some predictions differ slightly and some genes are not predicted by Pyrodigal at all.

We waited to double check everything, but since I saw your issue #19 today, I thought this could maybe help to debug it further. @jhahnfeld conducted the comparison and can provide more info on this.

Here are the results using a locally compiled and patched Prodigal version:

| genome | strand | prodigal_start | prodigal_stop | prodigal_edge | pyrodigal_start | pyrodigal_stop | pyrodigal_edge |
|---|---|---|---|---|---|---|---|
| GCF_000008865 | -1 | 2727810 | 2728103 | False | | | |
| GCF_000009045 | -1 | 955653 | 955817 | False | | | |
| GCF_000020025 | 1 | 1689342 | 1689500 | False | | | |
| GCF_000022565 | -1 | 3550577 | 3553432 | False | | | |
| GCF_000067165 | 1 | 2867204 | 2867602 | False | 2867036 | 2867602 | False |
| GCF_000195955 | 1 | 3960860 | 3961786 | False | 3961052 | 3961786 | False |
| GCF_000195955 | 1 | 2052933 | 2053289 | False | 2052969 | 2053289 | False |
| GCF_000218265 | -1 | 1551162 | 1551524 | False | | | |
| GCF_000222975 | -1 | 1950139 | 1950465 | False | | | |
| GCF_000987925 | -1 | 331455 | 331595 | False | | | |
| GCF_001457455 | 1 | 2243720 | 2243851 | False | | | |
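Such a comparison can be sketched with nothing but the standard library: since Prodigal fixes the stop codon and only the start choice varies, predictions from both tools can be keyed on (strand, stop), so rows differ either in the chosen start or by being absent entirely. This `compare_predictions` helper is hypothetical, not part of the comparison repository:

```python
def compare_predictions(prodigal, pyrodigal):
    """Compare two sets of gene predictions.

    Each input maps (strand, stop) -> start; keying on the stop codon
    pairs up genes whose start choice may differ between the tools.
    Returns (strand, stop, prodigal_start, pyrodigal_start) tuples,
    with pyrodigal_start = None for genes Pyrodigal did not predict.
    """
    mismatches = []
    for key, p_start in sorted(prodigal.items()):
        py_start = pyrodigal.get(key)
        if py_start != p_start:
            mismatches.append((key[0], key[1], p_start, py_start))
    return mismatches


# Toy inputs mirroring two rows of the table above: one start
# mismatch, and one prediction missing from Pyrodigal entirely.
prodigal = {(1, 2867602): 2867204, (-1, 2728103): 2727810}
pyrodigal = {(1, 2867602): 2867036}
for strand, stop, p_start, py_start in compare_predictions(prodigal, pyrodigal):
    print(strand, stop, p_start, py_start)
```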
@althonos
Owner

Hi Oliver, thanks a lot for reaching out, I'm happy to hear you're considering Pyrodigal for super useful tools like these! I've been advocating for it to Bakta for some time 😉

I'm very thankful for the test set and the gene prediction comparison, to be fair this is something I should have done myself earlier. I'm not yet entirely sure whether it's related to #19, because #19 only affects already computed genes by recording the wrong scores for display; it shouldn't change the prediction of the genes themselves. Nevertheless, I'll have a look at these genomes; it could be that I made some mistakes in the v1 refactor while adapting the original code.

Just so that I can reproduce, is this in meta or single mode? I also recently found out that Prodigal was doing some out-of-bound reads of sequence data while training, so I'll see about patching that, but I don't think that this is causing the discrepancies you've been seeing.

@althonos althonos added the bug Something isn't working label Oct 24, 2022
@althonos
Owner

Just found the comparison repository, thanks @jhahnfeld for setting this up!

@jhahnfeld

Hi Martin,

I just wanted to provide you with the link to the comparison repository, but you already found it :)
It now has a results folder with the current differences.

Just so that I can reproduce, is this in meta or single mode?

We used the single mode.

@althonos
Owner

althonos commented Oct 24, 2022

I just tested the comparison using Pyrodigal, but instead of training, I loaded the training file produced by Prodigal: the results matched 100%. So it looks like Pyrodigal has some discrepancies in the training process, I'll have a look ASAP!

@jhahnfeld

I updated the prediction mismatches and removed the '-c' flag since the tested genomes are closed. Now there are some more mismatches and completely missing predictions.
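For reference, the options being toggled here are real Prodigal flags (`-p single`/`-p meta` for the mode, `-c` for closed ends, `-t` for the training file); the file names below are placeholders. A small sketch that assembles the corresponding command line:

```python
def prodigal_args(genome, output, mode="single", closed=False, training_file=None):
    """Assemble a Prodigal command line (flags as listed by `prodigal -h`)."""
    args = ["prodigal", "-i", genome, "-o", output, "-p", mode]
    if closed:
        args.append("-c")  # closed ends: no genes running off contig edges
    if training_file:
        # -t writes the training file on first use, and reuses it if present
        args += ["-t", training_file]
    return args


# Open-ended run, as in the updated comparison (no -c flag):
print(" ".join(prodigal_args("genome.fna", "genes.gbk")))
```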

@zdk123

zdk123 commented Oct 25, 2022

I just want to +1 this effort. Finding a few [minor] differences in our tests as well in meta mode.

@althonos
Owner

@zdk123 : If you have some example sequences with inconsistencies in meta mode, please share them here! I have identified a bug in the training process that I am working on fixing, but meta mode shouldn't be affected, so if you see discrepancies there too, that's another bug I have to fix 😅

@althonos
Owner

althonos commented Oct 26, 2022

Just shipped pre-release 2.0.0-rc.2, which seems to have fixed the issues with closed=True: I'm not seeing any mismatches there anymore. The problem came from a weird behaviour of Prodigal in training mode that I didn't replicate exactly in my port of the score_connection function. (I actually think that's a bug in the original codebase, where a maximum search algorithm doesn't actually record the maximum when in training mode.)
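To illustrate the kind of bug being described (a toy reconstruction, not the actual Prodigal code): a maximum search that updates the best score but, on one branch, forgets to record which element achieved it will silently return a non-maximal result.

```python
def argmax_buggy(scores, training_mode=False):
    """Toy maximum search: in training_mode, the best score is updated
    but the index of the best element is never recorded (the bug)."""
    best_i, best = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best:
            best = score
            if not training_mode:  # bug: index not recorded while training
                best_i = i
    return best_i


scores = [3.0, 1.5, 7.2, 4.0]
print(argmax_buggy(scores))                      # correct: index 2
print(argmax_buggy(scores, training_mode=True))  # stale: index 0
```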

Still seeing some mismatches in closed=False however, so I'll keep chasing bugs 🥲

@zdk123

zdk123 commented Oct 26, 2022

@althonos Actually, I think this may have been the same as issue #19, which v2.0.0-rc.2 seems to have fixed.

@althonos
Owner

Okay, found and patched a new bug (#22), which fixes the remaining issues on closed=False. On my machine the new v2.0.0-rc.3 release passes all comparison tests from @jhahnfeld's repository with both closed=True and closed=False. Please give it a try; if all is well, I'll make a proper v2.0.0 release after adding a bit of documentation and some minor fixes!

@oschwengers
Author

Thanks a lot for this. We'll have a look at this ASAP.
Just out of curiosity: I'm wondering if this update implies larger code changes that would qualify for a major version bump to 2.0.0? IMHO, a minor bump (1.2.0) would be sufficient too, if there are no breaking API changes. This might also help to keep in sync with a potentially upcoming Prodigal 2.0 version.

@althonos
Owner

althonos commented Oct 27, 2022

@oschwengers : There are breaking changes coming from #18 because I'm going to add a mandatory argument to all Genes.write_* methods. In addition I'll also probably change some attributes of TrainingInfo that return copies to return views of the underlying buffers. That's why I'm trying to synchronize the release a bit this time, so that I can put all the API breaking changes in there at once.

In general, choosing a versioning scheme is always complicated for bindings, but I think here I'm just going to let Pyrodigal have its own versioning scheme following semver :) And actually, Prodigal is at version 2.6.3, so this leaves the door open for Prodigal 3.0.

@oschwengers
Author

Ahh, I see. OK, that makes total sense. Thanks!

@jhahnfeld

Hi, thank you for looking into the issues.
I have tested Pyrodigal v2.0.0-rc.3 and still get two mismatches in GCF_000008865 with closed=False. For closed=True, I am now getting new mismatches that were not reported before (I have also added reporting of mismatched stop codons).
As for the potential bug in the score_connection function, it might be good to report it upstream to check if the behavior is intended or not.

@althonos
Owner

@jhahnfeld : I'm not seeing any mismatch in closed=True anymore on my end. Are you deleting the tmp folder between the closed=False and closed=True runs? If you don't, Prodigal will not retrain when the training file already exists, so you end up using the wrong training file. See PR: jhahnfeld/prodigal-pyrodigal-comparison#1
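The pitfall here is ordinary file caching: since Prodigal's `-t` option reuses an existing training file instead of retraining, a file left over from a closed=False run silently leaks into the closed=True run. A stdlib-only sketch of the pattern and the fix (the `train` callable and "model(...)" contents are made up for illustration):

```python
import os
import tempfile

def train_or_reuse(path, train, *args):
    """Mimics Prodigal's -t flag: train only if the file does not
    exist, otherwise silently reuse whatever is already on disk."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(train(*args))
    with open(path) as f:
        return f.read()


tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "training.trn")

first = train_or_reuse(path, lambda closed: f"model(closed={closed})", False)
stale = train_or_reuse(path, lambda closed: f"model(closed={closed})", True)
print(stale == first)  # the second run reused the stale file

os.remove(path)  # the fix: delete the cached file between runs
fresh = train_or_reuse(path, lambda closed: f"model(closed={closed})", True)
print(fresh)
```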

@jhahnfeld

Sorry, I forgot to delete the tmp folder between the runs. 😓
Now there are no more mismatches in the comparison test.
P.S. Thanks for the PR

@oschwengers
Author

Now that we all see the same gene predictions from Prodigal and Pyrodigal, I'll close this issue. Just in case I've missed something, please don't hesitate to re-open it.

Great effort and results! Thanks a lot everyone! Can't wait to use Pyrodigal in our projects...

@althonos
Owner

Thanks a lot for the extensive bug report and the comparison repository!
