
Inconsistent gene predictions for Pyrodigal and *fixed* Prodigal #21

Closed · oschwengers opened this issue Oct 23, 2022 · 18 comments
Label: bug (Something isn't working)

@oschwengers

Hi and thanks so much for working on this. A patched, accelerated and potentially multi-threaded Prodigal alternative is very much needed and your effort very much appreciated!

We considered Pyrodigal as a promising alternative to Prodigal within Bakta and Platon. To add an extra layer of confidence, we ran a gene prediction comparison on 49 bacterial genomes. The large majority of genes are predicted identically, but unfortunately some predictions differ slightly and some genes are not predicted by Pyrodigal at all.

We waited to double check everything, but since I saw your issue #19 today, I thought this could maybe help to debug it further. @jhahnfeld conducted the comparison and can provide more info on this.

Here are the results using a locally compiled and patched Prodigal version:

| genome | strand | prodigal_start | prodigal_stop | prodigal_edge | pyrodigal_start | pyrodigal_stop | pyrodigal_edge |
|---|---|---|---|---|---|---|---|
| GCF_000008865 | -1 | 2727810 | 2728103 | False | | | |
| GCF_000009045 | -1 | 955653 | 955817 | False | | | |
| GCF_000020025 | 1 | 1689342 | 1689500 | False | | | |
| GCF_000022565 | -1 | 3550577 | 3553432 | False | | | |
| GCF_000067165 | 1 | 2867204 | 2867602 | False | 2867036 | 2867602 | False |
| GCF_000195955 | 1 | 3960860 | 3961786 | False | 3961052 | 3961786 | False |
| GCF_000195955 | 1 | 2052933 | 2053289 | False | 2052969 | 2053289 | False |
| GCF_000218265 | -1 | 1551162 | 1551524 | False | | | |
| GCF_000222975 | -1 | 1950139 | 1950465 | False | | | |
| GCF_000987925 | -1 | 331455 | 331595 | False | | | |
| GCF_001457455 | 1 | 2243720 | 2243851 | False | | | |
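Such a comparison can be sketched with nothing but the standard library: since Prodigal fixes the stop codon and only the start choice varies, predictions from both tools can be keyed on (strand, stop), so rows differ either in the chosen start or by being absent entirely. This `compare_predictions` helper is hypothetical, not part of the comparison repository:

```python
def compare_predictions(prodigal, pyrodigal):
    """Compare two sets of gene predictions.

    Each input maps (strand, stop) -> start; keying on the stop codon
    pairs up genes whose start choice may differ between the tools.
    Returns (strand, stop, prodigal_start, pyrodigal_start) tuples,
    with pyrodigal_start = None for genes Pyrodigal did not predict.
    """
    mismatches = []
    for key, p_start in sorted(prodigal.items()):
        py_start = pyrodigal.get(key)
        if py_start != p_start:
            mismatches.append((key[0], key[1], p_start, py_start))
    return mismatches


# Toy inputs mirroring two rows of the table above: one start
# mismatch, and one prediction missing from Pyrodigal entirely.
prodigal = {(1, 2867602): 2867204, (-1, 2728103): 2727810}
pyrodigal = {(1, 2867602): 2867036}
for strand, stop, p_start, py_start in compare_predictions(prodigal, pyrodigal):
    print(strand, stop, p_start, py_start)
```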
@althonos
Owner

Hi Oliver, thanks a lot for reaching out, I'm happy to hear you're considering Pyrodigal for super useful tools like these! I've been advocating for it to Bakta for some time 😉

I'm very thankful for the test set and the gene prediction comparison, to be fair this is something I should have done myself earlier. I'm not yet entirely sure whether it's related to #19, because #19 only affects already computed genes by recording the wrong scores for display; it shouldn't change the prediction of the genes themselves. Nevertheless, I'll have a look at these genomes; it could be that I made some mistakes in the v1 refactor while adapting the original code.

Just so that I can reproduce, is this in meta or single mode? I also recently found out that Prodigal was doing some out-of-bound reads of sequence data while training, so I'll see about patching that, but I don't think that this is causing the discrepancies you've been seeing.

@althonos althonos added the bug Something isn't working label Oct 24, 2022
@althonos
Owner

Just found the comparison repository, thanks @jhahnfeld for setting this up!

@jhahnfeld

Hi Martin,

I just wanted to provide you with the link to the comparison repository, but you already found it :)
It now has a results folder with the current differences.

Just so that I can reproduce, is this in meta or single mode?

We used the single mode.

@althonos
Owner

althonos commented Oct 24, 2022

I just tested the comparison using Pyrodigal, but instead of training, I loaded the training file produced by Prodigal: the results matched 100%. So it looks like Pyrodigal has some discrepancies in the training process, I'll have a look ASAP!

@jhahnfeld

I updated the prediction mismatches and removed the '-c' flag since the tested genomes are closed. Now there are some more mismatches and completely missing predictions.
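For reference, the options being toggled here are real Prodigal flags (`-p single`/`-p meta` for the mode, `-c` for closed ends, `-t` for the training file); the file names below are placeholders. A small sketch that assembles the corresponding command line:

```python
def prodigal_args(genome, output, mode="single", closed=False, training_file=None):
    """Assemble a Prodigal command line (flags as listed by `prodigal -h`)."""
    args = ["prodigal", "-i", genome, "-o", output, "-p", mode]
    if closed:
        args.append("-c")  # closed ends: no genes running off contig edges
    if training_file:
        # -t writes the training file on first use, and reuses it if present
        args += ["-t", training_file]
    return args


# Open-ended run, as in the updated comparison (no -c flag):
print(" ".join(prodigal_args("genome.fna", "genes.gbk")))
```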

@zdk123

zdk123 commented Oct 25, 2022

I just want to +1 this effort. Finding a few [minor] differences in our tests as well in meta mode.

@althonos
Owner

@zdk123 : If you have some example sequences with inconsistencies in meta mode, please share them here! I have identified a bug in the training process that I am working on fixing, but meta mode shouldn't be affected, so if you see discrepancies there too, that's another bug I have to fix 😅

@althonos
Owner

althonos commented Oct 26, 2022

Just shipped pre-release 2.0.0-rc.2, which seems to have fixed the issues with closed=True: I'm not seeing any mismatches there anymore. The problem came from a weird behaviour of Prodigal in training mode that I didn't replicate exactly in my port of the score_connection function. (I actually think that's a bug in the original codebase, where a maximum search algorithm doesn't actually record the maximum when in training mode.)
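To illustrate the kind of bug being described (a toy reconstruction, not the actual Prodigal code): a maximum search that updates the best score but, on one branch, forgets to record which element achieved it will silently return a non-maximal result.

```python
def argmax_buggy(scores, training_mode=False):
    """Toy maximum search: in training_mode, the best score is updated
    but the index of the best element is never recorded (the bug)."""
    best_i, best = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best:
            best = score
            if not training_mode:  # bug: index not recorded while training
                best_i = i
    return best_i


scores = [3.0, 1.5, 7.2, 4.0]
print(argmax_buggy(scores))                      # correct: index 2
print(argmax_buggy(scores, training_mode=True))  # stale: index 0
```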

Still seeing some mismatches in closed=False however, so I'll keep chasing bugs 🥲

@zdk123

zdk123 commented Oct 26, 2022

@althonos Actually, I think this may have been the same as issue #19, which v2.0.0-rc.2 seems to have fixed.

@althonos
Owner

Okay, found and patched a new bug (#22), which fixes the remaining issues on closed=False. On my machine the new v2.0.0-rc.3 release passes all comparison tests from @jhahnfeld's repository with both closed=True and closed=False. Please give it a try; if all is well, I'll make a proper v2.0.0 release after adding a bit of documentation and some minor fixes!

@oschwengers
Author

Thanks a lot for this. We'll have a look at this ASAP.
Just out of curiosity: I'm wondering if this update implies larger code changes that would qualify for a major version bump to 2.0.0? IMHO, a minor bump (1.2.0) would be sufficient too, if there are no breaking API changes. This might also help to keep in sync with a potentially upcoming Prodigal 2.0 version.

@althonos
Owner

althonos commented Oct 27, 2022

@oschwengers : There are breaking changes coming from #18 because I'm going to add a mandatory argument to all Genes.write_* methods. In addition I'll also probably change some attributes of TrainingInfo that return copies to return views of the underlying buffers. That's why I'm trying to synchronize the release a bit this time, so that I can put all the API breaking changes in there at once.

In general, choosing a versioning scheme is always complicated for bindings, but I think here I'm just going to let Pyrodigal have its own versioning scheme following semver :) And actually, Prodigal is at version 2.6.3, so this leaves the door open for Prodigal 3.0.

@oschwengers
Author

Ahh, I see. OK, that makes total sense. Thanks!

@jhahnfeld

Hi, thank you for looking into the issues.
I have tested Pyrodigal v2.0.0-rc.3 and still get two mismatches in GCF_000008865 with closed=False. For closed=True, I am now getting new mismatches that were not reported before (I have also added reporting of mismatched stop codons).
As for the potential bug in the score_connection function, it might be good to report it upstream to check if the behavior is intended or not.

@althonos
Owner

@jhahnfeld : I'm not seeing any mismatch in closed=True anymore on my end. Are you deleting the tmp folder between the closed=False and closed=True runs? If you don't, Prodigal will not retrain when the training file already exists, so you end up using the wrong training file. See PR: jhahnfeld/prodigal-pyrodigal-comparison#1
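The pitfall here is ordinary file caching: since Prodigal's `-t` option reuses an existing training file instead of retraining, a file left over from a closed=False run silently leaks into the closed=True run. A stdlib-only sketch of the pattern and the fix (the `train` callable and "model(...)" contents are made up for illustration):

```python
import os
import tempfile

def train_or_reuse(path, train, *args):
    """Mimics Prodigal's -t flag: train only if the file does not
    exist, otherwise silently reuse whatever is already on disk."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(train(*args))
    with open(path) as f:
        return f.read()


tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "training.trn")

first = train_or_reuse(path, lambda closed: f"model(closed={closed})", False)
stale = train_or_reuse(path, lambda closed: f"model(closed={closed})", True)
print(stale == first)  # the second run reused the stale file

os.remove(path)  # the fix: delete the cached file between runs
fresh = train_or_reuse(path, lambda closed: f"model(closed={closed})", True)
print(fresh)
```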

@jhahnfeld

Sorry, I forgot to delete the tmp folder between the runs. 😓
Now there are no more mismatches in the comparison test.
P.S. Thanks for the PR

@oschwengers
Author

Now that we all see the same gene predictions from Prodigal and Pyrodigal, I'll close this issue. Just in case I've missed something, please don't hesitate to re-open it.

Great effort and results! Thanks a lot everyone! Can't wait to use Pyrodigal in our projects...

@althonos
Owner

Thanks a lot for the extensive bug report and the comparison repository!
