only call words_normalized once #72
I have now moved the grapheme cluster calculation into ExtractedText, so it only has to be calculated once for each sequence. Before, it was calculated twice for one of the sequences and three times for the other. This further improves the runtime to:
So with this data I achieve nearly a 50% speedup.
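A minimal sketch of the "calculate once per sequence" idea, not the actual dinglehopper code: `Text` is a hypothetical stand-in for ExtractedText, and `naive_grapheme_clusters` is a simplified, dependency-free placeholder for the real segmentation done by uniseg. The call counter only shows that repeated access triggers a single computation.

```python
import unicodedata
from functools import cached_property

CALLS = 0  # counts how often the (expensive) segmentation actually runs


def naive_grapheme_clusters(text):
    """Stand-in segmenter: attach combining marks to the preceding character.

    This is NOT full Unicode grapheme segmentation; it only keeps the sketch
    free of third-party dependencies.
    """
    global CALLS
    CALLS += 1
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class Text:
    """Hypothetical stand-in for ExtractedText (assumption, not the real class)."""

    def __init__(self, text):
        self.text = text

    @cached_property
    def grapheme_clusters(self):
        # Computed on first access, then stored on the instance.
        return naive_grapheme_clusters(self.text)


gt = Text("Schlo\u0308ss")  # "o" followed by a combining diaeresis
for _ in range(3):
    clusters = gt.grapheme_clusters  # several consumers, one computation
```

Storing the result on the object means every later consumer (alignment, edit distance, reporting) reuses the same list instead of re-segmenting the text.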
I replaced uniseg with uniseg2. This is a fork of uniseg I just created; it is still pure Python, but it significantly improves the performance of all functions in uniseg. This improves the runtime of dinglehopper by another 50% and reduces memory consumption by 7 MB (instead of a 7.5 MB SQLite table, it only loads less than 500 KB of binary strings).
I am currently in contact with the uniseg author in order to get these improvements into the library.
I opened a PR for the original library as well: https://bitbucket.org/emptypage/uniseg-py/pull-requests/1. Not sure when he will get around to reviewing it, though.
I improved the editops implementation further and added a
My performance improvements have now been added to uniseg, so the forked version is no longer required. @mikegerber this is now ready for review.
Sorry for not reacting earlier, I had some health issues!
No hurry :)
In PR gh-72, @maxbachmann introduced a new argument for ExtractedText(). Update the corresponding tests.
@mikegerber it appears you renamed the files used in this PR. Should I reapply my changes to the new master branch?
Yes, we had some issues with the namespaces of our packages. You can rebase if you want, or I'll do it when I find the time. Sorry I didn't merge it yet, I had some (presumably small) issues with it the last time I tried (tests failing). Still want it merged!
Heads up: I'm working on merging this again. I'm first looking into getting the tests to run with this PR, then I'm rebasing.
We had some issues while reviewing/rebasing #72. We don't support Python 3.5 anymore, so lifting the hard pin on multimethod 1.3.
Note to self: Should also move on to RapidFuzz 3 if possible.
Newest OCR-D wasn't happy with the test data anymore (see gh-89). I'm not sure if the test data was invalid the way it was, but having a LOCTYPE certainly is "prettier" so adding it. This fixes the test again.
After cherry-picking some commits from master into this PR (to fix errors and warnings we had already fixed), only one test (test_cli_json_cer_is_infinity) fails now. Working on fixing/reviewing that one. (Not rebased yet.)
Tests run now, but I want to review the edge cases again in the coming days. (error rate × count can be infinite or NaN (if count is 0))
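The edge cases mentioned here can be reproduced with plain floats. This is a small sketch under assumptions, not the project's code: `error_rate` is a hypothetical helper that treats an empty reference with errors as an infinite rate, and the last lines show why multiplying such a rate by a count of 0 yields NaN.

```python
import math


def error_rate(distance, n):
    """Hypothetical helper: edit distance divided by reference length n."""
    if n == 0:
        # Empty reference: no errors gives 0, any error gives an infinite rate.
        return 0.0 if distance == 0 else math.inf
    return distance / n


rate = error_rate(3, 0)        # infinite: 3 errors against an empty reference
product_nonzero = rate * 5     # infinite rate times a positive count stays infinite
product_zero = rate * 0        # IEEE 754: inf * 0 is NaN
```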
…nite If the CER is infinite, we can't calculate a score_hint as an int. Fall back to None in this case.
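A sketch of the fallback described in this commit message, assuming the hint is the expected distance `er * n`; the signature is an assumption and may differ from the code in the PR. `int()` raises `OverflowError` for infinity and `ValueError` for NaN, so the non-finite case is checked first.

```python
import math
from typing import Optional


def score_hint(er: float, n: int) -> Optional[int]:
    """Expected distance (er * n) as an int, or None if it is not finite."""
    product = er * n
    if not math.isfinite(product):
        # Covers both an infinite product and NaN (inf * 0).
        return None
    return int(product)


finite_hint = score_hint(0.25, 8)
infinite_hint = score_hint(math.inf, 5)
nan_hint = score_hint(math.inf, 0)
```

Returning None lets the caller run the distance computation without a hint instead of failing on the conversion.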
…andling an empty joiner
LGTM now. I've added a little check and edge case for glyph-level text.
2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.
drum rolls now really ready to merge, everything looks fine. I hope I can do it tomorrow, and dive into "This branch has conflicts that must be resolved".
Yes, there shouldn't be any breaking changes that would affect the usage in dinglehopper.
@maxbachmann I'm being extra pedantic and checking it because I've seen other software break because of a missing update of the requirements :)
Ah, the uniseg dependency needs a minimum version now, I'll add it.
@maxbachmann also improved the performance of uniseg, and it is in 0.7.2 - update our dependency.
We don't need it but @maxbachmann improved the performance there too and we want the shiny speed. (I hope the tests run after I manage to rebase this, apparently they don't here because relevant Actions config is not in this branch yet... Relevant because I only test manually on the latest Py 3.11.)
FINALLY! It is merged! Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time. 🍾
Wow cool to finally see this in 🥳
words_normalized should only be called once, since it is quite slow, which has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement:
Before:
After:
@mikegerber's task list: