Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only call words_normalized once #72

Merged
merged 17 commits into from
Jan 2, 2024
Merged

only call words_normalized once #72

merged 17 commits into from
Jan 2, 2024

Conversation

maxbachmann
Copy link
Contributor

@maxbachmann maxbachmann commented Aug 28, 2022

words_normalized should only be called once, since it is quite slow, which has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement:
Before:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:15.89 real,	9.61 user,	7.14 sys,	92704 mmem

After:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:12.56 real,	7.88 user,	5.56 sys,	92836 mmem

@mikegerber's task list:

  • Fix tests for this PR
  • Review error rate calculation (might also be faulty in edge cases on master. This produced NaN in test_cli_valid_json, check if it should have been infinity - according to our definition at least)
  • Review comment I made about the docstring
  • Review RapidFuzz dependency. The one we have may be need an update for this. (tests succeed with RapidFuzz >= 3 in a fresh venv)
  • Review the whole PR again
  • Rebase/merge into master

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Aug 28, 2022

I did now move the grapheme cluster calculation into ExtractedText, so it only has to be calculated once for each sequences. Before it was calculated twice for one of the sequences and three times for the other one. This further improves the runtime to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:08.66 real,	6.04 user,	3.51 sys,	96860 mmem

So with this data I achieve nearly a 50% speedup.

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Aug 29, 2022

I replaced uniseg with uniseg2. This is a fork of uniseg I just created, which is still Python only as well, but significantly improves the performance of all functions in uniseg. This improves the runtime of dinglehopper by another 50% and reduces the memory consumption by 7mb (instead of a 7.5mb sqlite table it only loads a less than 500kb of binary strings)

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:04.20 real,	4.06 user,	0.79 sys,	89472 mmem

@maxbachmann
Copy link
Contributor Author

I replaced uniseg with uniseg2.

I am currently in contact with the uniseg author in order to get these improvements into the library.

@maxbachmann
Copy link
Contributor Author

I opened a PR for the original library as well: https://bitbucket.org/emptypage/uniseg-py/pull-requests/1. Not sure when he will come around to review it though.

@maxbachmann
Copy link
Contributor Author

I improved the editops implementation further and added a score_hint, since we already know the edit distance, which improves the runtime further to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:02.47 real,	2.67 user,	0.71 sys,	89060 mmem

@maxbachmann
Copy link
Contributor Author

My performance improvements did now get added to uniseg, so it is not required to use the forked version anymore.

@mikegerber this is now ready for review

@mikegerber
Copy link
Member

Sorry for not reacting earlier, I had some health issues!

@maxbachmann
Copy link
Contributor Author

maxbachmann commented Dec 5, 2022

Sorry for not reacting earlier, I had some health issues!

No hurry :)
I hope you're better now

In PR gh-72, @maxbachmann introduced a new argument for ExtractedText(). Update the
corresponding tests.
@maxbachmann
Copy link
Contributor Author

@mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

@mikegerber
Copy link
Member

@mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

Yes, we had some issues with the namespaces of our packages. You can rebase if you want or I'll do it when I make time. Sorry I didn't merge it yet, I had some (presumably small) issues with it the last time I tried (tests failing). Still want it merged!

@mikegerber
Copy link
Member

Heads up: I'm working on merging this again.

I'm first looking into fixing running the tests using this PR, then I'm rebasing.

mikegerber added a commit that referenced this pull request Oct 23, 2023
We had some issues while reviewing/rebasing #72. We don't support Python 3.5 anymore,
so lifting the hard pin on multimethod 1.3.
@mikegerber
Copy link
Member

Note to self: Should also move on to RapidFuzz 3 if possible.

We had some issues while reviewing/rebasing #72. We don't support Python 3.5 anymore,
so lifting the hard pin on multimethod 1.3.
Newest OCR-D wasn't happy with the test data anymore (see gh-89). I'm not sure if the
test data was invalid the way it was, but having a LOCTYPE certainly is "prettier" so
adding it. This fixes the test again.
@mikegerber
Copy link
Member

mikegerber commented Oct 27, 2023

I'm first looking into fixing running the tests using this PR, then I'm rebasing.

After cherry-picking some commits from master into this PR (to fix errors and warnings we already had fixed), only one test (test_cli_json_cer_is_infinity) fails now. Working on fixing/reviewing that one.

(Not rebased yet.)

@mikegerber
Copy link
Member

Tests run now, but I want to review the edge cases again in the coming days. (error rate x count can be infinite or nan (if count is 0))

Copy link
Member

@mikegerber mikegerber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. I've added a little check and edge case for glyph-level text.

qurator/dinglehopper/extracted_text.py Show resolved Hide resolved
@mikegerber
Copy link
Member

* [ ]  Review RapidFuzz dependency. The one we have may be need an update for this. (tests succeed with RapidFuzz >= 3 in a fresh venv)

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

@mikegerber
Copy link
Member

drum rolls now really ready to merge, everything looks fine. I hope I can do it tomorro, and dive into "This branch has conflicts that must be resolved".

@maxbachmann
Copy link
Contributor Author

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

@mikegerber
Copy link
Member

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.
Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

@maxbachmann I'm being extra pedantic and check it because I've seen other software break because of a missing update of the requirements :)

@mikegerber
Copy link
Member

Ah the uniseg dependendy needs a minimum version now, I'll add it.

@maxbachmann also improved the performance of uniseg, and it is in 0.7.2 - update our
dependency.
@mikegerber
Copy link
Member

Ah the uniseg dependendy needs a minimum version now, I'll add it.

We don't need it but @maxbachmann improved the performance there too and we want the shiny speed.

(I hope the tests run after I manage to rebase this, apparently they don't here because relevant Actions config is not in this branch yet... Relevant because I only test manually on the latest Py 3.11.)

@mikegerber mikegerber merged commit 38fcbc8 into qurator-spk:master Jan 2, 2024
@mikegerber
Copy link
Member

FINALLY! It is merged! Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

🍾

@mikegerber
Copy link
Member

Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

That was just a drained generator, I fixed it in c168155.

@maxbachmann
Copy link
Contributor Author

Wow cool to finally see this in 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants