only call `words_normalized` once #72

maxbachmann · 2022-08-28T22:24:31Z

words_normalized should only be called once, since it is quite slow, which has a large effect now that the string matching is faster. On my laptop this achieves the following performance improvement:
Before:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:15.89 real,	9.61 user,	7.14 sys,	92704 mmem

After:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:12.56 real,	7.88 user,	5.56 sys,	92836 mmem

@mikegerber's task list:

* Fix tests for this PR
* Review error rate calculation (might also be faulty in edge cases on master; this produced NaN in test_cli_valid_json, check whether it should have been infinity, according to our definition at least)
* Review comment I made about the docstring
* Review RapidFuzz dependency. The one we have may need an update for this. (tests succeed with RapidFuzz >= 3 in a fresh venv)
* Review the whole PR again
* Rebase/merge into master

maxbachmann · 2022-08-28T23:53:42Z

I have now moved the grapheme cluster calculation into ExtractedText, so it only has to be calculated once for each sequence. Before, it was calculated twice for one of the sequences and three times for the other. This further improves the runtime to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:08.66 real,	6.04 user,	3.51 sys,	96860 mmem

So with this data I achieve nearly a 50% speedup.
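
A minimal sketch of computing the clusters once per text object, under stated assumptions: `CachedText` and `simple_clusters` are hypothetical names, and the segmenter is a rough stand-in that only attaches combining marks to their base character (real grapheme segmentation, as done by uniseg, handles many more cases). The PR itself passes the clusters via a new ExtractedText argument rather than using `cached_property`.

```python
import unicodedata
from functools import cached_property


def simple_clusters(text):
    """Rough stand-in for a grapheme segmenter: attach combining marks to their base."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class CachedText:
    """Hypothetical text holder that segments its text at most once."""

    def __init__(self, text):
        self.text = text

    @cached_property
    def grapheme_clusters(self):
        # Computed on first access, then stored on the instance and reused.
        return simple_clusters(self.text)
```

Every later consumer (character distance, alignment, report) can read `grapheme_clusters` without triggering another segmentation pass.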

maxbachmann · 2022-08-29T20:11:13Z

I replaced uniseg with uniseg2. This is a fork of uniseg I just created. It is still pure Python, but significantly improves the performance of all functions in uniseg. This improves the runtime of dinglehopper by another 50% and reduces the memory consumption by 7 MB (instead of a 7.5 MB sqlite table it only loads less than 500 KB of binary strings).

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:04.20 real,	4.06 user,	0.79 sys,	89472 mmem
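
The percentages quoted so far can be checked against the wall-clock timings above. A small sketch; the variable names are mine, and the numbers are copied from the `real` column of the runs in this thread.

```python
# Wall-clock seconds copied from the `real` column of the runs above.
baseline, once, cached, uniseg2 = 15.89, 12.56, 8.66, 4.20


def reduction(before, after):
    """Relative runtime reduction in percent."""
    return (before - after) / before * 100


first_two_steps = reduction(baseline, cached)  # the "nearly 50%" claim
uniseg_step = reduction(cached, uniseg2)       # the "another 50%" claim
overall_factor = baseline / uniseg2            # total speedup so far
print(f"{first_two_steps:.1f}% {uniseg_step:.1f}% {overall_factor:.2f}x")
```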

maxbachmann · 2022-08-31T21:18:32Z

> I replaced uniseg with uniseg2.

I am currently in contact with the uniseg author in order to get these improvements into the library.

maxbachmann · 2022-09-08T21:25:30Z

I opened a PR for the original library as well: https://bitbucket.org/emptypage/uniseg-py/pull-requests/1. Not sure when he will come around to review it though.

maxbachmann · 2022-09-11T00:39:49Z

I improved the editops implementation further and added a score_hint, since we already know the edit distance, which improves the runtime further to:

[max@localhost dinglehopper]$ /usr/bin/time -f '\t%E real,\t%U user,\t%S sys,\t%M mmem' dinglehopper gt.txt frak2021_0.905_1587027_9141630.txt
	0:02.47 real,	2.67 user,	0.71 sys,	89060 mmem
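
Why a score hint helps can be illustrated with a banded Levenshtein distance. This is a pure-Python sketch and not RapidFuzz's implementation; `bounded_distance` and `distance_with_hint` are hypothetical names. The assumption is that a good hint lets the algorithm restrict the work to a narrow diagonal band, widening it only if the hint turns out to be too small.

```python
def bounded_distance(a, b, max_dist):
    """Levenshtein distance if it is <= max_dist, else None.

    Only cells with |i - j| <= max_dist are evaluated.
    """
    if abs(len(a) - len(b)) > max_dist:
        return None
    inf = max_dist + 1  # any value above the cutoff is treated the same
    prev = [j if j <= max_dist else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo = max(1, i - max_dist)
        hi = min(len(b), i + max_dist)
        cur = [inf] * (len(b) + 1)
        if i <= max_dist:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, inf)
        prev = cur
    d = prev[len(b)]
    return d if d <= max_dist else None


def distance_with_hint(a, b, score_hint=None):
    """Start with a band sized by the hint and double it until the result fits."""
    if score_hint is None:
        k = max(len(a), len(b))  # always wide enough
    else:
        k = max(score_hint, 1)
    while True:
        d = bounded_distance(a, b, k)
        if d is not None:
            return d
        k *= 2
```

When the hint equals the true distance, the first attempt succeeds with the narrowest possible band. A hint that is too small still yields the correct result, at the cost of retries.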

maxbachmann · 2022-10-12T16:54:56Z

My performance improvements have now been added to uniseg, so the forked version is no longer required.

@mikegerber this is now ready for review

mikegerber · 2022-12-05T14:39:52Z

Sorry for not reacting earlier, I had some health issues!

maxbachmann · 2022-12-05T14:42:16Z

> Sorry for not reacting earlier, I had some health issues!

No hurry :)
I hope you're better now


maxbachmann · 2023-04-20T03:10:56Z

@mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

mikegerber · 2023-04-21T08:46:03Z

> @mikegerber it appears you renamed the files used in this PR, should I reapply my changes to the new master branch?

Yes, we had some issues with the namespaces of our packages. You can rebase if you want or I'll do it when I make time. Sorry I didn't merge it yet, I had some (presumably small) issues with it the last time I tried (tests failing). Still want it merged!

mikegerber · 2023-10-11T18:51:27Z

Heads up: I'm working on merging this again.

I'm first looking into fixing running the tests using this PR, then I'm rebasing.


mikegerber · 2023-10-25T12:20:33Z

Note to self: Should also move on to RapidFuzz 3 if possible.


mikegerber · 2023-10-27T17:27:24Z

> I'm first looking into fixing running the tests using this PR, then I'm rebasing.

After cherry-picking some commits from master into this PR (to fix errors and warnings we already had fixed), only one test (test_cli_json_cer_is_infinity) fails now. Working on fixing/reviewing that one.

(Not rebased yet.)

mikegerber · 2023-10-27T19:08:28Z

Tests run now, but I want to review the edge cases again in the coming days. (error rate × count can be infinite, or NaN if count is 0)
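
The edge cases can be sketched like this. The convention used here (0 errors over 0 characters gives 0, any errors over 0 characters gives infinity) is an assumption about "our definition", and `error_rate` and `score_hint` are illustrative helpers rather than the exact functions in the code base.

```python
import math


def error_rate(distance, n):
    """Error rate under the convention assumed here: 0/0 -> 0, d/0 -> inf."""
    if n == 0:
        return 0.0 if distance == 0 else math.inf
    return distance / n


def score_hint(er, n):
    """Expected distance as an int, or None when it cannot be expressed as one."""
    try:
        return int(er * n)
    except (OverflowError, ValueError):
        # int(inf) raises OverflowError; inf * 0 is nan and int(nan) raises ValueError.
        return None
```

Returning None in the infinite case simply means that no hint is available, so the distance computation has to run without one.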


mikegerber

LGTM now. I've added a little check and edge case for glyph-level text.

qurator/dinglehopper/tests/extracted_text_test.py

qurator/dinglehopper/extracted_text.py

mikegerber · 2023-10-31T19:57:39Z

> * [ ] Review RapidFuzz dependency. The one we have may need an update for this. (tests succeed with RapidFuzz >= 3 in a fresh venv)

2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

mikegerber · 2023-10-31T19:59:09Z

*drum rolls* Now really ready to merge, everything looks fine. I hope I can do it tomorrow and dive into "This branch has conflicts that must be resolved".

maxbachmann · 2023-10-31T20:02:54Z

> 2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.

Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

mikegerber · 2023-10-31T20:08:43Z

> > 2.7.0 (the minimum in this PR) runs fine, and 3.4.0 (current version) runs fine, too.
>
> Yes there shouldn't be any breaking changes that would affect the usage in dinglehopper

@maxbachmann I'm being extra pedantic and checking it because I've seen other software break because of a missing update of the requirements :)

mikegerber · 2023-11-01T12:46:11Z

Ah, the uniseg dependency needs a minimum version now, I'll add it.


mikegerber · 2023-11-01T12:53:34Z

> Ah, the uniseg dependency needs a minimum version now, I'll add it.

We don't need it but @maxbachmann improved the performance there too and we want the shiny speed.

(I hope the tests run after I manage to rebase this, apparently they don't here because relevant Actions config is not in this branch yet... Relevant because I only test manually on the latest Py 3.11.)

mikegerber · 2024-01-02T19:25:30Z

FINALLY! It is merged! Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

🍾

mikegerber · 2024-01-03T18:22:56Z

> Something related to PR #83 broke, but I'll fix that soon, wanted to get this in after all this time.

That was just a drained generator, I fixed it in c168155.
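
For readers unfamiliar with the term, a "drained" generator is one that has already been iterated to the end, so a second pass yields nothing. A generic sketch (not the code fixed in c168155; `words` is a hypothetical helper):

```python
def words(text):
    """Return a generator over the whitespace-separated words of text."""
    return (w for w in text.split())


gen = words("a b c")
first = list(gen)   # consumes the generator
second = list(gen)  # the generator is exhausted, so this is empty

# One common fix: materialize the items once and reuse the list.
items = list(words("a b c"))
first_fixed = list(items)
second_fixed = list(items)
```

Such bugs tend to show up when a value that used to be a list becomes a generator and is then consumed in more than one place.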

maxbachmann · 2024-01-05T07:42:21Z

Wow cool to finally see this in 🥳

maxbachmann added 5 commits August 29, 2022 00:22

only call words_normalized once

f3825cd

remove unused includes

205a969

remove python2.7 futures

f211d09

move grapheme clusters to ExtractedText

01571f2

apply black

22c3817

replace uniseg with uniseg2

a1f0a5e

update rapidfuzz version

d2bbc8a

use uniseg again

f48e305

🐛 Update tests for ExtractedText

a18b25b

In PR gh-72, @maxbachmann introduced a new argument for ExtractedText(). Update the corresponding tests.

This was referenced Feb 28, 2023

add version to ocrd-tool.json (and setup.py) #73

Closed

Release 1.0.0 #74

Open

mikegerber added this to the 1.0 milestone Mar 2, 2023

mikegerber self-assigned this Mar 2, 2023

mikegerber mentioned this pull request Mar 2, 2023

Improve performance when calculating sequence alignment #56

Closed

mikegerber mentioned this pull request Oct 11, 2023

Review multimethod dependency #88

Closed

mikegerber added a commit that referenced this pull request Oct 23, 2023

⬆ Update multimethod dependency

1c3b28d

We had some issues while reviewing/rebasing #72. We don't support Python 3.5 anymore, so lifting the hard pin on multimethod 1.3.

mikegerber mentioned this pull request Oct 23, 2023

⬆ Update multimethod dependency #93

Merged

⬆ Update multimethod dependency

7ed076d

We had some issues while reviewing/rebasing #72. We don't support Python 3.5 anymore, so lifting the hard pin on multimethod 1.3.

mikegerber added 2 commits October 27, 2023 18:48

✔ Add mets:FLocat's @LOCTYPE/OTHERLOCTYPE to test data

7fef02b

Newest OCR-D wasn't happy with the test data anymore (see gh-89). I'm not sure if the test data was invalid the way it was, but having a LOCTYPE certainly is "prettier" so adding it. This fixes the test again.

🕸Do not use deprecated ID, pageId options

bc95c03

See gh-75.

mikegerber added 4 commits October 31, 2023 19:01

🐛 Fix calculation of score_hint for edge cases, e.g. when CER is infi…

e256526

…nite If the CER is infinite, we can't calculate a score_hint as an int. Fall back to None in this case.

🐛 Fix docstring of distance() for grapheme clusters

618ea56

🐛 Fix score_hint call in cli_line_dirs

7c6ee59

❎ Make joining grapheme clusters more robust by checking joiner and h…

de6cd8f

…andling an empty joiner

mikegerber reviewed Oct 31, 2023

qurator/dinglehopper/tests/extracted_text_test.py

qurator/dinglehopper/extracted_text.py

⬆ Update uniseg dependency

68a12f8

@maxbachmann also improved the performance of uniseg, and it is in 0.7.2 - update our dependency.

mikegerber merged commit 38fcbc8 into qurator-spk:master Jan 2, 2024
