-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new metrics BoC and BoW #60
Conversation
Also some refactoring for helper methods on normalization and word splitting.
@b2m I want to reopen this, but your repo was deleted? |
I know this PR is stopped (@b2m was that intentional?) but we would love to see an implementation of BoW in dinglehopper. Unfortunately, it is not as straightforward to agree on what we mean with "BoW", cf. discussion in OCR-D/spec#225. @bertsky did a quick review of this PR in that context:
I think most of the functionality is here to implement BoW recall and accuracy, we just need to agree on what we mean by which and adapt. And since @b2m has already made the effort, including tests and code organization, it is a good starting point. Maybe we can discuss this at some point soon @mikegerber @bertsky @mweidling @b2m? |
As I personally communicated with @mikegerber in 2021 I closed the PR as I no longer had the time to maintain it (or the other PRs). If you find the code reusable feel free to open the PR again and modify it to your needs. I agree with @bertsky that in hindsight the naming is quite unfortunate. |
@b2m one always has some kind of pursuit curve with these things. Your implementation is the cleanest I have seen so far, we'll gladly refactor and update – many thanks! |
@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR. I agree mostly with everything and like to get this in. The HTML report might need some work and discussion. |
Pull requests are just branches in git, so you can check them out separately. Assuming that the remote https://github.com/qurator-spk/dinglehopper/ is called |
I'll try both, rebasing the PR's branch to master, or merging master into it, and see where I fare better. |
This pull requests adds 2 new metrics: Bag of Chars and Bag of Words to dinglehopper.
While introducing new metrics I also refactored several parts of dinglehopper to be able to handle several metrics without violating DRY.
Here is an overview on what has changed:
Note that these changes are not backward compatible, as the parameter
metrics
is no longer a flag and the json report format was changed. As dinglehopper does not have versions or releases (yet) I am not sure whether backward compatibility is an issue at this stage.For future use a weights parameter already is enabled for boc and bow, but not for cer and wer.
This parameter is not exposed at cli level (yet).
I may add another pull request converting cer and wer to be able to use weights also.
New Reports
Here is a screenshot of the new html report:

This is an extract from the new json report:
A note on how BoC and BoW are calculated
This approach can get extended to use bags (aka multiset or collections.Counter in Python) where each element also has a count.