Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new metrics BoC and BoW #60

Closed
wants to merge 15 commits into from
Closed

Add new metrics BoC and BoW #60

wants to merge 15 commits into from

Conversation

b2m
Copy link
Contributor

@b2m b2m commented Jun 15, 2021

This pull requests adds 2 new metrics: Bag of Chars and Bag of Words to dinglehopper.

While introducing new metrics I also refactored several parts of dinglehopper to be able to handle several metrics without violating DRY.

Here is an overview on what has changed:

  • Remove Python 3.5 support (CircleCI) to be able to use classes derived from NamedTuples.
  • Use absolute imports to avoid circular import errors.
  • Remove multimethod as it was not used anymore.
  • Update CSS/JavaScript dependencies in the report files to use some new features from Bootstrap v5.
  • Change the reports to be able to handle several metrics.
  • Separated json and html report generation for performance reasons.
  • Added flag to turn off html report generation for performance reasons.
  • Changed metric parameter to be able to configure which metrics should be calculated.
  • Added BoC and BoW as metrics.
  • Moved normalization methods to separate file (DRY).
  • Expose more information from metrics in reports (accuracy, error rate, elements in gt, elements in ocr, weighted errors, weights) for extended analysis (e.g. other ways to calculate the error rate =).
  • Fix flake8 and mypy errors.

Note that these changes are not backward compatible, as the parameter metrics is no longer a flag and the json report format was changed. As dinglehopper does not have versions or releases (yet) I am not sure whether backward compatibility is an issue at this stage.

For future use a weights parameter already is enabled for boc and bow, but not for cer and wer.
This parameter is not exposed at cli level (yet).

I may add another pull request converting cer and wer to be able to use weights also.

New Reports

Here is a screenshot of the new html report:
report-simple

This is an extract from the new json report:

{
    "gt": "gt.txt",
    "ocr": "ocr.txt",
    "character_accuracy": {
        "weights": {
            "deletes": 1,
            "inserts": 1,
            "replacements": 1
        },
        "weighted_errors": 1,
        "reference_elements": 5,
        "compared_elements": 5,
        "accuracy": 0.8,
        "error_rate": 0.2
    },
    ...
}

A note on how BoC and BoW are calculated

  1. Imagine gt and ocr as sets.
  2. An element in gt and not in ocr is considered a delete.
  3. An element in ocr and not in gt is considered an insert.
gt, ocr = set(gt_text), set(ocr_text)
deletes = gt - ocr
inserts = ocr - gt
  1. Replacements are only calculated when the cost of a replacement is lower than that of a combined insert and delete. For replacements we are taking arbitrary pairs from gt and ocr until one is exhaustet.

This approach can get extended to use bags (aka multiset or collections.Counter in Python) where each element also has a count.

@b2m b2m changed the title Add new metric BoC and BoW Add new metrics BoC and BoW Jun 15, 2021
@mikegerber mikegerber self-assigned this Jun 17, 2021
@b2m b2m closed this Aug 13, 2021
@mikegerber
Copy link
Member

@b2m I want to reopen this, but your repo was deleted?

@kba
Copy link
Contributor

kba commented Mar 1, 2023

I know this PR is stopped (@b2m was that intentional?) but we would love to see an implementation of BoW in dinglehopper. Unfortunately, it is not as straightforward to agree on what we mean with "BoW", cf. discussion in OCR-D/spec#225.

@bertsky did a quick review of this PR in that context:

bag_accuracy itself looks good, but it's used by bag_of_words_accuracy and bag_of_chars_accuracy with Weights(1, 0, 1) – i.e. only deletes, but no inserts (note that replacements will be zero). It's wrong to call this accuracy (it lends itself to recall).

Also, MetricResult.error_rate seems to be wrong, because it uses the length of the reference as denominator (instead of length of reference plus length of compared). So accuracy would also be wrong.

Together these two mistakes compensate each other, so that implementation calculates a valid recall (complement of false negative rate).

I think most of the functionality is here to implement BoW recall and accuracy, we just need to agree on what we mean by which and adapt. And since @b2m has already made the effort, including tests and code organization, it is a good starting point. Maybe we can discuss this at some point soon @mikegerber @bertsky @mweidling @b2m?

@b2m
Copy link
Contributor Author

b2m commented Mar 1, 2023

I know this PR is stopped (@b2m was that intentional?)

As I personally communicated with @mikegerber in 2021 I closed the PR as I no longer had the time to maintain it (or the other PRs).

If you find the code reusable feel free to open the PR again and modify it to your needs.

I agree with @bertsky that in hindsight the naming is quite unfortunate.

@bertsky
Copy link
Contributor

bertsky commented Mar 1, 2023

@b2m one always has some kind of pursuit curve with these things. Your implementation is the cleanest I have seen so far, we'll gladly refactor and update – many thanks!

@mikegerber
Copy link
Member

@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.

I agree mostly with everything and like to get this in. The HTML report might need some work and discussion.

@b2m
Copy link
Contributor Author

b2m commented Mar 2, 2023

@bertsky If you figure out how to get the code out of the closed PR, please do and open a fresh PR.

Pull requests are just branches in git, so you can check them out separately.

Assuming that the remote https://github.com/qurator-spk/dinglehopper/ is called origin and your new local branch should be boc-and-bow you can just use the following git command: git fetch origin pull/60/head:boc-and-bow

@bertsky
Copy link
Contributor

bertsky commented Mar 2, 2023

I'll try both, rebasing the PR's branch to master, or merging master into it, and see where I fare better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants