CER calculation: divide distance by length of alignment path #21
Same for … and … and probably also …
Levenshtein distance is indeed symmetric, but the texts here are not on equal footing: one is the ground truth while the other is the output. By normalizing to the max one will get a 0% CER if the reference text is "AB" and the output is "ABBBBBBB...". Normalizing to the ground truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).
In the limit, yes. But that figure is the correct one – or would you rather have 100% then? EDIT: sorry, I actually meant 0% accuracy (as being correct in the limit). So I must contradict you: normalizing to the max will get a 100% CER in the limit.
Empty output will also become 0% for the max-length denominator, not infinite. EDIT: here I also meant accuracy, not CER. Empty output will also be 100% CER. And the asymmetrical denominator can yield rates larger than 100%.
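To make the limit behaviour above concrete, here is a small self-contained sketch (illustrative Java, not code from this repository; the helper and the example strings are hypothetical) comparing the reference-length and the max-length denominators for the "AB" example and for an empty output:

```java
// Sketch (not from the repository): compares CER under a reference-length
// vs. a max-length denominator for the examples discussed above.
public class CerDenominators {

    // Plain Levenshtein distance (iterative two-row DP).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String gt = "AB";
        for (String ocr : new String[] {"ABBBBBBBBB", ""}) {
            int d = levenshtein(gt, ocr);
            double byRef = (double) d / gt.length();
            // Inner max(..., 1) only guards against division by zero if both strings are empty.
            double byMax = (double) d / Math.max(gt.length(), Math.max(ocr.length(), 1));
            System.out.printf("GT=%-2s OCR=%-12s dist=%d  CER/ref=%.2f  CER/max=%.2f%n",
                    gt, "\"" + ocr + "\"", d, byRef, byMax);
        }
        // GT=AB OCR="ABBBBBBBBB": dist=8, CER/ref=4.00, CER/max=0.80 (approaches 1.00 as the output grows)
        // GT=AB OCR=""          : dist=2, CER/ref=1.00, CER/max=1.00
    }
}
```

As the number of trailing B's grows, the max-length figure keeps approaching 1.00, while the reference-length figure grows without bound.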
They might have the first two tokens in common, but since the longer one has lots of additions, in my intuition this should register as 100% incorrect. Such "more errors than tokens" cases, where …
That's what I was arguing for (as being the only correct implementation).
What do you mean, 100% accuracy or 100% error rate? (My stance is that the more …)
I mean error rate.
I think it's counter-intuitive that, however many spurious extra tokens are in the detection, CER will always be less than 100%. Also, it converges towards 1, not 0, doesn't it? Because the … In fact, what about … seems reasonable, but then again … perhaps not.
Actually, …
Yes, sorry, I was too sloppy with my writing; I was really thinking of accuracy, not CER (see the edits above). So we are actually in agreement :-)
That's not the same, though. You cannot just arbitrarily saturate at 100%. The asymmetrical denominator is already biased; clipping does not unbias it.
Same as above (biased: this would make values close to one much more likely than they actually are).
That's nothing other than accuracy = 1 - CER, though, in the (clipped) asymmetrical-denominator implementation.
I am wondering, though, whether the correct denominator really is not the max length but the length of the alignment path, i.e. the number of alignment positions/operations (insertions plus deletions plus substitutions plus identities in the minimal alignment). (This seems to be the implementation chosen in the ISRI tools.)
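For illustration, here is a sketch of that alignment-path-length denominator (plain Levenshtein DP plus a traceback; this is not the ISRI or ocrevalUAtion implementation, and ties between equally cheap operations are broken in favour of diagonal moves, so another minimal alignment could give a slightly different path length):

```java
// Sketch of the "divide by alignment path length" idea from this thread (illustrative only).
public class CerByAlignmentPath {

    /** Returns {distance, pathLength} for one minimal Levenshtein alignment. */
    static int[] align(String gt, String ocr) {
        int n = gt.length(), m = ocr.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int cost = gt.charAt(i - 1) == ocr.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        // Trace one minimal alignment back, counting its positions:
        // insertions + deletions + substitutions + identities.
        int i = n, j = m, pathLength = 0;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1]
                    + (gt.charAt(i - 1) == ocr.charAt(j - 1) ? 0 : 1)) {
                i--; j--;                       // identity or substitution
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                i--;                            // deletion
            } else {
                j--;                            // insertion
            }
            pathLength++;
        }
        return new int[] {d[n][m], pathLength};
    }

    public static void main(String[] args) {
        for (String[] pair : new String[][] {{"AB", "ABBBBBBBBB"}, {"abc", "cab"}}) {
            int[] r = align(pair[0], pair[1]);
            System.out.printf("GT=%s OCR=%s distance=%d pathLength=%d CER=%.2f%n",
                    pair[0], pair[1], r[0], r[1], (double) r[0] / r[1]);
        }
        // GT=AB  OCR=ABBBBBBBBB distance=8 pathLength=10 CER=0.80
        // GT=abc OCR=cab        distance=2 pathLength=4  CER=0.50
        // The distance never exceeds the number of alignment positions,
        // so this CER is always within [0, 1].
    }
}
```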
Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes. Why would the error rate be symmetrical? It's only similar to the symmetric distance, but the compared texts have clear roles: the GT reference and the compared text.
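For reference, the accuracy in question (as I read p. 25 of the dissertation) is defined from the number of ground-truth characters n and the number of errors e, which is exactly what allows negative values; with hypothetical numbers:

$$\mathrm{accuracy} = \frac{n - e}{n}, \qquad n = 10,\ e = 25 \;\Rightarrow\; \mathrm{accuracy} = \frac{10 - 25}{10} = -1.5, \qquad \mathrm{CER} = \frac{e}{n} = 2.5 .$$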
Sure, you can define it that way, but then it would not be interpretable and thus not useful anymore. (Clipping does not make it more interpretable, only more biased.) It now seems clear to me that that definition was simply a misconception (on Rice's part, and possibly of his followers). The section does not even discuss the alternatives of using the maximum sequence length or simply the alignment path length. The latter is the natural definition here, as it directly maps to the [0, 1] range.
Because that rate is meant to approximate the probability of an average character being misrecognized, and that can (assuming ergodicity) be expressed in terms of the distinct number of errors – but since the latter is (in the Levenshtein case) strictly symmetric, the former must be, too.
@mikegerber Thank you for pointing back to the original dissertation from Rice! His accuracy formula on p. 25 matches exactly what I tried to express above. It seems completely natural to me to put it this way. Regarding the possibility of getting negative accuracy: I've seen this in some cases, hence the restriction for when one encounters more errors than the GT has characters at all, which is interpreted as …
Though one must consider that Rice himself, in the OCR system benchmarks (cf. The Fifth Annual Test of OCR Accuracy), removed any results/systems which fell below a certain accuracy limit of 90% (!) and didn't try to make sense of corner cases like the above.
This is not about corner cases, though. CER is meant as an empirical estimate of the probability of misrecognising a random character, assuming ergodicity. Using a biased denominator makes for a biased estimate, period. You usually aggregate CER as an arithmetic average over many pairs of strings (typically lines), and that means even a single line where the OCR text is longer than the GT text (i.e. a single CER beyond 1 – possibly many times larger) will distort the whole average arbitrarily. Later formulations rightfully deviate from Rice by defining the denominator as …
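To illustrate the aggregation concern with made-up numbers (the line count, distances and lengths below are hypothetical, not measured): 99 perfectly recognized lines plus a single 2-character GT line that attracted 52 characters of spurious output.

```java
// Hypothetical per-line numbers (not measured): 99 perfectly recognized lines
// plus one 2-character GT line that received 52 characters of spurious OCR output.
public class CerAggregation {
    public static void main(String[] args) {
        int lines = 100;
        int badDistance = 50, badRefLen = 2, badMaxLen = 52;

        // Macro-average of per-line CERs with the reference-length denominator:
        // the single bad line contributes 50/2 = 25.0 on its own.
        double macroByRef = ((lines - 1) * 0.0 + (double) badDistance / badRefLen) / lines;

        // Same aggregation, but the bad line's CER is bounded by a max-length
        // (or alignment-path-length) denominator.
        double macroByMax = ((lines - 1) * 0.0 + (double) badDistance / badMaxLen) / lines;

        System.out.printf("macro CER, reference denominator: %.3f%n", macroByRef); // 0.250
        System.out.printf("macro CER, max-length denominator: %.3f%n", macroByMax); // 0.010
    }
}
```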
In the current implementation, the denominator of the CER calculation is the reference (first) text's length:
ocrevalUAtion/src/main/java/eu/digitisation/output/ErrorMeasure.java, line 53 at commit 84f15b8
But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.
AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.
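A minimal worked example of such a contribution (hypothetical strings, unit Levenshtein costs): for GT "AB" and OCR output "XYZQW" the distance is 5 (two substitutions plus three insertions), so

$$\frac{d}{|\mathrm{GT}|} = \frac{5}{2} = 2.5, \qquad \frac{d}{\max(|\mathrm{GT}|,\,|\mathrm{OCR}|)} = \frac{5}{5} = 1.0, \qquad \frac{d}{\text{alignment path length}} = \frac{5}{5} = 1.0 .$$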