SingleByteCharSetProber.Reset() does not correctly reset #138

adimosh · 2022-02-02T19:47:49Z

Background

After forking this repository to customize the implementation for some of my own needs, I tried to cache the prober implementations, relying on the Reset method to make sure they're "clean" between runs. I found the bug while running tests for this scenario.

What happens

Running the same probers multiple times makes certain previously recognized charsets report a higher confidence than they should, so they can overtake the confidence of the correct encoding's prober.

This is caused by the Reset() method of the SingleByteCharSetProber class, which resets state, lastOrder, seqCounters, totalSeqs, totalChar and freqChar back to their default values.

It does not, however, reset ctrlChar to its default value. The stale count carries over into the next run, so confidence grows slowly with each use of the prober.
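To illustrate the mechanism, here is a minimal sketch using a hypothetical stand-in class, not the library's actual code. Only the field names totalChar and ctrlChar come from this issue; the LeakyProber class, its Feed method, and the rule used to count control bytes are assumptions made for the example.

```csharp
using System;

// Hypothetical stand-in, NOT the library's SingleByteCharSetProber.
// Only the field names totalChar and ctrlChar are taken from the issue.
public class LeakyProber
{
    private int totalChar;
    private int ctrlChar;

    public int TotalChar => totalChar;
    public int CtrlChar => ctrlChar;

    // Assumed counting rule for the example: bytes below 0x20 are control bytes.
    public void Feed(byte[] buffer)
    {
        foreach (byte b in buffer)
        {
            totalChar++;
            if (b < 0x20)
            {
                ctrlChar++;
            }
        }
    }

    // Mirrors the reported bug: totalChar is cleared, ctrlChar is not.
    public void Reset()
    {
        totalChar = 0;
    }
}

public static class Demo
{
    public static void Main()
    {
        var prober = new LeakyProber();
        prober.Feed(new byte[] { 0x41, 0x0A, 0x42 }); // contains one control byte
        prober.Reset();
        prober.Feed(new byte[] { 0x43, 0x44 });       // contains no control bytes

        // A fresh instance would report ctrlChar=0 here; the reused one reports 1.
        Console.WriteLine($"totalChar={prober.TotalChar}, ctrlChar={prober.CtrlChar}");
    }
}
```

Any value derived from both counters is then computed from mismatched totals on the second run, which is the kind of drift described above.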

Proposed solution

Add the line:

ctrlChar = 0;

...anywhere in the Reset method (possibly on line 204 of the /src/Core/Probers/SingleByteCharSetProber.cs file, for example).
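For reference, a sketch of what Reset could look like with the proposed line added. The field names are the ones listed in this issue; the field types, the ProbingState enum, the array size, and the default value chosen for lastOrder are assumptions, so treat this as an illustration rather than the actual contents of the file.

```csharp
using System;

// Assumed enum; the real library may define its states differently.
public enum ProbingState { Detecting, FoundIt, NotMe }

// Hypothetical stand-in class. Field names follow the issue text;
// types, array size, and default values are assumptions.
public class ProberSketch
{
    private const int SeqCategoryCount = 4; // assumed size of seqCounters

    private ProbingState state;
    private byte lastOrder;
    private readonly int[] seqCounters = new int[SeqCategoryCount];
    private int totalSeqs;
    private int totalChar;
    private int freqChar;
    private int ctrlChar;

    public void Reset()
    {
        state = ProbingState.Detecting;
        lastOrder = 255; // assumed "no previous character" marker
        Array.Clear(seqCounters, 0, seqCounters.Length);
        totalSeqs = 0;
        totalChar = 0;
        freqChar = 0;
        ctrlChar = 0; // the line proposed in this issue
    }
}
```

With every counter cleared, a reused instance should start each run in the same state as a newly constructed one.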

Conclusion

The observed behaviour was that Windows-1250 and Windows-1252 were recognized significantly more often than any other charset.

Once the fix is applied, probers can be cached and reused, resulting in significantly fewer allocations and fewer recognition bugs.

It is entirely possible that this might be the cause of a few of the issues currently outlined, like:

304NotModified · 2022-02-03T23:00:40Z

Could you please send a PR? (E.g. just edit the file on GitHub and propose the pull request.)

304NotModified added the bug label May 17, 2022

304NotModified added this to the 2.6 milestone Jun 27, 2022

304NotModified mentioned this issue Jun 27, 2022: Fix Reset() of SingleByteCharSetProber #150 (Merged)

304NotModified closed this as completed in #150 Jun 27, 2022
