Background
After forking this repository to customize the implementation for some of my own needs, I tried to cache the prober implementations, relying on the Reset method to make sure they're "clean" between runs. I found the bug while running tests for this scenario.
What happens
Reusing a prober across multiple runs increases the confidence reported for certain previously recognized charsets, which can then overtake the confidence of the correct encoding's prober.
This is a result of the Reset() method of the SingleByteCharSetProber class, which resets state, lastOrder, seqCounters, totalSeqs, totalChar and freqChar back to their default values.
It does not, however, also reset ctrlChar to its default value. Confidence, therefore, grows slowly with each use of the prober.
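The mechanism can be illustrated with a minimal, self-contained sketch. ToyProber, StaleCounterDemo and the "byte below 0x20" rule are hypothetical stand-ins invented for this illustration, not the library's code, and the sketch exposes raw counters rather than a confidence so that it does not guess at the real confidence formula.

```csharp
using System;

// Hypothetical stand-in for a prober; NOT the library's class.
sealed class ToyProber
{
    private readonly bool resetCtrlChar; // true = Reset also clears ctrlChar
    private int totalChar;
    private int ctrlChar;

    public ToyProber(bool resetCtrlChar)
    {
        this.resetCtrlChar = resetCtrlChar;
    }

    // Counts every byte, and separately counts bytes below 0x20
    // as control characters (an assumed rule for this sketch).
    public void Feed(byte[] data)
    {
        foreach (byte b in data)
        {
            totalChar++;
            if (b < 0x20) ctrlChar++;
        }
    }

    // Mirrors the reported bug when resetCtrlChar is false:
    // totalChar is cleared, ctrlChar survives.
    public void Reset()
    {
        totalChar = 0;
        if (resetCtrlChar) ctrlChar = 0;
    }

    public (int TotalChar, int CtrlChar) Snapshot() => (totalChar, ctrlChar);
}

static class StaleCounterDemo
{
    // Feeds the same input twice with a Reset in between and reports
    // whether both runs ended with identical counters.
    public static bool SecondRunMatchesFirst(bool resetCtrlChar)
    {
        byte[] sample = { 0x41, 0x0A, 0x42, 0x0A };
        var prober = new ToyProber(resetCtrlChar);

        prober.Feed(sample);
        var first = prober.Snapshot();

        prober.Reset();
        prober.Feed(sample);
        return first.Equals(prober.Snapshot());
    }
}
```

With the counter left out of Reset, the second run over identical input ends in a different state than the first, so anything computed from those counters depends on the prober's history rather than only on the current input.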
Proposed solution
Add the line:
ctrlChar = 0;
...anywhere in the Reset method (for example, on line 204 of the /src/Core/Probers/SingleByteCharSetProber.cs file).
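As a rough sketch of where the line would sit, here is a trimmed-down stand-in for the class. Only the field names come from the description above; the class name, the enum, the field types, the array size and the default values are assumptions made so the example compiles on its own, and may differ from the actual source.

```csharp
using System;

// Hypothetical enum standing in for the prober state type.
enum SketchState { Detecting, FoundIt, NotMe }

// Hypothetical, trimmed-down stand-in for SingleByteCharSetProber.
sealed class SketchSingleByteProber
{
    private const int SeqCategoryCount = 4; // assumed size

    private SketchState state = SketchState.Detecting; // assumed default
    private byte lastOrder = 255;                      // assumed default
    private readonly int[] seqCounters = new int[SeqCategoryCount];
    private int totalSeqs;
    private int totalChar;
    private int freqChar;
    private int ctrlChar;

    // Simulates a detection run that leaves every field dirty.
    public void SimulateRun()
    {
        state = SketchState.NotMe;
        lastOrder = 1;
        for (int i = 0; i < seqCounters.Length; i++) seqCounters[i] = 3;
        totalSeqs = 12;
        totalChar = 20;
        freqChar = 15;
        ctrlChar = 5;
    }

    public void Reset()
    {
        state = SketchState.Detecting;
        lastOrder = 255;
        Array.Clear(seqCounters, 0, seqCounters.Length);
        totalSeqs = 0;
        totalChar = 0;
        freqChar = 0;
        ctrlChar = 0; // the proposed addition
    }

    // True only when every field is back at its (assumed) default.
    public bool IsClean
    {
        get
        {
            foreach (int c in seqCounters)
            {
                if (c != 0) return false;
            }
            return state == SketchState.Detecting && lastOrder == 255
                && totalSeqs == 0 && totalChar == 0
                && freqChar == 0 && ctrlChar == 0;
        }
    }
}
```

A check like IsClean after SimulateRun followed by Reset is one possible way to catch a field being forgotten in Reset again in the future.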
Conclusion
The observed behaviour was that Windows-1250 and Windows-1252 were recognized significantly more often than any other charset.
Once ctrlChar is also reset, probers can be cached and reused, resulting in significantly fewer allocations and fewer recognition bugs.
It is entirely possible that this might be the cause of a few of the issues currently outlined, like: