Port multi-byte character ratio detection in UTF-8 prober confidence function from jschardet #117
@rstm-sf do we think we should merge this one?
{
    for (int i = 0; i < numOfMBChar; i++)
        unlike *= ONE_CHAR_PROB;
        unlike *= (float)Math.Pow(ONE_CHAR_PROB, numOfMBChar);
Hello, sorry for the delay, it took a while to understand this change.
This method can be simplified to the following:
public override float GetConfidence(StringBuilder status = null)
{
    const float like = 0.99f;
    if (numOfMBChar >= 6)
        return like;
    var mbCharRatio = (float)mbCharLen / (fullLen - basicAsciiLen);
    if (mbCharRatio > 0.6f)
        return like;
    var negative = (float)Math.Pow(ONE_CHAR_PROB, numOfMBChar * numOfMBChar);
    return like * (1f - negative);
}
I found a partial explanation in this PR: aadsm/jschardet#59
But this particular change is not entirely clear to me. I asked about it here: aadsm/jschardet#57 (comment)
Also, I can't figure out why it works.
For reference: aadsm/jschardet@f45b273#r47324263
If ONE_CHAR_PROB = 0.45f, then the tests start to pass too (instead of using the double pow).
(... I can't figure out why it works.)
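For comparison, the two variants discussed above can be written out side by side. This is only a sketch under stated assumptions: it takes the current value of ONE_CHAR_PROB to be 0.5 (not stated in this thread), reads "instead of double pow" as using a single exponent of numOfMBChar, and covers only the branch where numOfMBChar < 6 and mbCharRatio <= 0.6. The n² exponent is what results from multiplying by p^n on each of n loop iterations.

```latex
% n = numOfMBChar, p = ONE_CHAR_PROB
% Multiplying by p^n on each of n loop iterations gives the squared exponent:
\underbrace{p^{n}\cdot p^{n}\cdots p^{n}}_{n\ \text{times}} = p^{n^{2}}

% The two candidate confidence formulas:
C_{\text{double}}(n) = 0.99\,\bigl(1 - p^{n^{2}}\bigr), \qquad
C_{\text{single}}(n) = 0.99\,\bigl(1 - p^{n}\bigr)

% Values rounded to four decimals
% (assumed p = 0.5 for the double pow, p = 0.45 for the single pow):
\begin{array}{c|c|c}
n & 0.99\,(1 - 0.5^{n^{2}}) & 0.99\,(1 - 0.45^{n}) \\ \hline
1 & 0.4950 & 0.5445 \\
2 & 0.9281 & 0.7895 \\
3 & 0.9881 & 0.8998 \\
4 & 0.9900 & 0.9494 \\
5 & 0.9900 & 0.9717
\end{array}
```

Under these assumptions the two variants are not equivalent: the squared exponent gives a lower confidence for a single multi-byte character but saturates near 0.99 from n = 3 onward, while the single exponent with 0.45 rises more gradually. Which tests are sensitive to that difference is not something this comparison can answer.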
@@ -107,11 +119,17 @@ public override float GetConfidence(StringBuilder status = null)
{
    float unlike = 0.99f;
    float confidence;
    var mbCharRatio = 0.0f;
    var nonBasciAsciiLen = fullLen - basicAsciiLen;
    if (nonBasciAsciiLen > 0)
It seems to me that this is always true. Why could it be otherwise?
@yinyue200 do you think you could check the review comments? Or should we close this PR for now?
fix #108