File detected as Windows-1250, but is UTF-8 #108
Comments
Hello, @tobbi! Thank you for the report. Could you attach a plain text file? Why did you choose zip? Are you passing the zip file itself as the input?
Sorry, my bad. It was originally a CSV file, and GitHub wouldn't accept that. Here's the file with the extension changed to .txt:
Thanks for clarifying. At first glance, I think this result is expected. Why? The detection algorithm is statistical, so the more varied the input data, the more accurate the final result. Details can be found in the article "A composite approach to language/encoding detection". Still, we should try to improve the result :)

Status Logs:

SBCS: Detected windows-1250 with confidence of 0.7738685
Get confidence:
SBCS 0.01: [koi8-r]
SBCS 0: [iso-8859-5]
SBCS 0.01: [x-mac-cyrillic]
SBCS 0.01: [ibm866]
SBCS 0.01: [ibm855]
SBCS 0.18598664: [iso-8859-7]
SBCS 0.18598664: [windows-1253]
SBCS 0: [iso-8859-5]
SBCS 0.01: [windows-1251]
SBCS 0: [windows-1255]
SBCS 0: [windows-1255]
SBCS 0: [windows-1255]
SBCS 0.09991017: [tis-620]
SBCS 0.09991017: [iso-8859-11]
SBCS 0.7133932: [iso-8859-1]
SBCS 0.6674997: [iso-8859-15]
SBCS 0.7133932: [windows-1252]
SBCS 0.71340704: [iso-8859-1]
SBCS 0.67082536: [iso-8859-15]
SBCS 0.71340704: [windows-1252]
SBCS 0.6861101: [iso-8859-2]
SBCS 0.6861101: [windows-1250]
SBCS 0.76677626: [iso-8859-1]
SBCS 0.76677626: [windows-1252]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS inactive: [iso-8859-6] (i.e. confidence is too low).
SBCS 0.40016073: [viscii]
SBCS 0.44124976: [windows-1258]
SBCS 0.71854687: [iso-8859-15]
SBCS 0.7641578: [iso-8859-1]
SBCS 0.7641578: [windows-1252]
SBCS 0.71640146: [iso-8859-13]
SBCS 0.6377162: [iso-8859-10]
SBCS 0.6736411: [iso-8859-4]
SBCS 0.71818155: [iso-8859-13]
SBCS 0.6363546: [iso-8859-10]
SBCS 0.6753149: [iso-8859-4]
SBCS 0.666065: [iso-8859-1]
SBCS 0.666065: [iso-8859-9]
SBCS 0.62630904: [iso-8859-15]
SBCS 0.666065: [windows-1252]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.6366351: [iso-8859-2]
SBCS 0.72143143: [x-mac-ce]
SBCS 0.72143143: [ibm852]
SBCS 0.6434225: [windows-1250]
SBCS 0.64008415: [iso-8859-2]
SBCS 0.7291228: [x-mac-ce]
SBCS 0.7253399: [ibm852]
SBCS 0.58494663: [windows-1250]
SBCS 0.5881849: [iso-8859-2]
SBCS 0.61615247: [iso-8859-13]
SBCS 0.58494663: [iso-8859-16]
SBCS 0.66285837: [x-mac-ce]
SBCS 0.65958494: [ibm852]
SBCS 0.7628341: [iso-8859-1]
SBCS 0.71730226: [iso-8859-4]
SBCS 0.71730226: [iso-8859-9]
SBCS 0.7628341: [iso-8859-13]
SBCS 0.71730226: [iso-8859-15]
SBCS 0.7628341: [windows-1252]
SBCS 0.76252055: [iso-8859-1]
SBCS inactive: [iso-8859-3] (i.e. confidence is too low).
SBCS 0.71700746: [iso-8859-15]
SBCS 0.76252055: [windows-1252]
SBCS 0.6695262: [windows-1250]
SBCS 0.6695262: [iso-8859-2]
SBCS 0.7052443: [iso-8859-13]
SBCS 0.6695262: [iso-8859-16]
SBCS 0.7587035: [x-mac-ce]
SBCS 0.7587035: [ibm852]
SBCS 0.76380235: [windows-1252]
SBCS 0.76380235: [windows-1257]
SBCS 0.71821266: [iso-8859-4]
SBCS 0.76380235: [iso-8859-13]
SBCS 0.71821266: [iso-8859-15]
SBCS 0.6575037: [iso-8859-1]
SBCS 0.6575037: [iso-8859-9]
SBCS 0.61825883: [iso-8859-15]
SBCS 0.6575037: [windows-1252]
SBCS 0.7738685: [windows-1250]
SBCS 0.7738685: [iso-8859-2]
SBCS 0.7738685: [iso-8859-16]
SBCS 0.75962406: [ibm852]
SBCS 0.66994256: [windows-1250]
SBCS 0.66994256: [iso-8859-2]
SBCS 0.66994256: [iso-8859-16]
SBCS 0.75917524: [x-mac-ce]
SBCS 0.75917524: [ibm852]
SBCS 0.76376295: [iso-8859-1]
SBCS 0.7181756: [iso-8859-4]
SBCS 0.76376295: [iso-8859-9]
SBCS 0.7181756: [iso-8859-15]
SBCS 0.76376295: [windows-1252]
SBCS Group found best match [windows-1250] confidence 0.7738685.

MBCS: Detected utf-8 with confidence of 0.7525
Get confidence:
MBCS 0.01: [shift-jis]
MBCS 0.01: [euc-jp]
MBCS 0.01: [gb18030]
MBCS 0.01: [euc-kr]
MBCS 0.01: [cp949]
MBCS 0.01: [big5]
MBCS inactive: euc-tw (i.e. confidence is too low).

Latin1Prober: Detected windows-1252 with confidence of 0.43269232
Latin1Prober: 0.43269232 [windows-1252]
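To make the logs easier to read: each prober group reports its own best guess, and the reported result here matches the group with the highest confidence. Here is a minimal Python sketch of that comparison, using the three group-level values copied from the logs. The helper name `pick_best` and the "highest confidence wins" rule are assumptions for illustration, not UTF.Unknown's actual code.

```python
# Group-level results copied from the status logs in this thread.
group_results = [
    ("windows-1250", 0.7738685),   # SBCS group best match
    ("utf-8", 0.7525),             # MBCS group
    ("windows-1252", 0.43269232),  # Latin1Prober
]


def pick_best(results):
    """Return the (encoding, confidence) pair with the highest confidence.

    Hypothetical helper: assumes the final answer is simply the
    top-scoring group, which is what the logs suggest.
    """
    return max(results, key=lambda pair: pair[1])


best_name, best_conf = pick_best(group_results)

# How far ahead of the UTF-8 guess the winner is.
margin = best_conf - dict(group_results)["utf-8"]

print(best_name, round(margin, 4))
```

Under that assumption, windows-1250 beats utf-8 by only about 0.02, so a small change in the input statistics could flip the result.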
I'm using UTF.Unknown 2.3.0
The following file is detected as Windows-1250, but is UTF-8:
csv_test_correct_GZ.zip
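For anyone who wants to double-check that a file's bytes are well-formed UTF-8 independently of a statistical detector, a strict decode is one option. This is a minimal Python sketch and not part of UTF.Unknown. The helper name `is_valid_utf8` and the sample bytes are invented for illustration and are not taken from the attached file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Sample text invented for this sketch, encoded two different ways.
utf8_bytes = "Größe;Straße".encode("utf-8")
cp1250_bytes = "Größe;Straße".encode("cp1250")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(cp1250_bytes))  # False
```

A passing check only shows that the bytes are well-formed UTF-8. Pure ASCII input also passes, so the check cannot tell ASCII and UTF-8 apart.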