Detect encoding by looking for specific markers #41
Comments
That’s a good question, and I did think about it. The reason I ended up not doing it is that it’s not uncommon for a file to declare encoding X while actually being stored in Y. It’s a common source of bugs. In this case I would prefer to defer this type of bug detection to an IDE so it can inform the user about it. I’m happy to re-evaluate this though. I’ve noticed you’re working on VSCode, so you could provide better insight on this. My assumption was also that IDEs would prefer to know the actual byte encoding instead of relying on metadata. Otherwise it would be impossible for the IDE to know the real encoding (though I guess I could always add an option to opt out of metadata usage).
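To illustrate the kind of check an IDE could run on top of a byte-level guess, here is a minimal sketch. `normalizeLabel` and `checkDeclaredEncoding` are hypothetical names, not part of jschardet or VSCode, and the label comparison is deliberately naive: it knows nothing about aliases or about one encoding being a superset of another (for example, an ASCII-only file that declares UTF-8 would be flagged).

```javascript
// Hypothetical helper: loosely normalize an encoding label so that
// "UTF-8", "utf8" and "utf_8" compare as equal.
function normalizeLabel(label) {
  return String(label).toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Hypothetical helper: compare the encoding a file declares with the
// encoding a byte-level detector reported, and build a warning if they differ.
function checkDeclaredEncoding(declared, detected) {
  // Nothing to compare if either side is missing.
  if (!declared || !detected) return { mismatch: false, message: null };

  const mismatch = normalizeLabel(declared) !== normalizeLabel(detected);
  return {
    mismatch,
    message: mismatch
      ? `File declares ${declared} but its bytes look like ${detected}.`
      : null,
  };
}

console.log(checkDeclaredEncoding("ISO-8859-1", "UTF-8").message);
```

Keeping the two values separate, rather than letting the declared one silently win, is what lets the editor surface the mismatch to the user.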
Yeah, I brought this up because we got some reports from users asking for this feature, and we use jschardet when users have enabled auto-guessing of encoding. Maybe this could be an option in jschardet that is not enabled by default. I see the issue with the actual encoding being different from what is set in the file by the user. On the other hand, isn't the encoding always a guess that can be wrong? So maybe using the hints that are in the file is not a bad idea (at least optionally).
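An opt-in option along these lines might look roughly like the sketch below. This is not jschardet's API: `detectWithOptionalHint` and `useHints` are made-up names, `detectBytes` stands in for a byte-level detector assumed to return an object shaped like `{ encoding, confidence }`, and `readHint` stands in for whatever reads a marker from the file.

```javascript
// Hypothetical wrapper: byte-level detection by default, with file hints
// used only when the caller explicitly asks for them.
function detectWithOptionalHint(bytes, detectBytes, readHint, options = {}) {
  const { useHints = false } = options; // off by default

  // Always run the byte-level guess so the caller can still see it.
  const guess = detectBytes(bytes);
  if (!useHints) return { ...guess, source: "bytes" };

  // readHint is expected to return an encoding label or null.
  const hint = readHint(bytes);
  if (!hint) return { ...guess, source: "bytes" };

  // The hint wins, but the byte-level guess is kept alongside it.
  return { encoding: hint, confidence: null, source: "hint", byteGuess: guess };
}
```

Returning the byte-level guess next to the hint keeps the "real" encoding available to callers that want to compare the two.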
Yeah, the encoding is always a guess, but on the premise that it uses the bytes and not the metadata (I originally created this library to detect Cyrillic encodings in ID3 tags that reported wrong metadata). I’m happy to have this feature though. I currently don’t have the time to implement it (maybe during Christmas vacation though). Do you know if anyone is interested in coding this? I can provide mentoring and guidance if needed.
Have a nice time! I have a similar problem, but "from the other side": a developer uses [email protected] (he says) and from time to time (not always!) they get a "MacCyrillic-instead-of-Win1251" error with my files. But the headers and bodies are 1251. Do you have a sandbox to verify my files (perhaps I have made some error in my files)? Sincerely yours, Dmitry PS Merry Christmas! :-)
What do you mean by a sandbox to verify your files?
jschardet does its job correctly: "hey, this file says it is ISO8859, but jschardet is confident that what's in the file is encoded in UTF-8." Moreover, the user might have "encoding autodetect" in VSCode set to false. And guess what? You don't even use encoding detection during "replace all", so if I want to replace a tag in an XML file (which doesn't even have chars outside [a-zA-Z]), "tagA" with "tagB", you do the process above of ignoring everything and damage ALL the user's files at once (because you open the whole file and save the whole file). But when dozens of users open a bug report in VSCode (because it is one), you classify it as a "feature request". 🤡
I was wondering if jschardet would ever consider understanding specific markers within a file to get the encoding from. For example, XML can have an encoding in the header:
<?xml version="1.0" encoding="windows-1251"?>
and HTML as well:
<meta charset="..."/>
There may be other languages where this exists too.
Refs: microsoft/vscode#36230
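As a rough illustration of what reading these markers could involve, here is a minimal sketch for Node.js. `sniffDeclaredEncoding` is a hypothetical helper, not something jschardet provides, and the regular expressions are simplifications: they only look at the first `maxBytes` bytes, assume an ASCII-compatible encoding (so UTF-16 files would not match), and do not validate the label they find.

```javascript
// Hypothetical helper: read a declared encoding from an XML declaration
// or an HTML <meta> tag near the start of a file.
function sniffDeclaredEncoding(bytes, maxBytes = 1024) {
  // Decode the prefix as latin1 so every byte maps to exactly one character;
  // the markers themselves are plain ASCII in ASCII-compatible encodings.
  const head = Buffer.from(bytes).subarray(0, maxBytes).toString("latin1");

  // XML declaration, e.g. <?xml version="1.0" encoding="windows-1251"?>
  const xml = /<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  if (xml) return xml[1].toLowerCase();

  // HTML, e.g. <meta charset="utf-8"> or the older
  // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  const meta = /<meta\s[^>]*?charset\s*=\s*["']?([A-Za-z][A-Za-z0-9._-]*)/i.exec(head);
  if (meta) return meta[1].toLowerCase();

  // No recognizable marker found.
  return null;
}

console.log(sniffDeclaredEncoding('<?xml version="1.0" encoding="windows-1251"?><a/>'));
// prints "windows-1251"
```

The result is only what the file claims about itself, so it would sit next to a byte-level guess as a hint, not replace it.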