Detect encoding by looking for specific markers #41

Open
bpasero opened this issue Nov 18, 2017 · 6 comments

Comments

@bpasero
Contributor

bpasero commented Nov 18, 2017

I was wondering if jschardet would ever consider to understand specific markers within a file to get the encoding from. For example, XML can have an encoding in the header:

<?xml version="1.0" encoding="windows-1251"?>

and HTML as well:

<meta charset="..."/>

There may be other languages where this exists too.

Refs: microsoft/vscode#36230
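For illustration, the kind of marker lookup asked about here might look roughly like the sketch below. It is not jschardet code: `sniffDeclaredEncoding`, its `maxBytes` parameter and the returned object shape are all made up for this example, and it assumes the file is in an ASCII-compatible encoding so the marker itself can be read before the real encoding is known.

```javascript
// Hypothetical helper, not part of jschardet: look for an encoding marker
// near the start of a file and return the label it declares, if any.
function sniffDeclaredEncoding(bytes, maxBytes = 1024) {
  // Decode the head as latin1 so every byte maps to exactly one character.
  // Assumption: the real encoding is ASCII-compatible (not UTF-16, etc.).
  const head = Buffer.from(bytes).subarray(0, maxBytes).toString("latin1");

  // XML declaration, e.g. <?xml version="1.0" encoding="windows-1251"?>
  // (leading whitespace is tolerated here for leniency).
  const xml = /^\s*<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z0-9._:-]+)["']/i.exec(head);
  if (xml) return { source: "xml", encoding: xml[1].toLowerCase() };

  // HTML, e.g. <meta charset="..."> or a charset inside a content attribute.
  const meta = /<meta\s[^>]*?charset\s*=\s*["']?([A-Za-z0-9._:-]+)/i.exec(head);
  if (meta) return { source: "html", encoding: meta[1].toLowerCase() };

  // No marker found: the caller has to fall back to byte-based detection.
  return null;
}
```

The returned label is only what the file claims about itself; whether the bytes actually match it is a separate question.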

@aadsm
Owner

aadsm commented Nov 18, 2017

That’s a good question and I did think about it. The reason I ended up not doing it is that it’s not uncommon for a file to declare encoding X while actually being stored in Y. It’s a common source of bugs. In this case I would prefer to defer this type of bug detection to an IDE so it can inform the user about it.

I’m happy to re-evaluate this though. I’ve noticed you’re working on VSCode, so you could provide better insight on this. My assumption was also that IDEs would prefer to know the actual byte encoding instead of relying on metadata. Otherwise it’d be impossible for the IDE to know the real encoding (though I guess I could always add an option to opt out of metadata usage).
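The "declares X, stored as Y" situation could be surfaced by comparing the two answers rather than picking one. The sketch below is only an illustration: `checkDeclaredEncoding`, `normalizeLabel` and the `minConfidence` threshold are hypothetical names, and `detected` is assumed to be an object of the form `{ encoding, confidence }` produced by some byte-based detector.

```javascript
// Crude label normalization so "UTF-8", "utf8" and "utf_8" compare equal.
// Assumption: real alias tables (e.g. cp1251 vs windows-1251) are out of scope.
function normalizeLabel(label) {
  return String(label).toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Compare the encoding a file declares with what a byte-based detector found.
// `declared` is a label string or null; `detected` is { encoding, confidence }.
function checkDeclaredEncoding(declared, detected, minConfidence = 0.8) {
  if (!declared || !detected || !detected.encoding) {
    return { status: "unknown" };
  }
  if (normalizeLabel(declared) === normalizeLabel(detected.encoding)) {
    return { status: "match", encoding: detected.encoding };
  }
  // The detector disagrees and is confident: likely a wrong declaration.
  if (detected.confidence >= minConfidence) {
    return { status: "mismatch", declared, actual: detected.encoding };
  }
  // The detector disagrees but is not confident enough to call it a bug.
  return { status: "uncertain", declared, guess: detected.encoding };
}
```

An IDE could turn a `"mismatch"` result into a notification, which keeps the byte-based guess and the metadata available side by side.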

@bpasero
Contributor Author

bpasero commented Nov 19, 2017

Yeah I brought this up because we got some reports from users asking for this feature and we use jschardet when users have enabled auto-guessing of encoding. Maybe this could be an option in jschardet that is not enabled by default.

I see the issue with the actual encoding being different from what is set in the file by the user. On the other hand, isn't the encoding always a guess that can be wrong? So maybe using the hints that are in the file is not a bad idea (at least optionally).
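As a rough picture of what such an opt-in could look like from the caller's side, here is a sketch. Nothing in it is an existing jschardet option: `detectEncoding`, `useMarkers`, `markerFinder` and `byteDetector` are hypothetical names, with the two callbacks standing in for a marker lookup and a byte-based detector.

```javascript
// Hypothetical wrapper: consult in-file markers only when explicitly enabled,
// otherwise behave exactly like the byte-based detector.
function detectEncoding(bytes, { byteDetector, markerFinder, useMarkers = false }) {
  if (useMarkers) {
    // markerFinder is assumed to return a label string, or null if none found.
    const declared = markerFinder(bytes);
    if (declared) {
      return { encoding: declared, confidence: null, from: "marker" };
    }
  }
  // Default path: the guess is based on the bytes alone.
  const guess = byteDetector(bytes);
  return { ...guess, from: "bytes" };
}
```

With `useMarkers` left at its default of `false`, existing callers would see no change in behavior.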

@aadsm
Owner

aadsm commented Nov 19, 2017

Yeah, the encoding is always a guess, but on the premise that it is based on the bytes and not the metadata (I originally created this library to detect Cyrillic encodings in ID3 tags that reported wrong metadata).

I’m happy to have this feature though. I currently don’t have the time to implement it (maybe during Christmas vacation though). Do you know if anyone is interested in coding this? I can provide mentoring and guidance if needed.

@OneLonelly

Have a nice time!

I have a similar problem, but "from the other side": a developer uses [email protected] (he says) and from time to time (not always!) gets a "MacCyrillic instead of Win1251" error with my files. But the headers and bodies are 1251. Do you have a sandbox where I can verify my files (perhaps I have made some error in them)?

Sincerely yours, Dmitry
[email protected]

PS Merry Christmas! :-)

@aadsm
Owner

aadsm commented Jan 16, 2018

What do you mean by a sandbox to verify your files?
You can use runkit to test the library: https://npm.runkit.com/jschardet

@LinoBarreca

LinoBarreca commented Oct 12, 2021

jschardet does its job correctly.
It detects the most likely encoding from the character codes present in the string.
The bug is actually in VSCode, because it keeps asking jschardet while completely ignoring the files it's supposed to handle.
XML is a standard.
If the standard says that the encoding is specified in a certain way, the encoding is specified.
jschardet isn't even needed. The IDE has to use the declared encoding or, at most, if it's a good IDE, tell the user:

"Hey, it's written here that the file is ISO8859 but jschardet is confident that what's in the file is encoded in UTF-8.
Did you by any chance save this file with a previous version of Visual Studio Code and it fucked your file up?
Do you want me to fix the file for you reconverting back all the characters as they were before VSCode messed with them without you even knowing it?"

Moreover, the user might have "encoding autodetect" in VSCode set to false.
In this case VSCode has to OPEN the file in the only encoding it knows, which is either the one determined by the standard (first) or the default (when the standard doesn't specify it clearly).
Instead, if autodetect is inactive, VSCode has default encoding A, the OS has B and the file has C, VSCode opens the file with encoding A, garbles the characters and saves it in A again... but you ask jschardet to fix this bug for you, even though jschardet isn't even invoked because autodetect is false.
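One reading of the order argued for above, written out as a sketch: `chooseOpenEncoding` and its field names are invented for this illustration and do not describe how VSCode or jschardet actually behave. `declared` stands for a label found in the file as its standard prescribes, and `detected` for a byte-based guess.

```javascript
// Hypothetical decision procedure for picking the encoding to open a file with.
function chooseOpenEncoding({ declared, detected, autoDetect, defaultEncoding }) {
  // 1. An encoding declared the way the format's standard prescribes wins.
  if (declared) {
    const result = { encoding: declared, reason: "declared" };
    // With auto-detection on, a disagreeing guess becomes a warning for the
    // user rather than a silent override.
    if (autoDetect && detected && detected.toLowerCase() !== declared.toLowerCase()) {
      result.warning = `File declares ${declared} but its bytes look like ${detected}`;
    }
    return result;
  }
  // 2. No declaration: use the byte-based guess only if auto-detection is on.
  if (autoDetect && detected) {
    return { encoding: detected, reason: "detected" };
  }
  // 3. Otherwise fall back to the configured default.
  return { encoding: defaultEncoding, reason: "default" };
}
```

In this sketch the default encoding is reached only when the file declares nothing and auto-detection is off or produced no guess.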

And guess what? You don't even use the "encoding detection" during "replace all", so if I want to replace a tag in an XML file (which doesn't even have chars outside [a-zA-Z]), "tagA" with "tagB", you go through the process above of ignoring everything and damage ALL the user's files at once (because you open the whole file and save the whole file).

But when dozens of users open the bug report in VSCode (because it is one) you classify it as a "feature request". 🤡
