Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto guess encoding #21416

Merged
merged 17 commits into from
Mar 28, 2017
Merged

Conversation

katainaka0503
Copy link
Contributor

This related to #5388.
This PR is modified one of #10013.

From #10013,
Added 'files.autoDetectEncoding' in settings.json to decide whether to automatically detect encoding.

@mention-bot
Copy link

@katainaka0503, thanks for your PR! By analyzing the history of the files in this pull request, we identified @bpasero and @egamma to be potential reviewers.

@msftclas
Copy link

@katainaka0503,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

@buzzzzer
Copy link

@katainaka0503
could be better used files.encoding = 'auto' instead of files.autoDetectEncoding ?

@katainaka0503
Copy link
Contributor Author

@buzzzzer
I had considered that before I implemented this. And I realized that when opening a new file, VSCode can't specify preferred encoding if 'files.encoding' is set to 'auto'.

@bpasero
Copy link
Member

bpasero commented Feb 27, 2017

Thanks for this PR, current status is that we are reviewing the implications of JSChardets LGPL license for VS Code.

@katainaka0503 katainaka0503 force-pushed the auto-detect-encoding branch 2 times, most recently from 0e327c7 to d2993c8 Compare March 1, 2017 13:30
@katainaka0503
Copy link
Contributor Author

Rebased.

@katainaka0503 katainaka0503 force-pushed the auto-detect-encoding branch from d2993c8 to 81fc6a4 Compare March 1, 2017 15:18
@@ -92,4 +93,24 @@ export function detectEncodingByBOMFromBuffer(buffer: NodeBuffer, bytesRead: num
*/
export function detectEncodingByBOM(file: string): TPromise<string> {
return stream.readExactlyByFile(file, 3).then(({buffer, bytesRead}) => detectEncodingByBOMFromBuffer(buffer, bytesRead));
}

const IGNORE_ENCODINGS = ['ascii', 'utf-8', 'utf-16', 'urf-32'];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This meant to be "utf-32" instead of "urf-32", right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Just a typo. I am so sorry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@LiPengfei19820619
Copy link

I expect this feature to be merged into VSCode. Thanks very much.

@bpasero
Copy link
Member

bpasero commented Mar 10, 2017

No status update yet from what I wrote here.

@LiPengfei19820619
Copy link

Any progress? Thanks.

@katainaka0503
Copy link
Contributor Author

katainaka0503 commented Mar 13, 2017

I think it is no use rushing.

@katainaka0503
Copy link
Contributor Author

Rebased

@msftclas
Copy link

@katainaka0503,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

@bpasero
Copy link
Member

bpasero commented Mar 19, 2017

@katainaka0503 have you thought about allowing a user to ad-hoc detect the encoding of an opened file from the encoding picker?

From a file, click on the encoding and pick the option to Reopen with Encoding:
image

In this list, the first entry (maybe with a separator) could be to automatically detect the encoding from the file:
image

One thing to keep in mind is that currently we only allow the first 512 bytes of the file to be used when detecting the encoding. This might cause some encodings to not get detected properly. If we have an explicit action for detection, we could maybe increase the buffer size to give the detection a higher chance of finding an encoding.

Otherwise this PR seems well programmed, good job. Still waiting on the license approval.

@katainaka0503
Copy link
Contributor Author

@bpasero Thanks! I am very pleased.

I think it's very good if it is able to select automatically detect encoding from encoding picker.
But I don't like to select the entry every time I open the file.

I think one of the problems is the word detect implies vscode can detect the encoding properly every time, so maybe the word guess should be used instead for example.

And also In the description of files.autoDetectEncoding , it's should be written that Files may be detected inappropriately. for example.

What's do you think?

@bpasero
Copy link
Member

bpasero commented Mar 19, 2017

@katainaka0503 yes, this would be an optional thing to "guess" the encoding from the picker in case the user does not want to opt in to the setting for all files. Since we already have jschardet in the system, this functionality could be added quite easily I guess.

As for the setting, maybe it would be better to rename it to autoGuessEncoding and have the wording be When enabled, will attempt to guess the character set encoding when opening files.

I would then also suggest to rename the flag from autoDetectEncoding to autoGuessEncoding.

As you probably already figured out, we are using iconv-lite to convert to and from encodings. If jschardet is returning an encoding with a specific name, we have to ensure that this is the same name used in iconv-lite (see here for a list of supported encodings in iconv-lite). If the names are not matching, the check for encodingExists would fail.

Finally I am running into some issues when testing this and I think this exposes a problem with encoding auto detection in general. After making changes to a file and reopening it, the detected encoding is suddenly different and characters cannot be displayed anymore. Steps to reproduce:

  • create a file with the contents below saved as cp1250 encoding (I was using Sublime Text for this)
  • open the file in VS Code with encoding detection, it gets opened as ISO-8859-2
  • add the characters öäü to the end of the file in a newline
  • reopen the file and notice how the encoding is now detected as windows-1255

Contents of file:

// A kódolás szinte minden adatátviteli és kommunikációs rendszerben használható, és a következõ európai nyelvek megjelenítésére alkalmas: bosnyák, cseh, horvát, lengyel, magyar, román, szerb (a latinbetÿs írásmóddal), szerbhorvát, szlovák, szlovén, alsó-szorb és felsõ-szorb.

After reopening:

image

@katainaka0503
Copy link
Contributor Author

katainaka0503 commented Mar 19, 2017

@bpasero Ok, I understand.
And I'll implement this.

Then, the remaining tasks are

  • Fix description.

  • Rename autoDetectEncoding to autoGuessEncoding.

  • Research whether jschardet can be configured to return only with higher confidence

  • Research difference of encoding names between jschardet and iconv-lite

  • Add autoGuessEncoding entry in encoding picker.

Thanks for precise test!

@katainaka0503
Copy link
Contributor Author

katainaka0503 commented Mar 19, 2017

@bpasero I found there are two ways to add entry in encoding picker.

One is to add autoGuessEncoding to encoding picker.
The another is to add Shift_JIS(Guessed from content) or ISO-8859-2(Guessed from content).

I think the latter is better. Because users can decide based on result of guessing encoding.

@katainaka0503 katainaka0503 changed the title Auto detect encoding Auto guess encoding Mar 19, 2017
@bpasero
Copy link
Member

bpasero commented Mar 21, 2017

Maybe we go with what we have now and wait for more user information on how this behaves in the specific cases.

@katainaka0503
Copy link
Contributor Author

katainaka0503 commented Mar 23, 2017

I added a guessed encoding entry in encoding picker as a prototype.
Now, using buffer size 512 byte.

@bpasero bpasero added this to the March 2017 milestone Mar 28, 2017
@bpasero bpasero merged commit a9b9534 into microsoft:master Mar 28, 2017
@bpasero
Copy link
Member

bpasero commented Mar 28, 2017

Merged. did some slight changes to encoding picker to prevent showing encodings multiple times.

@katainaka0503
Copy link
Contributor Author

@bpasero Thanks!

@katainaka0503 katainaka0503 deleted the auto-detect-encoding branch March 28, 2017 04:53
@bpasero
Copy link
Member

bpasero commented Mar 29, 2017

@katainaka0503 just so that you know, there is a number of cases already where encoding guessing does not work: aadsm/jschardet#31, aadsm/jschardet#29, aadsm/jschardet#30

@katainaka0503
Copy link
Contributor Author

@bpasero Thanks for letting me know! I'll check them.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants