Auto guess encoding #21416

katainaka0503 · 2017-02-25T14:12:48Z

This is related to #5388.
This PR is a modified version of #10013.

Compared with #10013, it adds a 'files.autoDetectEncoding' setting in settings.json that decides whether the encoding is detected automatically.

mention-bot · 2017-02-25T14:12:49Z

@katainaka0503, thanks for your PR! By analyzing the history of the files in this pull request, we identified @bpasero and @egamma to be potential reviewers.

msftclas · 2017-02-25T14:12:52Z

@katainaka0503,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

buzzzzer · 2017-02-27T11:13:44Z

@katainaka0503
Wouldn't it be better to use files.encoding = 'auto' instead of files.autoDetectEncoding?

katainaka0503 · 2017-02-27T12:21:58Z

@buzzzzer
I considered that before implementing this, but realized that if 'files.encoding' is set to 'auto', VS Code cannot specify a preferred encoding when opening a new file.
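
To illustrate the point about new files, here is a minimal sketch of why a separate boolean keeps 'files.encoding' usable as a fallback. The `FilesSettings` shape and the `resolveEncoding` function are hypothetical and are not taken from this PR.

```typescript
// Hypothetical shape of the two settings discussed above.
interface FilesSettings {
	encoding: string;            // preferred encoding, e.g. 'utf8' or 'shiftjis'
	autoDetectEncoding: boolean; // whether to guess the encoding from file contents
}

// Picks the encoding for a file. A new file has no content to inspect, so the
// preferred encoding must still be available. If 'encoding' were set to
// 'auto', there would be nothing to fall back to here.
function resolveEncoding(settings: FilesSettings, guessed: string | null, isNewFile: boolean): string {
	if (!isNewFile && settings.autoDetectEncoding && guessed) {
		return guessed;
	}
	return settings.encoding;
}
```

With a boolean flag, a user who mostly works with one legacy encoding can keep it as the default for new files and still opt in to guessing for existing ones.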

bpasero · 2017-02-27T12:50:22Z

Thanks for this PR. The current status is that we are reviewing the implications of jschardet's LGPL license for VS Code.

katainaka0503 · 2017-03-01T13:31:11Z

Rebased.

vzarytovskii · 2017-03-02T12:23:01Z

src/vs/base/node/encoding.ts

@@ -92,4 +93,24 @@ export function detectEncodingByBOMFromBuffer(buffer: NodeBuffer, bytesRead: num
 */
 export function detectEncodingByBOM(file: string): TPromise<string> {
 	return stream.readExactlyByFile(file, 3).then(({buffer, bytesRead}) => detectEncodingByBOMFromBuffer(buffer, bytesRead));
+}
+
+const IGNORE_ENCODINGS = ['ascii', 'utf-8', 'utf-16', 'urf-32'];


This was meant to be "utf-32" instead of "urf-32", right?

katainaka0503 · 2017-03-02

Yes, just a typo. I am so sorry.
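
For context, one way the ignore list from this hunk could be applied is sketched below. It assumes the detector reports names such as `UTF-8` or `ascii`; `shouldIgnoreGuess` is a hypothetical helper and not a function from this PR.

```typescript
// List taken from the diff above, with the typo corrected as discussed.
const IGNORE_ENCODINGS = ['ascii', 'utf-8', 'utf-16', 'utf-32'];

// Hypothetical helper: returns true when a guessed encoding should be
// discarded so that the caller falls back to the configured encoding.
function shouldIgnoreGuess(guessed: string | null): boolean {
	if (!guessed) {
		return true; // nothing was guessed
	}
	// Compare case-insensitively, since detector names may be upper case.
	return IGNORE_ENCODINGS.indexOf(guessed.toLowerCase()) >= 0;
}
```

With the original 'urf-32' spelling, a guess of `UTF-32` would never match the list, which is why the typo matters.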

LiPengfei19820619 · 2017-03-10T13:20:43Z

I am looking forward to this feature being merged into VS Code. Thanks very much.

bpasero · 2017-03-10T13:56:18Z

No status update yet from what I wrote here.

LiPengfei19820619 · 2017-03-13T05:49:04Z

Any progress? Thanks.

katainaka0503 · 2017-03-13T05:54:57Z

I think it is no use rushing.

katainaka0503 · 2017-03-17T20:12:17Z

Rebased

msftclas · 2017-03-17T20:56:01Z

@katainaka0503,
Thanks for having already signed the Contribution License Agreement. Your agreement was validated by Microsoft. We will now review your pull request.
Thanks,
Microsoft Pull Request Bot

bpasero · 2017-03-19T07:52:19Z

@katainaka0503 have you thought about allowing a user to ad-hoc detect the encoding of an opened file from the encoding picker?

From a file, click on the encoding and pick the option to Reopen with Encoding:

In this list, the first entry (maybe with a separator) could be to automatically detect the encoding from the file:

One thing to keep in mind is that currently we only allow the first 512 bytes of the file to be used when detecting the encoding. This might cause some encodings to not get detected properly. If we have an explicit action for detection, we could maybe increase the buffer size to give the detection a higher chance of finding an encoding.
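
The buffer-size trade-off described here can be sketched as follows. Only the 512-byte limit comes from the comment above; `takeDetectionSample` and the larger `EXPLICIT_SAMPLE_BYTES` value are hypothetical and chosen purely for illustration.

```typescript
// The 512-byte limit is from the discussion. The larger value is an assumed
// figure for an explicit, user-triggered detection.
const DEFAULT_SAMPLE_BYTES = 512;
const EXPLICIT_SAMPLE_BYTES = 64 * 1024;

// Returns the leading bytes of a file's content that would be handed to the
// detector. subarray() creates a view and does not copy the data.
function takeDetectionSample(content: Uint8Array, explicitAction: boolean): Uint8Array {
	const limit = explicitAction ? EXPLICIT_SAMPLE_BYTES : DEFAULT_SAMPLE_BYTES;
	return content.subarray(0, Math.min(limit, content.length));
}
```

A larger sample gives a statistical detector more characters to work with, at the cost of reading more of the file before it can be shown.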

Otherwise this PR seems well programmed, good job. Still waiting on the license approval.

katainaka0503 · 2017-03-19T08:38:09Z

@bpasero Thanks! I am very pleased.

I think it would be very good to be able to select automatic encoding detection from the encoding picker.
But I would not like to have to select that entry every time I open a file.

I think one of the problems is that the word "detect" implies VS Code can determine the encoding correctly every time, so maybe a word such as "guess" should be used instead.

Also, the description of files.autoDetectEncoding should state, for example, that files may be detected incorrectly.

What do you think?

bpasero · 2017-03-19T10:03:01Z

@katainaka0503 yes, this would be an optional thing to "guess" the encoding from the picker in case the user does not want to opt in to the setting for all files. Since we already have jschardet in the system, this functionality could be added quite easily I guess.

As for the setting, maybe it would be better to rename it to autoGuessEncoding and have the wording be "When enabled, will attempt to guess the character set encoding when opening files."

I would then also suggest to rename the flag from autoDetectEncoding to autoGuessEncoding.

As you probably already figured out, we are using iconv-lite to convert to and from encodings. If jschardet is returning an encoding with a specific name, we have to ensure that this is the same name used in iconv-lite (see here for a list of supported encodings in iconv-lite). If the names are not matching, the check for encodingExists would fail.
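
The name-matching concern can be sketched as a small normalization step. Everything here is an assumption for illustration: `SUPPORTED_KEYS` stands in for whatever keys the converter accepts, the `ALIASES` entry is only an example of a name that might need an explicit mapping, and `toConverterKey` is a hypothetical helper rather than code from this PR or from either library.

```typescript
// Hypothetical list of encoding keys the converter is assumed to accept.
const SUPPORTED_KEYS = ['utf8', 'shiftjis', 'eucjp', 'windows1250', 'iso88592', 'cp866'];

// Illustrative alias for a detector name whose normalized form would still
// differ from the converter's key (an assumed example).
const ALIASES: Record<string, string> = { ibm866: 'cp866' };

// Maps a detector name such as 'SHIFT_JIS' to a converter key, or returns
// null when no matching key is known.
function toConverterKey(detectedName: string): string | null {
	// Lower-case and drop punctuation so 'ISO-8859-2' becomes 'iso88592'.
	const normalized = detectedName.toLowerCase().replace(/[^a-z0-9]/g, '');
	const key = Object.prototype.hasOwnProperty.call(ALIASES, normalized) ? ALIASES[normalized] : normalized;
	return SUPPORTED_KEYS.indexOf(key) >= 0 ? key : null;
}
```

Returning null for an unknown name lets the caller treat it the same way as a failed guess, instead of passing an unsupported name on to the converter.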

Finally I am running into some issues when testing this and I think this exposes a problem with encoding auto detection in general. After making changes to a file and reopening it, the detected encoding is suddenly different and characters cannot be displayed anymore. Steps to reproduce:

1. Create a file with the contents below, saved in cp1250 encoding (I was using Sublime Text for this).
2. Open the file in VS Code with encoding detection; it gets opened as ISO-8859-2.
3. Add the characters öäü on a new line at the end of the file.
4. Reopen the file and notice how the encoding is now detected as windows-1255.

Contents of file:

// A kódolás szinte minden adatátviteli és kommunikációs rendszerben használható, és a következõ európai nyelvek megjelenítésére alkalmas: bosnyák, cseh, horvát, lengyel, magyar, román, szerb (a latinbetÿs írásmóddal), szerbhorvát, szlovák, szlovén, alsó-szorb és felsõ-szorb.

After reopening:

katainaka0503 · 2017-03-19T11:35:48Z

@bpasero OK, I understand.
I'll implement this.

The remaining tasks are:

- Fix the description.
- Rename autoDetectEncoding to autoGuessEncoding.
- Research whether jschardet can be configured to return only results with higher confidence.
- Research the differences in encoding names between jschardet and iconv-lite.
- Add an autoGuessEncoding entry to the encoding picker.
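
The confidence item in the task list could take roughly the following shape. This is a sketch only: it assumes the detector reports an encoding name together with a confidence between 0 and 1, and both `acceptGuess` and the 0.2 threshold are hypothetical.

```typescript
// Assumed shape of a detection result.
interface DetectionResult {
	encoding: string | null; // name reported by the detector, or null if none
	confidence: number;      // 0 (no confidence) to 1 (certain)
}

// Assumed threshold, for illustration only.
const MINIMUM_CONFIDENCE = 0.2;

// Returns the guessed encoding only when the detector is confident enough.
// Otherwise returns null so the caller can use the configured encoding.
function acceptGuess(result: DetectionResult, minimum: number = MINIMUM_CONFIDENCE): string | null {
	if (!result.encoding || result.confidence < minimum) {
		return null;
	}
	return result.encoding;
}
```

Filtering in the caller like this works even if the detector itself offers no option to suppress low-confidence results.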

Thanks for the thorough testing!

katainaka0503 · 2017-03-19T11:55:16Z

@bpasero I found there are two ways to add an entry to the encoding picker.

One is to add an autoGuessEncoding entry to the encoding picker.
The other is to add an entry such as Shift_JIS (Guessed from content) or ISO-8859-2 (Guessed from content).

I think the latter is better, because users can decide based on the result of the guess.

bpasero · 2017-03-21T02:52:55Z

Maybe we go with what we have now and wait for more user information on how this behaves in the specific cases.

katainaka0503 · 2017-03-23T03:46:18Z

I added a guessed-encoding entry to the encoding picker as a prototype.
For now, it uses a buffer size of 512 bytes.

bpasero · 2017-03-28T00:20:59Z

Merged. I made some slight changes to the encoding picker to prevent showing encodings multiple times.
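
A minimal sketch of a picker list that shows the guessed encoding first without listing it twice is given below. The `PickerEntry` shape and `buildPickerEntries` are hypothetical and do not reflect the merged implementation; only the "Guessed from content" wording comes from the discussion above.

```typescript
// Hypothetical picker entry shape.
interface PickerEntry {
	id: string;           // encoding key, e.g. 'shiftjis'
	label: string;        // text shown to the user
	description?: string; // optional hint shown next to the label
}

// Builds the picker list with the guessed encoding moved to the top and
// annotated, so that it appears exactly once.
function buildPickerEntries(supported: PickerEntry[], guessedId: string | null): PickerEntry[] {
	const entries: PickerEntry[] = [];
	let guessed: PickerEntry | undefined;
	for (const entry of supported) {
		if (entry.id === guessedId) {
			guessed = entry; // held back so it is not listed a second time
		} else {
			entries.push(entry);
		}
	}
	if (guessed) {
		entries.unshift({ id: guessed.id, label: guessed.label, description: 'Guessed from content' });
	}
	return entries;
}
```

If the guess does not correspond to any supported entry, the list is returned unchanged.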

katainaka0503 · 2017-03-28T03:02:54Z

@bpasero Thanks!

bpasero · 2017-03-29T14:24:30Z

@katainaka0503 just so that you know, there are already a number of cases where encoding guessing does not work: aadsm/jschardet#31, aadsm/jschardet#29, aadsm/jschardet#30

katainaka0503 · 2017-03-31T05:02:23Z

@bpasero Thanks for letting me know! I'll check them.

msftclas added the cla-already-signed label Feb 25, 2017

katainaka0503 force-pushed the auto-detect-encoding branch from 3bc0245 to 6e673eb February 25, 2017 14:51

bpasero self-assigned this Feb 25, 2017

katainaka0503 force-pushed the auto-detect-encoding branch 2 times, most recently from 0e327c7 to d2993c8 March 1, 2017 13:30

katainaka0503 force-pushed the auto-detect-encoding branch from d2993c8 to 81fc6a4 March 1, 2017 15:18

vzarytovskii suggested changes Mar 2, 2017

hakudev mentioned this pull request Mar 16, 2017

how to auto detect , auto encoding ? #4846

Closed

katainaka0503 force-pushed the auto-detect-encoding branch from 05a72cd to 63866ff March 17, 2017 20:11

katainaka0503 closed this Mar 17, 2017

katainaka0503 reopened this Mar 17, 2017

msftclas added the cla-already-signed label Mar 17, 2017

katainaka0503 changed the title ~~Auto detect encoding~~ Auto guess encoding Mar 19, 2017

katainaka0503 force-pushed the auto-detect-encoding branch from fa508f2 to a68d703 March 23, 2017 03:44

tomoki1207 and others added 10 commits March 23, 2017 13:01

detect encoding

519daf6

Add jschardet to shrinkwrap

5ecf4d2

Add flag to decide whether to automatically detect

d8ef757

Fixed typo

3de68cd

Rename autoDetectEncoding to autoGuessEncoding

e1a4360

Introduce minimum threshold

a2af83d

Fix encoding key

dacf174

Fix

286ce44

Add auto guessed entry to encoding picker

2194d68

Fixed name

d812384

katainaka0503 force-pushed the auto-detect-encoding branch from e160f24 to d812384 March 23, 2017 04:04

katainaka0503 and others added 7 commits March 23, 2017 13:17

Fix name

4d4cfea

Merge remote-tracking branch 'upstream/master' into auto-detect-encoding

dc8f08a

jschardet 1.4.2

5bd5d15

💄

efa2c38

💄

58e9d1e

💄

9c0dc52

fix encoding guess in picker

7915b89

bpasero added this to the March 2017 milestone Mar 28, 2017

bpasero merged commit a9b9534 into microsoft:master Mar 28, 2017

katainaka0503 deleted the auto-detect-encoding branch March 28, 2017 04:53

katainaka0503 mentioned this pull request Mar 31, 2017

Encoding auto guessing: Use buffer properly #23722

Merged

github-actions bot locked and limited conversation to collaborators Mar 27, 2020
