Python files etc. with encoding comment (PEP263) open wrong as UTF-8 always. "Stage selected ranges" false-uses UTF-8 even when enc changed #116639

kxrob · 2021-02-14T13:00:40Z

Steps to Reproduce:

Create a Python file with an encoding magic comment (PEP 263) in the first or second line, e.g.:

# -*- coding: latin_1 -*-
# test comment aouAOUäöüßàá

or:

# -*- coding: latin-1 -*-
...

# -*- coding: iso-8859-1 -*-
...

Open the file in VS Code => The encoding shown in the status bar (bottom right) is always wrongly UTF-8, and nonsense characters are displayed. In practice this causes a great mix-up, particularly when there are only a few non-ASCII characters and you don't notice the problem for a long time. I have many files (e.g. code migrated from Python 2, which had no UTF-8 default but rather a Latin-1 fallback) which do not use the UTF-8 default encoding and have an encoding magic tag.
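To illustrate the symptom, here is a minimal sketch (my own, not taken from VS Code) of what happens when the Latin-1 bytes of the example comment are decoded as UTF-8. The variable names are just for illustration.

```python
# The example comment from above, stored on disk as Latin-1 bytes.
text = "# test comment aouAOUäöüßàá\n"
raw = text.encode("latin-1")

# Strict UTF-8 decoding rejects these bytes: the Latin-1 umlauts are
# not valid UTF-8 sequences.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder shows U+FFFD replacement characters instead,
# which is the "nonsense chars" effect.
shown = raw.decode("utf-8", errors="replace")

# Decoding with the declared encoding restores the original text.
restored = raw.decode("latin-1")
```

With only a few such characters in a long file, the damaged spots are easy to overlook.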

Then, when the encoding is force-changed (click on UTF-8 in the status bar) via "Reopen with Encoding" and "Stage selected ranges" is later run from the selection's right-click menu (partial git add), UTF-8 is somehow always used when feeding content to git, even though the right encoding is used otherwise (for file display, save etc.). As a result, all lines with non-ASCII characters end up garbled in the git staging area, and so on.

The same problem exists for other file types like Ruby, which use a similar coding tag / magic comment ( coding[:=]\s*([-\w.]+) in the first 2 lines ). Note that this kind of encoding definition in text / script files is used rather universally, comparable to the XML encoding tag.

(Using the Python extension does not improve that. Anyway, this issue is too basic and belongs in the editor core.)

Hints for implementation:

`regex_encoding_cmnt = "coding[:=]\s*([-\w.]+)"`
See how SciTE and Notepad++ detect such encoding definitions from encoding magic comments/tags (and the XML tag, and by now even the HTML content-type tag, I think).
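As a rough sketch of the hint above, the check could look like this in Python. The pattern is the comment-anchored form given in PEP 263; the function name `declared_encoding` is hypothetical, and the sketch simplifies PEP 263 slightly (it does not require line 1 to be a comment or blank before looking at line 2).

```python
import re

# PEP 263 style pattern: a comment containing "coding:" or "coding=".
# With a bytes pattern, \w only matches ASCII letters, digits and "_".
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def declared_encoding(raw: bytes):
    """Return the encoding name declared in the first two lines, or None."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

For example, `declared_encoding(b"# -*- coding: latin_1 -*-\n")` yields `"latin_1"`, and a file without any tag yields `None`.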

Does this issue occur when all extensions are disabled?: Yes

VS Code Version: 1.53.2
OS Version: Win10 up-to-date/ver2004


gjsjohnmurray · 2021-02-15T04:28:17Z

VS Code uses the chardet library, and @bpasero already requested support for this there

aadsm/jschardet#41

kxrob · 2021-02-15T17:59:12Z

> VS Code uses the chardet library, and @bpasero already requested support for this there
>
> aadsm/jschardet#41

I think, no matter what is or isn't added to jschardet, it is and was completely odd from the beginning to even try to use jschardet in that context:

jschardet is a (poor) port of Python's chardet. And not even the Python library will look for PEP 263 or XML-style tags, because that is not the purpose of these libs. They are specifically designed for human natural-language text "soups", like the mp3 ID3 song texts which motivated jschardet. They are not meant for programming-language source code with precise definitions for the handling of encodings: when there is a wrong encoding definition, it is deterministically the programmer's fault. Python's chardet will likely never parse source code tags (off-topic), and the author of jschardet seems not to be convinced at all; and if he were, that would deviate from the intention of "Port of python's chardet" and the nature of the lib. So this will likely never come.

Also, jschardet does not guess UTF-8 when I check my example "latin-1" scripts, but rather "ISO-8859-2" or "windows-1250 (Hungarian)" when the text contains a few ISO-8859-1/Latin-1 (German) letters, because it does not support ISO-8859-1. It covers only a very small subset of Python's chardet. And the guess would be very weak anyway with typically only a few such characters.
Thus VS Code does not simply use that jschardet output, e.g. "ISO-8859-2" (because of the low probability?), but probably uses the UTF-8 Python 3 default, because of ".py".

So I think there is no better option than to detect those 2 or 3 widespread tag variants, with high (or sole) priority, before using any text-based guesser. In any case, I think there is no good reason at all to additionally use a text-based guesser like jschardet on source code. Lottery and deterministic compilers don't fit together.
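The priority order I mean could be sketched like this in Python. This is only an illustration under assumptions: `decode_source` and its `guesser` parameter are hypothetical names, `guesser` stands in for any text-based detector, and none of this is VS Code's actual code.

```python
import codecs
import re

# Loose pattern from the original report, applied to the first two lines only.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def decode_source(raw: bytes, guesser=None, default: str = "utf-8"):
    """Decode source bytes: BOM first, then declared tag, then guess/default.

    Returns a (text, encoding_name) tuple.
    """
    # 1. A UTF-8 BOM is unambiguous.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"

    # 2. An explicit tag in the first two lines wins over any guessing.
    for line in raw.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            name = match.group(1).decode("ascii")
            try:
                codecs.lookup(name)
            except LookupError:
                break  # unknown codec name: fall through to step 3
            # A wrong declaration raises here: the programmer's fault.
            return raw.decode(name), name

    # 3. Only now consult an optional guesser (hypothetical callable
    #    taking bytes and returning an encoding name or None).
    name = (guesser(raw) if guesser else None) or default
    return raw.decode(name, errors="replace"), name
```

With this order, a file tagged `# -*- coding: latin-1 -*-` is decoded as Latin-1 regardless of what a statistical guesser would say about its few non-ASCII bytes.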

I think the issue should be re-opened or be addressed somehow.

See also:
#36230 (comment)
(and the other comments in #36230)

github-actions bot added the new release label Feb 14, 2021

weinand assigned bpasero Feb 15, 2021

bpasero added the *duplicate label (issue identified as a duplicate of another issue) Feb 15, 2021

github-actions bot locked and limited conversation to collaborators Apr 1, 2021
