Python files etc. with encoding comment (PEP263) open wrong as UTF-8 always. "Stage selected ranges" false-uses UTF-8 even when enc changed #116639

Closed
kxrob opened this issue Feb 14, 2021 · 3 comments
kxrob commented Feb 14, 2021

Steps to Reproduce:

Create a Python file with an encoding magic comment (PEP 263) in the first or second line, e.g.:

# -*- coding: latin_1 -*-
# test comment aouAOUäöüßàá

or:

# -*- coding: latin-1 -*-
...

# -*- coding: iso-8859-1 -*-
...

Open the file in VS Code => the encoding shown in the status bar (bottom right) is always UTF-8, which is wrong, and garbage characters are displayed. In practice this causes a lot of confusion - particularly when the file contains only a few non-ASCII characters and you don't notice the problem for a long time. I have many files (e.g. code migrated from Python 2, which had no UTF-8 default but rather a Latin-1 fallback) that are not UTF-8 encoded and carry an encoding magic comment.
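To illustrate the effect, here is a minimal Python sketch of what happens when Latin-1 bytes are decoded as UTF-8. It only demonstrates the decoding mismatch and makes no claim about VS Code's actual code path; the variable names are made up for this example.

```python
# The sample comment line from the reproduction steps.
line = "# test comment aouAOUäöüßàá"

# What is actually stored on disk for a Latin-1 encoded file.
raw = line.encode("latin-1")

# Decoding with the declared encoding restores the text exactly.
as_latin1 = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 fails on every non-ASCII byte;
# errors="replace" substitutes U+FFFD, which is the garbage shown on screen.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

The ASCII part of the line survives either way, which is why the problem is easy to overlook in files with only a few non-ASCII characters.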

When the encoding is then changed by force (click on UTF-8 in the status bar) via "Reopen with Encoding", and later "Stage selected ranges" is run from the selection's right-click menu (a partial git add), UTF-8 still seems to be used when feeding the content to git - even though the correct encoding is used everywhere else (file display, saving, etc.). As a result, every line containing non-ASCII characters ends up garbled in the git staging area and in everything downstream of it.

The same problem exists for other file types such as Ruby, which use a similar coding tag / magic comment ( coding[:=]\s*([-\w.]+) in the first 2 lines ). Note that this kind of encoding definition in text / script files is used almost universally - comparable to the XML encoding declaration.

(Using the Python extension does not help. In any case, this issue is too basic for an extension and belongs in the editor core.)

Hints for implementation:

  • `regex_encoding_cmnt = "coding[:=]\s*([-\w.]+)"`
  • See how SciTE and Notepad++ detect such encoding definitions from encoding magic comments/tags (and the XML tag, and by now even the HTML content-type tag, I think)
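The regex hint above could be applied roughly as follows. This is only a sketch in Python to show the idea, not a proposal for the actual VS Code implementation (which would be TypeScript); the function name `declared_encoding` is made up here, and the pattern is the slightly stricter form given in PEP 263, run on raw bytes so nothing has to be decoded first.

```python
import re

# PEP 263 style pattern: a comment line containing "coding:" or "coding=".
# In a bytes pattern, \w only matches ASCII letters, digits and underscore.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def declared_encoding(raw: bytes):
    """Return the encoding named in the first two lines, or None."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


print(declared_encoding(b"# -*- coding: latin_1 -*-\n# test comment\n"))
```

For Python sources specifically, the standard library already ships this logic as `tokenize.detect_encoding`, which could serve as a reference for the exact rules.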

Does this issue occur when all extensions are disabled?: Yes

  • VS Code Version: 1.53.2
  • OS Version: Win10 up-to-date/ver2004
@gjsjohnmurray
Contributor

VS Code uses the chardet library, and @bpasero already requested support for this there

aadsm/jschardet#41

@bpasero bpasero added the *duplicate Issue identified as a duplicate of another issue(s) label Feb 15, 2021
@kxrob
Author

kxrob commented Feb 15, 2021

> VS Code uses the chardet library, and @bpasero already requested support for this there
>
> aadsm/jschardet#41

I think that, regardless of what is or is not added to jschardet, it was odd from the beginning to even try to use jschardet in this context:

jschardet is a (poor) port of Python's chardet. Not even the Python library looks for PEP 263 or XML-style tags, because that is not the purpose of these libraries. They are specifically designed for natural-language text "soups" - like MP3 ID3 song texts, which motivated jschardet. They are not meant for programming-language source code, which has precise rules for handling encodings: when an encoding declaration is wrong, it is deterministically the programmer's fault. Python's chardet will likely never parse source-code tags (off-topic), and the author of jschardet does not seem convinced at all - and even if he were, it would deviate from the stated intention of a "Port of python's chardet" and from the nature of the library. So this will likely never happen.

Also, jschardet does not guess UTF-8 when I check my example "latin-1" scripts, but something like "ISO-8859-2" or "windows-1250 (Hungarian)" when the text contains a few ISO-8859-1/Latin-1 (German) letters, because it does not support ISO-8859-1. It covers only a very small subset of Python's chardet. And the guess would be very weak anyway, since there are typically only a few such characters.
So VS Code apparently does not simply use jschardet's output, e.g. "ISO-8859-2" (because of the low confidence?), but probably falls back to the Python 3 default of UTF-8 - because of the ".py" extension.

So I think there is no better option than to detect those 2 or 3 widespread tag variants - with high (or exclusive) priority, before any text-based guesser is used. In any case, I see no good reason to additionally run a text-based guesser like jschardet on source code. A lottery and deterministic compilers don't fit together.
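The priority order I have in mind could look roughly like this. Again this is only a Python sketch under assumptions: `choose_encoding` and its `guess` parameter are hypothetical names, and `guess` merely stands in for whatever text-based guesser (such as jschardet) the editor would otherwise consult.

```python
import codecs
import re

CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def choose_encoding(raw: bytes, guess=None, default="utf-8"):
    """Pick an encoding: declared tag first, then optional guesser, then default."""
    # 1. An explicit declaration in the first two lines always wins.
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            name = match.group(1).decode("ascii")
            try:
                codecs.lookup(name)  # accept only codec names Python knows
                return name
            except LookupError:
                break  # unknown name: fall through to the next step
    # 2. Only without a usable declaration is the guesser asked (if any).
    if guess is not None:
        guessed = guess(raw)
        if guessed:
            return guessed
    # 3. Otherwise use the language default.
    return default


print(choose_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
```

Passing no `guess` at all corresponds to the "only priority" variant, where source files are never run through a statistical guesser.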

I think the issue should be re-opened or addressed in some other way.

See also:
#36230 (comment)
(and the other comments in #36230)

@github-actions github-actions bot locked and limited conversation to collaborators Apr 1, 2021