Bug: the concept of jstrencode(1) appears to be both very wrong and very buggy #28

Open
1 task done
xexyl opened this issue Nov 14, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@xexyl
Owner

xexyl commented Nov 14, 2024

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

Based on JavaScript encoding of JSON, there appear to be multiple problems with our tool.

First of all, it appears that converting a \ to a \\ is wrong. I give examples in the "What you expect" section.

The next problem is that \uxxxx code points should be converted to Unicode symbols, just as jstrdecode(1) does. This is how JavaScript behaves too: both encode and decode need to do this.

Another problem is that a \ followed by a character that is not a valid escape should not be handled the way we handle it. See "What you expect" for examples.

What you expect

In order of the problems above, here are examples of what jstrencode(1) does and what JavaScript does.

If we have the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\bo"
        };

it SHOULD (i.e. this is what JavaScript does) encode to:

{"name":"༏f\bo"}

but our tool converts the string to:

$  jstrencode '"\u0f0ff\bo"'
\"\\u0f0ff\\bo\"

Notice the double \ before the b! As for the escaped quotes at the beginning and end, see the "Anything else?" section.
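For reference, the JavaScript side of this comparison can be reproduced with a sketch like the one below. It assumes that JSON.stringify() is the encoder that was used, which is not stated above, and the codePoints and encoded names are only for illustration.

```javascript
// Assumption: JSON.stringify() is the encoder used for the JavaScript side.
const json_to_encode = {
  name: "\u0f0ff\bo"
};

// The string literal has already been processed by the JavaScript parser,
// so the value holds U+0F0F, 'f', U+0008 (backspace) and 'o'.
const codePoints = Array.from(json_to_encode.name, (ch) => ch.codePointAt(0));

// Encode the object as JSON text.
const encoded = JSON.stringify(json_to_encode);

console.log(codePoints); // [ 3855, 102, 8, 111 ]
console.log(encoded);    // {"name":"༏f\bo"}
```

Note that the encoder never sees a backslash here: it receives a backspace character and writes it out as the two characters \b.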

Now as far as the code points go, the JavaScript:

const json_to_encode = {
          name: "\u0f0f"
        };

converts to:

{"name":"༏"}

but our tool does this:

$ jstrencode '{"name": "\u0f0f"}'
{\"name\": \"\\u0f0f\"}

or as just a string:

$ jstrencode '"\u0f0f"'
\"\\u0f0f\"

If the \uxxxx were converted to a Unicode symbol, the only remaining difference might be the \" surrounding the output, but I am not sure of this.
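One way to compare the two behaviours is to model the tool's command line argument in JavaScript. The sketch below assumes that the shell's single quotes hand jstrencode(1) the eight characters "\u0f0f" verbatim, and it uses String.raw to stand in for that; the variable names are hypothetical.

```javascript
// Assumption: the shell command  jstrencode '"\u0f0f"'  passes the eight
// characters  "\u0f0f"  to the tool unchanged; String.raw models that.
const raw = String.raw`"\u0f0f"`;

// Reading the text as JSON turns the escape into the code point U+0F0F.
const decoded = JSON.parse(raw);

// UTF-8 bytes of that one-character string, in hex.
const utf8 = Array.from(new TextEncoder().encode(decoded), (b) => b.toString(16));

// Encoding the raw text itself as a JSON string escapes the quotes and
// the backslash; slice() drops the enclosing quotes that stringify adds.
const reencoded = JSON.stringify(raw).slice(1, -1);

console.log(raw.length); // 8
console.log(decoded);    // ༏
console.log(utf8);       // [ 'e0', 'bc', '8f' ]
console.log(reencoded);  // \"\\u0f0f\"
```

If this model is right, the tool's current output corresponds to encoding the raw text as a JSON string, while the JavaScript examples start from a string in which the \uxxxx escape has already been resolved.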

The third problem appears to be even worse. There might be other cases where something like this happens but anyway the JavaScript:

const json_to_encode = {
          name: "\u0f0ff\co"
        };

... turns into:

{"name":"༏fco"}

Notice how the \c has the \ silently removed and the c is by itself (or rather it's after the unicode symbol and before the o).

But our tool does something extremely wrong. First as a string by itself:

$ jstrencode '"\u0f0ff\co"'

xexyl@xexyz:~$

.. or as the json with {}:

$ jstrencode {"name":"\u0f0ff\co"}

xexyl@xexyz:~$

As can be seen the encoding concept we have appears to be totally wrong.

Environment

  • OS: Linux for tool tests, macOS for JavaScript, but it really should be n/a
  • Device: n/a
  • Compiler: n/a

jparse_bug_report.sh output

n/a

Anything else?

As for the escaped " surrounding the string: I guess it depends on how this tool will be used, but even so it would appear to be the wrong default.

Whether the doubling of \ is wrong also depends on how we need to use the tool.

But the way we handle a \ before a character that is not a valid escape clearly seems to be wrong.

Of course it could be that I am tired or that I do not really know JavaScript, but I think it probably isn't either of those. Then again, it depends on whether our tool needs to deviate from a normal JSON encoder ....

@xexyl
Owner Author

xexyl commented Nov 14, 2024

One thing that should be done, and which I will do, is to rename the utf8encode() function so that its name suggests Unicode rather than encoding or decoding. This way there is no confusion in the decode function.

xexyl added a commit that referenced this issue Nov 14, 2024
Renamed utf8encode() to utf8_to_unicode() to be less confusing as although
converting a code point to unicode is called encoding (from all sources we have
seen) in JSON, according to encoders **AND** decoders out there, a code point in a
string should be converted to unicode.

A number of bugs have been identified in jstrencode(1) during discussion in
'the other repo' (or one of the 'other repos'). This is in jstrencode(1) now
(as of yesterday); prior to yesterday it was in jstrdecode(1) due to the
unfortunate swap in names. This swap happened because, when focusing on issue #13
(the decoding - which turned out to be encoding - bug of \uxxxx), sight of the
fact that the jstrencode(1) tool is not strictly UTF-8 but rather JSON
was lost. The man page has had these bugs added so it is important to
remove them when the bugs are fixed. A new issue #28 has been opened for
these problems.
@xexyl
Owner Author

xexyl commented Nov 14, 2024

Something I do not understand, looking at the so-called JSON spec and the JavaScript output, is this: for a \ followed by a character that is not a valid escape, it would appear that the \ should just be part of the string. But in JavaScript it looks like the \ is removed and only the character remains.

I presume that this is because I do not understand the standard correctly (if we can call something about it 'correct' :-) ) or because I am tired and not reading the grammar right. I rather suspect it's a bit of both.
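One way to check which stage removes the \ is sketched below. It assumes a standard JavaScript engine such as Node.js, and it again uses String.raw as a stand-in for text that reaches a tool unchanged (for example from a single-quoted shell argument). The variable names are hypothetical.

```javascript
// In an ordinary (non-template) string literal, "\c" is not a recognised
// escape, so the JavaScript parser keeps only the character after the \.
const fromLiteral = "\u0f0ff\co";

// String.raw keeps the backslashes, so this is the literal JSON text
// "\u0f0ff\co" including its surrounding quotes.
const rawText = String.raw`"\u0f0ff\co"`;

// Try to read the raw text as JSON and record any error.
let parseError = null;
try {
  JSON.parse(rawText);
} catch (err) {
  parseError = err;
}

console.log(fromLiteral.includes("\\"));            // false
console.log(JSON.stringify({ name: fromLiteral })); // {"name":"༏fco"}
console.log(parseError instanceof SyntaxError);     // true
```

In this sketch the \ is already gone before JSON.stringify() runs, and JSON.parse() rejects the raw text outright, which would suggest that the removal comes from JavaScript's string literal rules rather than from the JSON grammar.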
