[Bug] bin tools are using jstrencode(1) when they should be decoding JSON encoded strings #2752
Comments
We remain concerned with the issue of JSON string encoding and decoding. We suspect that the details of GH-issuecomment-2471493104 were either unclear or glossed over. We don't have access to the shell now, but when we return we plan to do more testing. Nevertheless what we saw in the code concerned us enough to file this bug and halt the Great Fork Merge process. |
Okay. Without having the time to read this I am admittedly confused because taking UTF-8 to Unicode is encoding and it seemed to work. But I will have to look at this tomorrow. Sorry. |
I just read it and it doesn't make sense. It sounds like a terminology issue but taking a UTF-8 code point and converting it into a Unicode symbol is encoding not decoding. Please explain what you are getting at. I will look at this tomorrow. |
UTF-8 is an encoding of Unicode stuff. JSON string encoding of "real" strings, decoding of JSON encoded strings into "real" strings is another matter. When "\uD83D\uDD25" one expects to get the "real" string of:
P.S. The man page for
The term "JSON decoded strings" is somewhat meaningless. JSON doesn't decode strings. The so-called JSON specification requires all strings to be encoded, and thus produce JSON encoded strings. See GH-issuecomment-2471493104. The man page for
|
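The expectation stated above for "\uD83D\uDD25" can be checked outside of the tools. This is only a minimal sketch, assuming a Node.js or browser runtime, and it uses JavaScript's built-in JSON.parse as a stand-in reference decoder. It is not how jstrencode(1) or jstrdecode(1) are implemented.

```javascript
// String.raw keeps the backslashes, so the JSON parser (not the JavaScript
// parser) is what processes the \uD83D\uDD25 escape sequences.
const jsonEncoded = String.raw`"\uD83D\uDD25"`;

// The same JSON string written with the character itself instead of escapes.
const jsonLiteral = '"🔥"';

// Decoding either JSON encoded string yields the same "real" string.
const fromEscapes = JSON.parse(jsonEncoded);
const fromLiteral = JSON.parse(jsonLiteral);

console.log(jsonEncoded.length);          // 14: two quotes plus two 6-character escapes
console.log(fromEscapes);                 // the fire emoji
console.log(fromEscapes === fromLiteral); // true
```

The two inputs differ as JSON text but decode to the same string, which is the behavior this comment asks for from a decoder.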
All sources say the opposite of what you're saying though. That's why I swapped the terms. The man page can be updated of course as can documentation. |
JSON requires strings to be encoded. This JSON string encoding has nil to do with how UTF-8 encodes Unicode stuff. Quoting from GH-issuecomment-2471493104: JSON string encoding, at a minimum, requires the string to be surrounded by double quotes. At a minimum, encoding will result in the prepending and appending of a double quote character. JSON string encoding ALSO requires one to convert things like ASCII newlines into "\n". There are other important back-slashing requirements such as dealing with double quotes within the "real" string, backslashes, tabs, etc. (that need to be handled during the JSON string encoding process). Quoting from GH-issuecomment-2471493104 again: Encoding this "real" string:
into this JSON encoded string: "This \"string\" has a newline\nin the middle and at the end\n"
The above is JSON string encoding. Now the question of what to do with non-ASCII stuff IS a matter for Unicode / UTF-8 encoding and decoding. But that is NOT JSON string encoding, nor is it decoding of JSON encoded strings. The issue 13 in the other repo raised a concern about how JSON encoded strings that contained \uHexHexHexHex-stuff were being handled as a side effect of decoding JSON encoded strings. Tools such as
$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"
NOTE: The pipe input above is a JSON encoded string. The output produced by the BTW: we think the The "\u0153\u00df\u00e5\u00e9" and produce the original "real" string: Instead jstrencode(1) does this! We don't mind if, in the process of JSON string encoding (what So by all means, let And by all means, let We just need to ALSO be sure that when Both "\uD83D\uDD25" and "🔥" are valid JSON encoded strings. They should convert into the same real string 🔥 as well. We hope this helps. |
If there is a comment bug in the source comments, then those should be fixed. We looked at the comment for the Are we missing something? There certainly could have been some copy and paste errors when code for one side of encoding was converted into decoding, for example.
UPDATE 0
We recommend prioritizing GH-issuecomment-2472008592 and the core of this issue first, before worrying about source code comments and man pages. |
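To make the core of this issue concrete, here is a minimal sketch that uses JavaScript's built-in JSON.stringify and JSON.parse purely as a stand-in reference for JSON string encoding and decoding. It says nothing about how jstrencode(1) or jstrdecode(1) are implemented. The "real" string below is reconstructed from the encoded form quoted earlier in this thread, so treat it as an assumption.

```javascript
// Assumed "real" string: contains double quotes and two actual newlines.
const realString = 'This "string" has a newline\nin the middle and at the end\n';

// JSON string encoding: surround with double quotes, backslash-escape the
// embedded quotes, and turn each newline into the two characters \ and n.
const encoded = JSON.stringify(realString);
console.log(encoded);
// "This \"string\" has a newline\nin the middle and at the end\n"

// Decoding the JSON encoded string gives back the "real" string.
const decoded = JSON.parse(encoded);
console.log(decoded === realString); // true

// A JSON encoded string made of \uXXXX escapes (String.raw keeps the
// backslashes so that JSON.parse is what interprets them).
const escaped = String.raw`"\u0153\u00df\u00e5\u00e9"`;
console.log(JSON.parse(escaped)); // œßåé
```

In this reference behavior, turning \uXXXX escapes into characters happens on the decoding side, and the encoder is free to leave non-ASCII characters as they are.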
It could be a comment bug yes: when I tried to (later) correct the comments by swapping from one function to the other, it might have been a mistake. However ... I see something really odd.
Okay so then it maybe isn't a comment bug?
It could be. But if the encoding comment looks correct and I swapped it then it seems like it might be right and this is not an issue? But even so see below.
Of course. I'll post a new comment with something funny though. |
Here's something funny. All sources say that converting
const json_to_encode = {
name: "\u0f0f"
};
const json_to_decode = '{"name": "\u0f0f"}';
const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);
document.writeln(json_decoded.name);
document.write(json_encoded);
shows in the html file:
which suggests that BOTH encoding and decoding convert the Now given that the function we have is What do you think? |
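One thing that may explain the result above: inside a normal JavaScript string literal, the JavaScript parser itself converts \u0f0f into the character before JSON.parse or JSON.stringify ever runs. The following is a hedged sketch of that effect, assuming a Node.js or browser runtime. It uses String.raw to keep the backslash so that the JSON layer is what actually sees the escape.

```javascript
// In a normal string literal the JavaScript parser converts \u0f0f first,
// so the JSON text handed to JSON.parse already contains the character.
const cooked = '{"name": "\u0f0f"}';

// String.raw keeps the six characters \u0f0f, so JSON.parse does the decoding.
const raw = String.raw`{"name": "\u0f0f"}`;

console.log(cooked.length); // 13
console.log(raw.length);    // 18

// Both parse to the same value, but only `raw` exercised JSON decoding
// of a \uXXXX escape.
console.log(JSON.parse(cooked).name === JSON.parse(raw).name); // true

// JSON.stringify receives the character and leaves it as is.
console.log(JSON.stringify({ name: "\u0f0f" })); // {"name":"༏"}
```

So the snippet in the comment shows the JavaScript source escape being resolved in both cases. It does not show that JSON encoding converts \uXXXX escapes.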
Meanwhile I had an interesting idea based on this, as I was waking up (a common thing of programmers as you surely know :-) ). What if the jstrencode/jstrdecode tools had an option to parse the string as JSON? It would be something like the following (I think - I only just looked at it and would have to look at it more). For encoding:
and for decoding:
Now what would the purpose of this be? Perhaps sanity checks or maybe for some experiment or else because part of the parsing is the encoding (which might suggest that only the encoding one should have the option but that's why the post/pre actions). |
Yes. But the point is that when UTF-8
Aha. Now I wonder about this. Perhaps the problem is that some of the options need to be moved from one tool to the other? That seems likely. Though in that case will the output of the UTF-8 code points for the website then show the right string? If it adds
Okay so it seems likely that some of the options/functionality has to be moved over too, and the names alone cannot be swapped? For instance:
$ cat nl
$ jstrdecode < nl
\n\n
$ jstrencode < nl
Warning: json_encode: found non-\-escaped char: 0x0a
Warning: jstrencode_stream: error while encoding stdin buffer
Warning: main: error while encoding processing stdin
? Unless there is also some confusion with the terms?
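For comparison with the session above, here is a small sketch using JavaScript's JSON functions as a stand-in reference (an assumption for illustration, not the tools' implementation). It shows which direction accepts raw newline characters and which direction rejects them.

```javascript
// Two raw newline characters (0x0a 0x0a).
const twoNewlines = "\n\n";

// Encoding accepts raw newlines and escapes each one as backslash + n.
const encoded = JSON.stringify(twoNewlines);
console.log(encoded);        // "\n\n" (with the surrounding double quotes)
console.log(encoded.length); // 6

// Decoding rejects a raw, unescaped newline inside a JSON string.
let rejected = false;
try {
  JSON.parse('"' + twoNewlines + '"');
} catch (err) {
  rejected = err instanceof SyntaxError;
}
console.log(rejected); // true
```

By this reference, turning raw newlines into \n is the encoding direction, and complaining about an unescaped 0x0a is what a decoder is expected to do.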
Okay and I see...
$ jstrdecode < foo.txt
This \"string\" has a newline\nin the middle and at the end\n
But the thing is how do we determine when to encode and when to decode, then? I can see what you mean here: JSON with a
Hmm ... okay so that might be something to consider too. The question we have I think then: what do we do here? Perhaps the tool names should be reverted again but then we have to decide how to proceed with the code points?
Yes.
Right. It is encoded. Which suggests that the decoded is the
When it converts it to
That's because all the sources say that that IS encoding, not decoding. Except for the example I gave above which suggests both do it. And this is why the tools here use the jstrencode tool, not the jstrdecode tool. So minus the fact that the javascript example suggests that both encoding and decoding should print out the fire based on the emoji I gave (above), it should be good, unless you want to quibble about terminology. But since below you talk about how both should do it then it seems like that matters less if at all. Now as you say though, the json encoding/decoding is not the same thing as unicode. On the other hand you did raise the problem of it not doing it at all.
I wonder if there is a way to do it for both like the javascript above shows.
It does this already indeed.
Well jstrdecode will take the fire emoji and output the fire emoji:
$ cat fire.json | jstrdecode -n
🔥\n
but it appears that the Still it should I believe also output the emoji for both encoding and decoding.
Is this the real problem then? The tool names should be swapped back and both should print out the fire emoji from the fire code point? This way the json encoded strings will be correctly encoded but still print out the encoded form? I think that sounds reasonable but how to go about it I'm not sure yet.
That's true. And that's what happens with |
I guess the following. Please correct me if I'm wrong.
I have a thought on what might allow this to happen but I am not sure. I have to look at the code too. |
.. first step is to swap the names again. That'll be fun. |
Okay the filenames and terms are swapped. The next step would be to make sure that both encode and decode convert code points to unicode symbols. I think I have an idea how to do this. But I will have to go afk very soon. |
Made a commit ... not pushing yet. Have to go afk. Once I'm back I'll work on the unicode problem. Then we can figure out which tool belongs in the website. Hoping I can manage this today. |
Ugh. The real problem is that the json_encode() function (after name change) uses the table ... not the parsing manually. I don't know how to fix that yet. I'll ponder it as I'm afk (or part of the time). |
Well I have an idea but unfortunately it might have to be done first .. but that means the table access will be wrong. This is because the '` for example in |
Just pushed the changes noted above. In a bit I will look at seeing if I can figure out the encoding/decoding of code points. |
Hmm .. my initial idea will not work. This is turning into a nightmare. |
Have another thought. Looking into it. |
That does not work either because the table converts |
I think I got it! Have to do more testing ... |
The one problem is that doing... $ jstrencode '\' fails when it should print |
I discovered another bug too ... check this:
$ jstrencode '\a'
'\\a
Should not have the first character there, but just
UPDATE 0
Or perhaps that is a display issue? |
Oh! I know why. It's the echo; using |
Great .. head detached. Might have to clone repo again on server.
UPDATE 0
Well that was easy to fix. Much better. |
Here's a bug though that I just uncovered. Seems old at a guess:
$ printf '\b' | ./jstrencode
\b
:( Well have to go afk a bit now ... back later I hope.
UPDATE 0
On the other hand ..
$ echo '\b' | ./jstrencode
\\b\n
What should this be really? Why does for some
UPDATE 1
Oh perhaps the printf is evaluating the ... either way this should be documented .. |
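The two outputs above are consistent with the two commands sending different bytes. printf interprets '\b' as a single backspace character (0x08), while an echo that does not interpret backslash escapes sends a backslash, a 'b' and a trailing newline. Whether echo interprets escapes varies between shells, so that part is an assumption about the shell in use. A minimal sketch with JSON.stringify as a stand-in reference encoder:

```javascript
// What printf '\b' emits: one backspace character, code 0x08.
const backspace = "\b";

// What echo '\b' emits when echo does not interpret escapes:
// a backslash, the letter b, and a trailing newline.
const backslashB = "\\b\n";

// slice(1, -1) drops the surrounding double quotes for easier comparison
// with the tool output shown in the thread.
console.log(JSON.stringify(backspace).slice(1, -1));  // \b
console.log(JSON.stringify(backslashB).slice(1, -1)); // \\b\n
```

With this reading both results are correct JSON string encodings of their respective inputs, and the difference comes from the shell side rather than from the encoder.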
Might (not sure) have made progress .. have to go for a bit though. |
Okay well one thought is this. Since it is not STRICTLY a UTF-8 encoder but rather making a JSON encoder, it could be considered 'okay', maybe, to not have it encode the code points. But on the other hand other encoders show both. The question is whether those also show |
QUESTION THAT HAS TO BE ANSWERED
Okay this is extremely interesting.
const json_to_encode = {
name: "\u0f0f\\"
};
const json_to_decode = '{"name": "\u0f0f"}';
const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);
document.writeln(json_decoded.name);
document.write(json_encoded);
This shows just:
So the question is how to even deal with the I guess it might be different if the Nope. And I guess this is because it turns it into an escape character which might or might not be valid. So am I inputting things wrong or is the concept of the |
We seem to have a much bigger problem. See the bug report I just opened which has at least three problems. Now since the website relies on the restored jstrdecode and not jstrencode this might not be a problem for the website which would be good but the issues I have uncovered seem to be really bad. Please let me know what you want me to do. My guess is I should sync the current code to mkiocccentry and then also test I can do the testing later today possibly but if not today then tomorrow. Nonetheless the jstrencode tool seems to be broken in numerous ways. Any help would be EXTREMELY appreciated. |
We are not sure we see the issue here. This is valid JSON: {"name":"༏\\"} The so-called JSON spec requires a backslash to be JSON encoded into a double backslash. To JSON encode this string, To decode the JSON encoded string: "༏\" one should get: |
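The backslash case can be traced with a short sketch, again using JSON.stringify and JSON.parse only as a stand-in reference and assuming a Node.js or browser runtime. Note that in JavaScript source "\\" is itself an escape for ONE backslash, which adds a second layer of escaping on top of the JSON one.

```javascript
// "Real" string: the character U+0F0F followed by ONE backslash.
// (In JavaScript source, "\\" denotes a single backslash character.)
const value = { name: "\u0f0f\\" };
console.log(value.name.length); // 2

// JSON encoding doubles the backslash.
const encoded = JSON.stringify(value);
console.log(encoded); // {"name":"༏\\"}

// Decoding turns the double backslash back into a single one.
const decoded = JSON.parse(encoded);
console.log(decoded.name.length);         // 2
console.log(decoded.name === value.name); // true
```

The doubled backslash only exists in the JSON encoded form. The decoded "real" string has a single backslash after the ༏ character.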
Yes but JavaScript decoding of json does not have that. I don't understand it. Well it might be for if you want a literal backslash. However I think the issue was something else, something that is a much bigger problem (related to the escaping of characters in the bug report I opened today) and will require other changes. Thus I think this right here is not as important to consider. Thankfully jstrdecode does work at least and that's what the website needs. |
Should we revert the code base to before these changes occurred? I.e., revert the "other repo" back to commit 65f3d4ceecce4f9307ec82cd0ec0cb8e5a39b003 and revert this repo back to commit 184cc9b. We realize that there are a few typo and other fixes, not related to the jstrencode / jstrdecode tools, made after those commits that might have to be reapplied. But doing this would avoid the current problems and restore the web site to a state that worked well. Some very selective fixes could be applied back on top of the revert while avoiding the problematic code. We did a test of this reversion and besides an issue of
UPDATE 0
I am sure we can find a commit for this and the other repo for which the web site worked. Then any fixes and typo/editor changes that were done after that can be reapplied. Sadly, we do NOT have much time left. |
The tools are still broken as noted in comments in the other repo. |
That example you give should work with jstrdecode as it stands. |
See my reply there. It's not as simple as that. In some cases quotes do need to be escaped and in other cases they do not. So the decoder assumes that it's inside a string when your input was the string itself. |
... but what I meant was the tool with the purpose of utf-8 to unicode works. |
I think it would be a big mistake to revert. Many other fixes were made, some quite important. The jstrdecode works fine for utf-8 to unicode. |
So please do not revert. Let's see how this goes. I'm sure that after |
And so you know: the change in terms was just that: the functionality was fine as the commands were swapped. The only change here is the tool names. |
Can you make some PRs that fix things in the next day or two? If you can, we would be happy to not revert. |
In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file. |
I can. As soon as make www goes fine I will. I just have to let it go through. If I don't get the commit in today I'll do so tomorrow morning. |
If we reverted, then certain editorial and typo and related fixes would be re-applied to the set. |
Much more would be lost too, though. |
... and there are probably more than you realise. I have no idea what I did with 2018/ferguson and 2020/ferguson2 for instance. Nu checker was fixed. Many other things. It would be a nightmare and it would solve nothing as it was just a terminology swap. The unicode symbols are already being shown correctly. Keep in mind that although yes the tools were renamed (in error it seems) the tools were also changed here so there was no functional difference here; the only things changed were names. That's it. |
We are faced with a difficult choice .. not hold IOCCC28 until sometime next year, or revert to the web site state from before we went on vacation and then selectively apply the edits on top of that. Right now it appears that things are not in a stable shape. But as you suggested that can be fixed. We are in extreme travel complications that will take us a few days to get out of .. and we will be off the internet again for about 2 days. Can you get this repo and the other repo into good shape by then? If so we would be very happy to go with that! |
Like I said the html files are fine it's just the tool names. I already have in my clone the tools swapped again and just checking But right now the html files are as they should be: unicode symbols and so on. There should be no delay in the contest at all. So yes I can get this repo fine. Just waiting for |
.. and safe travels! |
|
Ah .. encoding has the skip quotes. Maybe decoding needs that as well? |
Is there an existing issue for this?
Describe the bug
Tools such as:
Are using
jstrencode(1)
when they need to take JSON encoded strings and decode them into "real strings" for use in markdown and HTML pages.
What you expect
Tools such as:
would use
jstrdecode(1)
to decode JSON encoded strings, converting them into "real" strings.
Environment
Anything else?
Things may have gone amiss with commit e58bd97 (dated: Fri Nov 1 06:45:02 2024 -0700)
This seems to be linked to commit 0a7c9673fa7797f1e9c2c87dea377edb03816f03 (dated: Thu Oct 31 11:57:38 2024 -0700) from the "other repo".
UPDATE 0
This is a Great Fork Merge show stopper.