[Bug] bin tools are using jstrencode(1) when they should be decoding JSON encoded strings #2752

Open
lcn2 opened this issue Nov 12, 2024 · 64 comments
Labels: bug, top priority

Comments


lcn2 commented Nov 12, 2024

Is there an existing issue for this?

  • I have searched for existing issues and did not find anything like this

Describe the bug

Tools such as:

  • bin/cvt-submission.sh
  • bin/entry2csv.sh
  • bin/gen-authors.sh
  • bin/gen-location.sh
  • bin/gen-years.sh
  • bin/output-index-author.sh
  • bin/output-year-index.sh
  • bin/subst.entry-index.sh

are using jstrencode(1) when they need to take JSON encoded strings and decode them into "real" strings for use in markdown and HTML pages.

What you expect

Tools such as:

  • bin/cvt-submission.sh
  • bin/entry2csv.sh
  • bin/gen-authors.sh
  • bin/gen-location.sh
  • bin/gen-years.sh
  • bin/output-index-author.sh
  • bin/output-year-index.sh
  • bin/subst.entry-index.sh

would use jstrdecode(1) to decode JSON encoded strings, converting them into "real" strings.
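
For illustration, here is a minimal sketch of the intended direction of the conversion (the actual pipelines inside the bin/*.sh scripts differ; the jstrdecode(1) result shown is the behavior this issue expects, and quote handling details are discussed later in this thread):

# current (wrong direction): re-encoding an already JSON encoded string
printf '%s' '"\uD83D\uDD25"' | jstrencode

# expected (decode the JSON encoded string into a "real" string for markdown/HTML)
printf '%s' '"\uD83D\uDD25"' | jstrdecode
# expected output: 🔥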

Environment

  • OS: n/a
  • Device: n/a
  • Compiler: n/a

Anything else?

Things may have gone amiss with commit e58bd97 (dated: Fri Nov 1 06:45:02 2024 -0700)

This seems to be linked to commit 0a7c9673fa7797f1e9c2c87dea377edb03816f03 (dated: Thu Oct 31 11:57:38 2024 -0700) from the "other repo".

UPDATE 0

This is a Great Fork Merge show stopper.

lcn2 added the bug and top priority labels on Nov 12, 2024

lcn2 commented Nov 12, 2024

We remain concerned with the issue of JSON string encoding and decoding.

We suspect that the details of GH-issuecomment-2471493104 were either unclear or glossed over.

We don't have access to the shell now, but when we return we plan to do more testing. Nevertheless what we saw in the code concerned us enough to file this bug and halt the Great Fork Merge process.


xexyl commented Nov 12, 2024

Okay. Without having the time to read this I am admittedly confused because taking UTF-8 to Unicode is encoding and it seemed to work.

But I will have to look at this tomorrow. Sorry.


xexyl commented Nov 12, 2024

I just read it and it doesn't make sense. It sounds like a terminology issue but taking a UTF-8 code point and converting it into a Unicode symbol is encoding not decoding.

Please explain what you are getting at. I will look at this tomorrow.


lcn2 commented Nov 12, 2024

UTF-8 is an encoding of Unicode stuff.

JSON string encoding of "real" strings, and the decoding of JSON encoded strings back into "real" strings, is another matter.

When jstrdecode(1) decodes a JSON encoded string such as:

"\uD83D\uDD25"

one expects to get the "real" string of:

🔥

P.S.

The man page for jstrencode(1), as we think we mentioned before, seems wrong. It reads:

jstrencode encodes JSON decoded strings given on the command line.

The term "JSON decoded strings" is somewhat meaningless. JSON doesn't decode strings. The so-called JSON specification requires all strings to be encoded, and thus produce JSON encoded strings. See GH-ssuecomment-2471493104.

The man page for jstrencode(1) should read something like:

jstrencode encodes strings into JSON encoded strings according to the so-called JSON specification.


xexyl commented Nov 12, 2024

All sources say the opposite of what you're saying though. That's why I swapped the terms.

The man page can be updated of course as can documentation.


lcn2 commented Nov 13, 2024

I just read it and it doesn't make sense. It sounds like a terminology issue but taking a UTF-8 code point and converting it into a Unicode symbol is encoding not decoding.

Please explain what you are getting at. I will look at this tomorrow.

JSON requires strings to be encoded. This JSON string encoding has nil to do with how UTF-8 encodes Unicode stuff.

Quoting from GH-issuecomment-2471493104:

JSON string encoding, at a minimum, requires the string to be surrounded by double quotes. At a minimum, encoding will result in prepending and appending a double quote character.

JSON string encoding ALSO requires one to convert things like ASCII newlines into "\n". There are other important back-slashing requirements, such as dealing with double quotes that are within the "real" string, backslashes, tabs, etc. (these need to be handled during the JSON string encoding process).

Quoting from GH-issuecomment-2471493104 again:

Encoding this "real" string:

This "string" has a newline
in the middle and at the end

into this JSON encoded string:

"This \"string\" has a newline\nin the middle and at the end\n"

The above is JSON string encoding.
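
As a sketch of that intended encoding direction (not a captured session; the actual jstrencode(1) output may differ while this issue is open, for example whether the surrounding double quotes are printed):

printf 'This "string" has a newline\nin the middle and at the end\n' | jstrencode
# expected JSON encoded string, per the example above:
# "This \"string\" has a newline\nin the middle and at the end\n"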

Now the question of what to do with non-ASCII stuff IS a matter for Unicode / UTF-8 encoding and decoding. But that is NOT JSON string encoding, nor is it decoding of JSON encoded strings.

Issue 13 in the other repo raised a concern about how JSON encoded strings that contained \uHexHexHexHex-stuff were being handled as a side effect of decoding JSON encoded strings.

Tools such as jsp(1) format or "pretty print" JSON encoded strings like this:

$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"

NOTE: The pipe input above is a JSON encoded string. The output produced by the jsp(1) tool is also a JSON encoded string. The jsp(1) tool happens to choose to format or "pretty print" by printing \uHexHexHexHex stuff. While it is a questionable practice from the point of view of formatting or "pretty printing", it is technically correct, as BOTH JSON encoded strings are valid and are equivalent.

BTW: we think the jsp(1) tool makes things ugly when it does that sort of formatting.

The jstrdecode(1) tool and the internal JSON encoded string decoding functionality need to be able to take the JSON encoded string:

"\u0153\u00df\u00e5\u00e9"

and produce the original "real" string: œßåé.

Instead, jstrencode(1) does this!

We don't mind if, in the process of JSON string encoding (what jstrencode(1) should do), Unicode stuff is left alone and/or converted in such a way that decoding (what jstrdecode(1) should do) produces nice looking Unicode stuff.

So by all means, let jstrencode(1) convert the string: 🔥 by JSON encoding into "🔥" as that is valid JSON.

And by all means, let jstrdecode(1) convert the JSON encoded string "🔥" into the real string: 🔥

We just need to ALSO be sure that when jstrdecode(1) is given the JSON encoded string "\uD83D\uDD25" we get the real string: 🔥 as well.

Both "\uD83D\uDD25" and "🔥" are valid JSON encoded strings. They should convert into the same real string 🔥 as well.

We hope this helps.


lcn2 commented Nov 13, 2024

All sources say the opposite of what you're saying though. That's why I swapped the terms.

If there is a comment bug in the source comments, then those should be fixed.

We looked at the comment for the json_encode(char const *ptr, size_t len, size_t *retlen) function in jparse/json_parse.c in the "other repo". It looks correct.

Are we missing something? There certainly could have been some copy and paste errors when code for one side of encoding was converted into decoding, for example.

UPDATE 0

We recommend prioritizing GH-issuecomment-2472008592 and the core of this issue first, before worrying about source code comments and man pages.


xexyl commented Nov 13, 2024

All sources say the opposite of what you're saying though. That's why I swapped the terms.

If there is a comment bug in the source comments, then those should be fixed.

It could be a comment bug yes: when I tried to (later) correct the comments by swapping from one function to the other, it might have been a mistake. However ... I see something really odd.

We looked at the comment for the json_encode(char const *ptr, size_t len, size_t *retlen) function in jparse/json_parse.c in the "other repo". It looks correct.

Okay so then it maybe isn't a comment bug?

Are we missing something? There certainly could have been some copy and paste errors when code for one side of encoding was converted into decoding, for example.

It could be. But if the encoding comment looks correct and I swapped it then it seems like it might be right and this is not an issue? But even so see below.

UPDATE 0

We recommend prioritizing GH-issuecomment-2472008592 and the core of this issue first, before worrying about source code comments and man pages.

Of course.

I'll post a new comment with something funny though.


xexyl commented Nov 13, 2024

Here's something funny.

All sources say that converting '\u0f0f' to its unicode symbol is ENCODING. But what about JavaScript's JSON object (or whatever it is)? Check this javascript out:

const json_to_encode = {
    name: "\u0f0f"
};
const json_to_decode = '{"name": "\u0f0f"}';

const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);

document.writeln(json_decoded.name);
document.write(json_encoded);

shows in the html file:

༏ {"name":"༏"}

which suggests that BOTH encoding and decoding convert the \uxxxx to its unicode symbol!

Now, given that the function we have is utf8encode(), and given many other sources, I would say that the right term to use for what we have is ENCODE; and given that it does print out the correct string on the website, this is probably okay.

What do you think?


xexyl commented Nov 13, 2024

Meanwhile I had an interesting idea based on this, as I was waking up (a common thing of programmers as you surely know :-) ).

What if the jstrencode/jstrdecode tools had an option to parse the string as JSON? It would be (I think - I would have to look at it more and only just did) something like...

For encoding:

	-j		parse as JSON post encoding (def: do not)
	-J level	set JSON debug level

and for decoding:

	-j		parse as JSON pre encoding (def: do not)
	-J level	set JSON debug level

Now what would the purpose of this be? Perhaps sanity checks or maybe for some experiment or else because part of the parsing is the encoding (which might suggest that only the encoding one should have the option but that's why the post/pre actions).


xexyl commented Nov 13, 2024

I just read it and it doesn't make sense. It sounds like a terminology issue but taking a UTF-8 code point and converting it into a Unicode symbol is encoding not decoding.
Please explain what you are getting at. I will look at this tomorrow.

JSON requires strings to be encoded. This JSON string encoding has nil to do with how UTF-8 encodes Unicode stuff.

Yes. But the point is that when UTF-8 '\u0f0f' (for example) is encoded it turns into ༏ and that's exactly what jstrencode does.

Quoting from GH-issuecomment-2471493104:

JSON string encoding, at a minimum, requires the string to be surrounded by double quotes. At a minimum, encoding will result in prepending and appending a double quote character.

Aha. Now I wonder about this. Perhaps the problem is that some of the options need to be moved from one tool to the other? That seems likely. Though in that case will the output of the UTF-8 code points for the website then show the right string? If it adds "s then it would be the wrong output, right? Perhaps that's why you want the decode tool to also do this?

JSON string encoding ALSO requires one to convert things like ASCII newlines into "\n". There are other important back-slashing requirements, such as dealing with double quotes that are within the "real" string, backslashes, tabs, etc. (these need to be handled during the JSON string encoding process).

Okay so it seems likely that some of the options/functionality has to be moved over too, and the names alone cannot be swapped? For instance:

$ cat nl


$ jstrdecode < nl
\n\n
$ jstrencode < nl
Warning: json_encode: found non-\-escaped char: 0x0a
Warning: jstrencode_stream: error while encoding stdin buffer
Warning: main: error while encoding processing stdin

?

Unless there is also some confusion with the terms?

Quoting from GH-issuecomment-2471493104 again:

Encoding this "real" string:

This "string" has a newline
in the middle and at the end

into this JSON encoded string:

"This \"string\" has a newline\nin the middle and at the end\n"

The above is JSON string encoding.

Okay and I see...

$ jstrdecode  < foo.txt 
This \"string\" has a newline\nin the middle and at the end\n

But the thing is how do we determine when to encode and when to decode, then? I can see what you mean here: JSON with a " inside a quote is wrong unless it's escaped. But on the other hand encoding does seem to also convert a code point into its unicode symbol. This is a mess!

Now the question of what to do with non-ASCII stuff IS a matter for Unicode / UTF-8 encoding and decoding. But that is NOT JSON string encoding, nor is it decoding of JSON encoded strings.

Hmm ... okay so that might be something to consider too. The question we have I think then: what do we do here? Perhaps the tool names should be reverted again but then we have to decide how to proceed with the code points?

Issue 13 in the other repo raised a concern about how JSON encoded strings that contained \uHexHexHexHex-stuff were being handled as a side effect of decoding JSON encoded strings.

Yes.

Tools such as jsp(1) format or "pretty print" JSON encoded strings like this:

$ echo '"œßåé"' | jsp --no-color --indent 4
"\u0153\u00df\u00e5\u00e9"

NOTE: The pipe input above is a JSON encoded string. The output produced by the jsp(1) tool is also a JSON encoded string. The jsp(1) tool happens to choose to format or "pretty print" by printing \uHexHexHexHex stuff. While it is a questionable practice from the point of view of formatting or "pretty printing", it is technically correct, as BOTH JSON encoded strings are valid and are equivalent.

Right. It is encoded. Which suggests that the decoded is the \uxxxx string. Though as my example javascript above shows it seems that it's both. Now as you say this is not json encoding/decoding so perhaps I do need to swap the names again. But in this case we do also need to determine how to encode the code points in the case of it happening. Unless we do not want this feature?

BTW: we think the jsp(1) tool makes things ugly when it does that sort of formatting.

When it converts it to \uxxxx you mean?

The jstrdecode(1) tool and the internal JSON encoded string decoding functionality need to be able to take the JSON encoded string:

"\u0153\u00df\u00e5\u00e9"

and produce the original "real" string: œßåé.

Instead, jstrencode(1) does this!

That's because all the sources say that that IS encoding, not decoding. Except for the example I gave above which suggests both do it.

And this is why the tools here use the jstrencode tool, not the jstrdecode tool. So minus the fact that the javascript example suggests that both encoding and decoding should print out the fire based on the emoji I gave (above), it should be good, unless you want to quibble about terminology. But since below you talk about how both should do it then it seems like that matters less if at all.

Now as you say though, the json encoding/decoding is not the same thing as unicode. On the other hand you did raise the problem of it not doing it at all.

We don't mind if, in the process of JSON string encoding (what jstrencode(1) should do), Unicode stuff is left alone and/or converted in such a way that decoding (what jstrdecode(1) should do) produces nice looking Unicode stuff.

I wonder if there is a way to do it for both like the javascript above shows.

So by all means, let jstrencode(1) convert the string: 🔥 by JSON encoding into "🔥" as that is valid JSON.

It does this already indeed.

And by all means, let jstrdecode(1) convert the JSON encoded string "🔥" into the real string: 🔥

Well jstrdecode will take the fire emoji and output the fire emoji:

$ cat fire.json | jstrdecode -n
🔥\n

but it appears that the -n option does not work. Hmm.... I wonder why.

Still it should I believe also output the emoji for both encoding and decoding.

We just need to ALSO be sure that when jstrdecode(1) is given the JSON encoded string "\uD83D\uDD25" we get the real string: 🔥 as well.

Is this the real problem then? The tool names should be swapped back and both should print out the fire emoji from the fire code point? This way the json encoded strings will be correctly encoded but still print out the encoded form? I think that sounds reasonable but how to go about it I'm not sure yet.

Both "\uD83D\uDD25" and "🔥" are valid JSON encoded strings. They should convert into the same real string 🔥 as well.

That's true. And that's what happens with jstrencode(1). That's why the confusion. But as above... I'm not sure.


xexyl commented Nov 13, 2024

I guess the following. Please correct me if I'm wrong.

  1. The tools should have their names swapped again. This is because the point was for JSON encoding/decoding, not encoding of the strings themselves.
  2. Then both decoding and encoding of the code points should turn it into the proper unicode symbol.
  3. After this the documentation can be updated for both.

I have a thought on what might allow this to happen but I am not sure. I have to look at the code too.


xexyl commented Nov 13, 2024

.. first step is to swap the names again. That'll be fun.


xexyl commented Nov 13, 2024

Okay the filenames and terms are swapped. The next step would be to make sure that both encode and decode convert code points to unicode symbols. I think I have an idea how to do this. But I will have to go afk very soon.


xexyl commented Nov 13, 2024

Made a commit ... not pushing yet. Have to go afk. Once I'm back I'll work on the unicode problem. Then we can figure out which tool belongs in the website. Hoping I can manage this today.


xexyl commented Nov 13, 2024

Ugh. The real problem is that the json_encode() function (after name change) uses the table ... not the parsing manually. I don't know how to fix that yet. I'll ponder it as I'm afk (or part of the time).


xexyl commented Nov 13, 2024

Well I have an idea but unfortunately it might have to be done first .. but that means the table access will be wrong. This is because the \ in, for example, \u will be changed to \\u. So something has to be figured out. But it might end up that this table is not needed, or can be made more useful. We shall see!


xexyl commented Nov 13, 2024

Just pushed the changes noted above. In a bit I will look at seeing if I can figure out the encoding/decoding of code points.


xexyl commented Nov 13, 2024

Hmm .. my initial idea will not work. This is turning into a nightmare.


xexyl commented Nov 13, 2024

Have another thought. Looking into it.


xexyl commented Nov 13, 2024

That does not work either because the table converts \ into two \ so the \u is not matched. It is not clear to me if it's okay to change it to be a single one .. yet.


xexyl commented Nov 13, 2024

I think I got it! Have to do more testing ...


xexyl commented Nov 13, 2024

The one problem is that doing...

$ jstrencode '\'

fails when it should print \\. It seems like the encode function might need to go through the string itself, somehow, like the decoding process, but with slightly different rules.


xexyl commented Nov 13, 2024

I discovered another bug too ... check this:

$ jstrencode '\a'
'\\a

Should not have the first character there, but just \\a.

UPDATE 0

Or perhaps that is a display issue?


xexyl commented Nov 14, 2024

Oh! I know why. It's the echo; using printf(1) and it works fine.


xexyl commented Nov 14, 2024

Great .. head detached. Might have to clone repo again on server.

UPDATE 0

Well that was easy to fix. Much better.


xexyl commented Nov 14, 2024

Here's a bug though that I just uncovered. Seems old at a guess:

$ printf '\b' | ./jstrencode 
\b

:(

Well have to go afk a bit now ... back later I hope.

UPDATE 0

On the other hand ..

$ echo '\b' | ./jstrencode 
\\b\n

What should this be really? Why does printf work for some (and echo does not), but for others it is the opposite?

UPDATE 1

Oh perhaps the printf is evaluating the \b?

... either way this should be documented ..
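
For what it's worth, the difference above is most likely the shell's own escape handling rather than jstrencode(1): printf(1) interprets \b in its format string as a real backspace (0x08), while an echo that does not expand escapes passes the two characters \ and b through unchanged (echo behavior is shell-dependent). A quick check, independent of jstrencode(1):

printf '\b' | od -c    # shows a single \b (0x08) byte
echo '\b' | od -c      # shows \ b (plus a trailing newline) when echo does not expand escapes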


xexyl commented Nov 14, 2024

Might (not sure) have made progress .. have to go for a bit though.


xexyl commented Nov 14, 2024

Okay well one thought is this.

Since it is not STRICTLY a UTF-8 encoder but rather a JSON encoder, it could be considered 'okay', maybe, to not have it encode the code points. But on the other hand other encoders show both. The question is whether those also show \\ for \. I have to test this out.


xexyl commented Nov 14, 2024

QUESTION THAT HAS TO BE ANSWERED

Okay this is extremely interesting.

const json_to_encode = {
    name: "\u0f0f\\"
};
const json_to_decode = '{"name": "\u0f0f"}';

const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);

document.writeln(json_decoded.name);
document.write(json_encoded);

This shows just:

༏ {"name":"༏\\"}

So the question is how to even deal with the \. I mean the \ escapes the next character, right?

I guess it might be different if the \ is in the middle of the string? Let's see...

Nope. And I guess this is because it turns it into an escape character which might or might not be valid.

So am I inputting things wrong or is the concept of the jstrencode(1) tool wrong? This has to be answered before proceeding (possibly).


xexyl commented Nov 14, 2024

We seem to have a much bigger problem. See the bug report I just opened which has at least three problems.

xexyl/jparse#28.

Now since the website relies on the restored jstrdecode and not jstrencode this might not be a problem for the website which would be good but the issues I have uncovered seem to be really bad.

Please let me know what you want me to do. My guess is I should sync the current code to mkiocccentry and then also test make www in the website (after updating the tools to use the proper version of jstrdecode(1)).

I can do the testing later today possibly but if not today then tomorrow.

Nonetheless the jstrencode tool seems to be broken in numerous ways. Any help would be EXTREMELY appreciated.


lcn2 commented Nov 14, 2024

QUESTION THAT HAS TO BE ANSWERED

Okay this is extremely interesting.

const json_to_encode = {
    name: "\u0f0f\\"
};
const json_to_decode = '{"name": "\u0f0f"}';

const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);

document.writeln(json_decoded.name);
document.write(json_encoded);

This shows just:

༏ {"name":"༏\\"}

So the question is how to even deal with the \. I mean the \ escapes the next character, right?

I guess it might be different if the \ is in the middle of the string? Let's see...

Nope. And I guess this is because it turns it into an escape character which might or might not be valid.

So am I inputting things wrong or is the concept of the jstrencode(1) tool wrong? This has to be answered before proceeding (possibly).

We are not sure we see the issue here. This is valid JSON:

{"name":"\\"}

The so-called JSON spec requires a backslash to be JSON encoded into a double backslash.

To JSON encode this string, ༏\ one gets: "༏\\"

To decode the JSON encoded string: "༏\\" one should get: ༏\ back.
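
A sketch of that backslash round trip (this is the intended behavior described above; the current tool output may differ while this issue is open):

printf '%s' '༏\' | jstrencode        # expected: "༏\\"
printf '%s' '"༏\\"' | jstrdecode     # expected: ༏\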


xexyl commented Nov 14, 2024

QUESTION THAT HAS TO BE ANSWERED

Okay this is extremely interesting.

const json_to_encode = {
    name: "\u0f0f\\"
};
const json_to_decode = '{"name": "\u0f0f"}';

const json_decoded = JSON.parse(json_to_decode);
const json_encoded = JSON.stringify(json_to_encode);

document.writeln(json_decoded.name);
document.write(json_encoded);

This shows just:

༏ {"name":"༏\\"}

So the question is how to even deal with the \. I mean the \ escapes the next character, right?
I guess it might be different if the \ is in the middle of the string? Let's see...
Nope. And I guess this is because it turns it into an escape character which might or might not be valid.
So am I inputting things wrong or is the concept of the jstrencode(1) tool wrong? This has to be answered before proceeding (possibly).

We are not sure we see the issue here. This is valid JSON:

{"name":"\\"}

The so-called JSON spec requires a backslash to be JSON encoded into a double backslash.

To JSON encode this string, ༏\ one gets: "༏\\"

To decode the JSON encoded string: "༏\\" one should get: ༏\ back.

Yes, but JavaScript decoding of json does not have that. I don't understand it. Well, it might be for when you want a literal backslash. However I think the issue was something else, something that is a much bigger problem (related to the escaping of characters in the bug report I opened today) and will require other changes. Thus I think this right here is not as important to consider.

Thankfully jstrdecode does work at least and that's what the website needs.


lcn2 commented Nov 14, 2024

We seem to have a much bigger problem. See the bug report I just opened which has at least three problems.

xexyl/jparse#28.

Now since the website relies on the restored jstrdecode and not jstrencode this might not be a problem for the website which would be good but the issues I have uncovered seem to be really bad.

Please let me know what you want me to do. My guess is I should sync the current code to mkiocccentry and then also test make www in the website (after updating the tools to use the proper version of jstrdecode(1)).

I can do the testing later today possibly but if not today then tomorrow.

Nonetheless the jstrencode tool seems to be broken in numerous ways. Any help would be EXTREMELY appreciated.

Should we revert the code base to before these changes occurred?

I.e., to revert the "other repo" back to: commit 65f3d4ceecce4f9307ec82cd0ec0cb8e5a39b003

and to revert this repo back to: commit 184cc9b

We realize that there are a few typo fixes not related to the jstrencode / jstrdecode tools after those commits that might have to be reapplied. But doing this would avoid the problematic code and restore the web site to a state that worked well. Some very selective fixes could then be applied back on top of the revert.

We did a test of this reversion and, besides an issue with 2020/kurdyukov4 in years.html that can be cleaned up, the site works.

UPDATE 0

I am sure we can find a commit, for this and the other repo, for which the web site worked. Then any typo/editorial fixes that were made after that point can be reapplied.

Sadly, we do NOT have much time left.


lcn2 commented Nov 14, 2024

Thankfully jstrdecode does work at least and that's what the website needs

The tools are still broken as noted in comments in the other repo.


xexyl commented Nov 14, 2024

We seem to have a much bigger problem. See the bug report I just opened which has at least three problems.
xexyl/jparse#28.
Now since the website relies on the restored jstrdecode and not jstrencode this might not be a problem for the website which would be good but the issues I have uncovered seem to be really bad.
Please let me know what you want me to do. My guess is I should sync the current code to mkiocccentry and then also test make www in the website (after updating the tools to use the proper version of jstrdecode(1)).
I can do the testing later today possibly but if not today then tomorrow.
Nonetheless the jstrencode tool seems to be broken in numerous ways. Any help would be EXTREMELY appreciated.

Should we revert the code base to before these changes occurred?

I.e., to revert the "other repo" back to: commit 65f3d4ceecce4f9307ec82cd0ec0cb8e5a39b003

and to revert this repo back to: commit 184cc9b

We realize that there are a few typo fixes not related to the jstrencode / jstrdecode tools after those commits that might have to be reapplied. But doing this would avoid the problematic code and restore the web site to a state that worked well. Some very selective fixes could then be applied back on top of the revert.

We did a test of this reversion and, besides an issue with 2020/kurdyukov4 in years.html that can be cleaned up, the site works.

That example you give should work with jstrdecode as it stands.


xexyl commented Nov 14, 2024

Thankfully jstrdecode does work at least and that's what the website needs

The tools are still broken as noted in comments in the other repo.

See my reply there. It's not as simple as that. In some cases quotes do need to be escaped and in other cases they do not.

So the decoder assumes that it's inside a string when your input was the string itself.


xexyl commented Nov 14, 2024

... but what I meant was that the tool whose purpose is UTF-8 to Unicode conversion works.


xexyl commented Nov 14, 2024

I think it would be a big mistake to revert. Many other fixes were made, some quite important.

The jstrdecode works fine for utf-8 to unicode.


xexyl commented Nov 14, 2024

So please do not revert. Let's see how this goes. I'm sure that after make www nothing will be changed.


xexyl commented Nov 14, 2024

And so you know: the change in terms was just that: the functionality was fine as the commands were swapped. The only change here is the tool names.


lcn2 commented Nov 14, 2024

So please do not revert. Let's see how this goes. I'm sure that after make www nothing will be changed.

Can you make some PRs that fix things in the next day or two? If you can, we would be happy to not revert.


xexyl commented Nov 14, 2024

In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file.


xexyl commented Nov 14, 2024

So please do not revert. Let's see how this goes. I'm sure that after make www nothing will be changed.

Can you make some PRs that fix things in the next day or two? If you can, we would be happy to not revert.

I can. As soon as make www goes fine I will. I just have to let it go through. If I don't get the commit in today I'll do so tomorrow morning.


lcn2 commented Nov 14, 2024

In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file.

If we reverted, then certain editorial and typo and related fixes would be re-applied to the set.


xexyl commented Nov 14, 2024

In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file.

If we reverted, then certain editorial and typo and related fixes would be re-applied to the set.

Much more would be lost too, though.


xexyl commented Nov 14, 2024

... and there are probably more than you realise. I have no idea what I did with 2018/ferguson and 2020/ferguson2 for instance. Nu checker was fixed. Many other things. It would be a nightmare and it would solve nothing as it was just a terminology swap. The unicode symbols are already being shown correctly.

Keep in mind that although yes the tools were renamed (in error it seems) the tools were also changed here so there was no functional difference here; the only things changed were names. That's it.


lcn2 commented Nov 14, 2024

In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file.

If we reverted, then certain editorial and typo and related fixes would be re-applied to the set.

Much more would be lost too, though.

We are faced with a difficult choice .. not hold IOCCC28 until sometime next year, or revert to the web site state from before we went on vacation and then selectively apply the edits on top of that.

Right now it appears that things are not in a stable shape. But as you suggested that can be fixed.

We are in the middle of extreme travel complications that will take us a few days to get out of .. and we will be off the internet again for about 2 days.

Can you get this repo and the other repo into good shape by then? If so we would be very happy to go with that!


xexyl commented Nov 14, 2024

In fact reverting to that commit you suggest would remove the fixes to the issue Leo opened, too, plus I think the html.sed file.

If we reverted, then certain editorial and typo and related fixes would be re-applied to the set.

Much more would be lost too, though.

We are faced with a difficult choice .. not hold IOCCC28 until sometime next year, or revert to the web site state from before we went on vacation and then selectively apply the edits on top of that.

Right now it appears that things are not in a stable shape. But as you suggested that can be fixed.

We are in the middle of extreme travel complications that will take us a few days to get out of .. and we will be off the internet again for about 2 days.

Can you get this repo and the other repo into good shape by then? If so we would be very happy to go with that!

Like I said, the html files are fine; it's just the tool names. I already have the tools swapped again in my clone and am just checking make www to make sure everything goes fine.

But right now the html files are as they should be: unicode symbols and so on. There should be no delay in the contest at all.

So yes, I can get this repo into good shape. Just waiting for make www to be done. As long as no html files changed (besides bin/index.html, due to the change in tool names) I cannot imagine any other html file changing. Just bin/ files.


xexyl commented Nov 14, 2024

An example of how the website is just fine:

[screenshot: Screenshot 2024-11-14 at 11 51 29]

Everything is well ... except the tool names which I'm working on now.


xexyl commented Nov 14, 2024

.. and safe travels!


xexyl commented Nov 14, 2024

make www reported no problems! After you merge it should be fine. I had a thought on the jstrdecode issue with ": it could be that if it's the first and last character it would be valid but otherwise not. I have this vague memory that there already exists this functionality so maybe a flag has to be passed; not sure but I have to go now.


xexyl commented Nov 14, 2024

Ah .. encoding has the skip quotes. Maybe decoding needs that as well?
