Write 4-byte characters (surrogate pairs) instead of escapes #1335

rnetuka · 2024-09-12T11:22:10Z

Issue: #223

A fix for issue #223. Some languages (Japanese, Chinese, ...) use 4-byte characters and it's annoying to have them marshaled as escapes.

I've tried to fix this with minimum intrusions. I've only added this feature/fix to _writeStringSegment2 method(s) and left the rest without it. If you like the changes for other methods calling _outputMultiByteChar, just let me know.

pjfanning · 2024-09-12T11:59:59Z

this probably imposes a performance hit on all users
some users may even prefer the existing behaviour
I think this should be only applied if a new JsonGenerator.Feature is added and defaulted to false.

https://www.javadoc.io/static/com.fasterxml.jackson.core/jackson-core/2.17.2/com/fasterxml/jackson/core/JsonGenerator.Feature.html

pjfanning · 2024-09-12T12:01:49Z

The change that you have made to PackageVersion.java.in breaks the build - please revert it.

rnetuka · 2024-09-12T12:18:43Z

@pjfanning Thanks for your review. I've added the feature. Not sure if StreamWriteFeature is the right place...

src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java

pjfanning · 2024-09-12T12:26:10Z

@pjfanning Thanks for your review. I've added the feature. Not sure if StreamWriteFeature is the right place...

Seems ok to me. Do you know if other Generator implementations handle surrogates ok? Eg WriterBasedJsonGenerator.

pjfanning · 2024-09-12T12:47:55Z

@rnetuka @cowtowncoder It looks like the Windows CI build cannot handle UTF-8 chars in source files.

rnetuka · 2024-09-12T13:00:57Z

@pjfanning WriterBasedJsonGenerator works fine and writes the surrogate pairs automatically.

As for UTF-8 chars in source files, do you mean the ones in the new test? If so, I can just replace them with escapes.

pjfanning · 2024-09-12T13:21:05Z

@pjfanning WriterBasedJsonGenerator works fine and writes the surrogate pairs automatically.

As for UTF-8 chars in source files, do you mean the ones in the new test? If so, I can just replace them with escapes.

The Windows build doesn't like any of your non-ASCII chars. Could you have escape all the non-ASCII chars?

src/main/java/com/fasterxml/jackson/core/JsonGenerator.java

cowtowncoder · 2024-09-12T16:11:05Z

Overall looks good; as long as overhead is localized into handling of 4-byte cases it should be fine (performance would be my main concern in general).

src/test/java/com/fasterxml/jackson/core/json/StringGenerationTest.java

cowtowncoder · 2024-09-12T16:17:49Z

@rnetuka Forgot to say: big thank you for contributing this!!! I'll try to make sure this gets in 2.18.0 release; as I said this looks pretty good already.

rnetuka · 2024-09-13T10:44:24Z

@cowtowncoder Based on you review, I've updated the PR with following changes:

moved the feature into the group // Misc other
added javadoc for the feature
reused Surrogate223Test (which is now green) instead of creating new test method

I've renamed the new feature to COMBINE_UNICODE_SURROGATES which I believe will better describe its purpose. I've also put the feature only to JsonGenerator (previously I put it also in StreamWriteFeature, but that would make it kind of duplicate).

Thank you for you review. If you have any more comments, just let me know.

cowtowncoder · 2024-09-16T01:22:48Z

@rnetuka Just realized I don't have a CLA from you. However, I do have corporate-CLA from Red Hat -- and it looks like you are employed by Red Hat. If this is the case, I think you are covered and we are good.
Please let me know if this is the case (will remove "cla-needed" label).

Also: I see failed unit tests -- is this PR still work-in-progress?
(I had to approve CI run since this is your first contribution, hopefully it will now re-run on all pushes)

rnetuka · 2024-09-16T06:59:59Z

@cowtowncoder Yes, I'm Red Hat employee.

As for the test, I've noticed Surrogate223Test#surrogatesCharBacked is written for an option to escape the surrogate pairs for StringWriter/WriterBasedJsonGenerator. WriterBasedJsonGenerator always combines the surrogate pairs together, so I've removed the latter part of the test. Please, take a look and confirm this is ok for you.

To sum it up, if merged at this point:
UTF8JsonGenerator escapes 4-byte characters by default
UTF8JsonGenerator can be configured to combine the bytes instead
WriterBasedJsonGenerator always combines the bytes together

Unless you request other changes, I consider this PR done.

cowtowncoder · 2024-09-17T01:12:52Z

src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java

+    private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) {
+        String s = String.valueOf(highSurrogate) + lowSurrogate;
+        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
+        System.arraycopy(bytes, 0, _outputBuffer, outputPtr, bytes.length);


Yikes. This is rather wasteful way of doing this... Let me think we really shouldn't have do too this much allocation of things.

Would

private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) { int unicode = 0x10000; unicode += (highSurrogate & 0x03FF) << 10; unicode += (lowSurrogate & 0x03FF); byte[] utf8 = new byte[4]; utf8[0] = (byte) (0b11110000 + ((unicode & 0b00000000_00011100_00000000_00000000) >> 18)); utf8[1] = (byte) (0b10000000 + ((unicode & 0b00000000_00000011_11110000_00000000) >> 12)); utf8[2] = (byte) (0b10000000 + ((unicode & 0b00000000_00000000_00001111_11000000) >> 6)); utf8[3] = (byte) (0b10000000 + (unicode & 0b00000000_00000000_00000000_00111111)); System.arraycopy(utf8, 0, _outputBuffer, outputPtr, utf8.length); return outputPtr + utf8.length; }

be better?

Yes. :)

Can also eliminate temporary buffer, assign directly in _outputBuffer.

Thank you for working with me here!

Copied with minor modifications; also realized unit test was not verifying actual round-trip encode/decode, added. Now catches some of problems I tried.
(ideally we'd have more tests for surrogate pairs but will have to do for now).

Will merge.

cowtowncoder · 2024-09-18T01:50:29Z

Thank you for contributing this, @rnetuka ! Will merge in 2.18 for inclusion in 2.18.0 release.

rnetuka force-pushed the write-surrogate-pairs branch from 4c25457 to 5eabe2e Compare September 12, 2024 12:17

pjfanning reviewed Sep 12, 2024

View reviewed changes

src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java Outdated Show resolved Hide resolved

rnetuka force-pushed the write-surrogate-pairs branch from 969c112 to 8a197e7 Compare September 12, 2024 12:54

cowtowncoder reviewed Sep 12, 2024

View reviewed changes

src/main/java/com/fasterxml/jackson/core/JsonGenerator.java Outdated Show resolved Hide resolved

cowtowncoder reviewed Sep 12, 2024

View reviewed changes

src/test/java/com/fasterxml/jackson/core/json/StringGenerationTest.java Outdated Show resolved Hide resolved

cowtowncoder reviewed Sep 12, 2024

View reviewed changes

src/test/java/com/fasterxml/jackson/core/json/StringGenerationTest.java Outdated Show resolved Hide resolved

rnetuka force-pushed the write-surrogate-pairs branch from 8a197e7 to bd3e8db Compare September 13, 2024 10:36

rnetuka requested review from cowtowncoder and pjfanning September 13, 2024 10:46

rnetuka force-pushed the write-surrogate-pairs branch from bd3e8db to 8f0de85 Compare September 13, 2024 11:19

cowtowncoder added the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Sep 16, 2024

Write 4-byte characters (surrogate pairs) instead of escapes

6fad16d

rnetuka force-pushed the write-surrogate-pairs branch from 8f0de85 to 6fad16d Compare September 16, 2024 06:57

cowtowncoder removed the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Sep 16, 2024

Add JsonWriteFeature counterpart (important!), update release notes

f57c128

cowtowncoder reviewed Sep 17, 2024

View reviewed changes

cowtowncoder added 2 commits September 16, 2024 18:16

Try to optimize encoding of surrogate pairs further

ef5d673

...

90bf947

rnetuka requested a review from cowtowncoder September 17, 2024 10:38

cowtowncoder added 2 commits September 17, 2024 18:42

Optimize surrogate output, verify output goodness

30a4797

...

d782c65

cowtowncoder approved these changes Sep 18, 2024

View reviewed changes

cowtowncoder mentioned this pull request Sep 18, 2024

UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223

Closed

Merge branch '2.18' into write-surrogate-pairs

bf503b2

cowtowncoder merged commit 4d47aae into FasterXML:2.18 Sep 18, 2024
7 checks passed

rnetuka mentioned this pull request Sep 18, 2024

[CAMEL-21199] Camel-jackson not properly marshalling 4-byte characters apache/camel#15515

Closed

pjfanning mentioned this pull request Nov 13, 2024

Non-surrogate characters being incorrectly combined when JsonWriteFeature.COMBINE_UNICODE_SURROGATES_IN_UTF8 is enabled #1359

Closed

cowtowncoder added this to the 2.18.0 milestone Dec 4, 2024

This was referenced Feb 4, 2025

feature COMBINE_UNICODE_SURROGATES_IN_UTF8 doesn't work when custom characterEscape is used #1398

Closed

fix the surrogate utf8 feature when custom characterEscapes is used #1399

Merged

Uh oh!

Write 4-byte characters (surrogate pairs) instead of escapes #1335

Write 4-byte characters (surrogate pairs) instead of escapes #1335

Uh oh!

Conversation

rnetuka commented Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pjfanning commented Sep 12, 2024

Uh oh!

pjfanning commented Sep 12, 2024

Uh oh!

rnetuka commented Sep 12, 2024

Uh oh!

Uh oh!

pjfanning commented Sep 12, 2024

Uh oh!

pjfanning commented Sep 12, 2024

Uh oh!

rnetuka commented Sep 12, 2024

Uh oh!

pjfanning commented Sep 12, 2024

Uh oh!

Uh oh!

cowtowncoder commented Sep 12, 2024

Uh oh!

Uh oh!

Uh oh!

cowtowncoder commented Sep 12, 2024

Uh oh!

rnetuka commented Sep 13, 2024

Uh oh!

cowtowncoder commented Sep 16, 2024

Uh oh!

rnetuka commented Sep 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cowtowncoder Sep 17, 2024

Choose a reason for hiding this comment

Uh oh!

rnetuka Sep 17, 2024

Choose a reason for hiding this comment

Uh oh!

cowtowncoder Sep 18, 2024

Choose a reason for hiding this comment

Uh oh!

cowtowncoder Sep 18, 2024

Choose a reason for hiding this comment

Uh oh!

cowtowncoder commented Sep 18, 2024

Uh oh!

Uh oh!

Uh oh!

rnetuka commented Sep 12, 2024 •

edited

Loading

rnetuka commented Sep 16, 2024 •

edited

Loading