The C accelerator of json reports "Invalid \uXXXX escape" where the pure-Python decoder reports "Unterminated string starting at", for a \uXXXX escape whose final hex digit is the last character of the input.
Minimal repro
The input is an opening quote followed by a complete \u0041 escape and no closing quote (7 characters: ", \, u, 0, 0, 4, 1):
>>> import json
>>> json.loads(r'"\u0041') # C accelerator
Traceback (most recent call last):
...
json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 3 (char 2)
>>> from json import decoder
>>> decoder.py_scanstring(r'"\u0041', 1) # pure-Python
Traceback (most recent call last):
...
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 1 (char 0)
The four hex digits 0041 form a complete, valid \uXXXX escape, so the string is merely unterminated. The pure-Python decoder diagnoses this correctly; the C accelerator misreports it as an invalid escape.
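The comparison can also be scripted so that both scanners are exercised side by side. This is a minimal sketch, not part of the proposed patch: scan_error and the two constants are hypothetical names introduced here, and it assumes json.decoder.c_scanstring is None on builds without the C accelerator. The output for the C accelerator depends on whether the build already carries the fix, so no expected output is shown for it.

```python
import json
from json import decoder

# Complete \u0041 escape with no closing quote (the input from the repro).
UNTERMINATED_AFTER_ESCAPE = r'"\u0041'
# Genuinely truncated escape: only three hex digits before end of input.
TRUNCATED_ESCAPE = r'"\u004'


def scan_error(scanstring, text):
    """Return (msg, pos) of the JSONDecodeError raised while scanning, or None."""
    try:
        # Index 1 is the first character after the opening quote.
        scanstring(text, 1)
    except json.JSONDecodeError as exc:
        return exc.msg, exc.pos
    return None


scanners = {"pure-Python": decoder.py_scanstring}
if decoder.c_scanstring is not None:  # assumption: None when _json is missing
    scanners["C accelerator"] = decoder.c_scanstring

for name, scanstring in scanners.items():
    for text in (UNTERMINATED_AFTER_ESCAPE, TRUNCATED_ESCAPE):
        print(f"{name:14} {text!r:12} -> {scan_error(scanstring, text)}")
```

Both scanners are expected to agree on the truncated input; only the complete escape at end-of-input should show a difference on affected builds.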
Root cause
Modules/_json.c, in scanstring_unicode():
end = next + 4;
if (end >= len) {
    raise_errmsg("Invalid \\uXXXX escape", pystr, next - 1);
    goto bail;
}
The four hex digits are read at indices next .. next+3, i.e. end-4 .. end-1. When end == len, those indices are len-4 .. len-1, all in bounds, so the escape is complete and valid and the input is simply unterminated.
The check should be > rather than >=, because only end > len means the escape itself runs past the end of the input. With that change, at end == len control falls through, the forward scan finds no closing quote, and the existing raise_errmsg("Unterminated string starting at", ...) fires, matching the pure-Python decoder exactly.
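The index arithmetic can be checked with a small Python model of the bounds test. This is only an illustration of the off-by-one, not the C code itself: escape_fits, current_check_rejects and proposed_check_rejects are hypothetical helpers, and next_idx stands for the index of the first hex digit (the value of next at the point of the check).

```python
def escape_fits(length, next_idx):
    """True when all four hex digits next_idx .. next_idx+3 are in bounds."""
    return next_idx + 3 <= length - 1


def current_check_rejects(length, next_idx):
    """Model of the existing test: end >= len."""
    end = next_idx + 4
    return end >= length


def proposed_check_rejects(length, next_idx):
    """Model of the proposed test: end > len."""
    end = next_idx + 4
    return end > length


# In all three inputs the first hex digit sits at index 3.
for text in (r'"\u004', r'"\u0041', r'"\u0041"'):
    n = len(text)
    print(
        f"{text!r:12} len={n} fits={escape_fits(n, 3)} "
        f"current_rejects={current_check_rejects(n, 3)} "
        f"proposed_rejects={proposed_check_rejects(n, 3)}"
    )
```

In this model the proposed check rejects exactly the inputs where the escape does not fit, while the existing check also rejects the one case where the escape fits and ends at the last character.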
Why this matters
This is a parity and wrong-diagnostic bug between the C accelerator and the pure-Python decoder: both raise JSONDecodeError for the same input but disagree on the message, and the C one points the user at the wrong problem (a non-existent invalid escape rather than the missing closing quote). The json C and pure-Python error messages were deliberately synchronized in bpo-5067, so divergence here regresses that contract.
This is distinct from gh-125660 (which is about the pure-Python decoder accepting invalid unicode escapes); here both decoders correctly reject genuinely truncated escapes, and the only divergence is the misclassification of a complete escape at end-of-input.
I have a one-character fix (>= → >) plus regression tests covering both the C and pure-Python decoders ready to submit.
Linked PRs