json: C decoder reports "Invalid \uXXXX escape" instead of "Unterminated string starting at" for a complete \uXXXX escape at end of input #152052

Description

@tonghuaroot

The C accelerator of json reports Invalid \uXXXX escape where the pure-Python decoder reports Unterminated string starting at, for a \uXXXX escape whose final hex digit is the last character of the input.

Minimal repro

The input is an opening quote followed by a complete \u0041 escape (the letter A) and no closing quote (7 characters: ", \, u, 0, 0, 4, 1):

>>> import json
>>> json.loads(r'"\u0041')           # C accelerator
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 3 (char 2)

>>> from json import decoder
>>> decoder.py_scanstring(r'"\u0041', 1)   # pure-Python
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 1 (char 0)

The four hex digits 0041 form a complete, valid \uXXXX escape, so the string is merely unterminated. The pure-Python decoder diagnoses this correctly; the C accelerator misreports it as an invalid escape.
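To compare the two scanners side by side in one run, a small script along these lines may help. It is only a sketch: the `diagnose` helper is a name made up for this example, and it relies on `json.decoder.c_scanstring` being `None` when the `_json` accelerator is not available.

```python
import json
from json import decoder


def diagnose(scan, text):
    """Return the JSONDecodeError message raised by `scan`, or None if it succeeds."""
    try:
        scan(text, 1)  # index 1 is the first character after the opening quote
    except json.JSONDecodeError as exc:
        return exc.msg
    return None


doc = r'"\u0041'  # complete escape, no closing quote

py_msg = diagnose(decoder.py_scanstring, doc)
# c_scanstring is None when the _json accelerator could not be imported
c_msg = diagnose(decoder.c_scanstring, doc) if decoder.c_scanstring else None

print("pure-Python:", py_msg)
print("C          :", c_msg)
print("parity     :", c_msg is None or c_msg == py_msg)
```

On an affected build the last line should print `parity     : False`; once the two scanners agree it prints `True`.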

Root cause

Modules/_json.c, in scanstring_unicode():

end = next + 4;
if (end >= len) {
    raise_errmsg("Invalid \\uXXXX escape", pystr, next - 1);
    goto bail;
}

The four hex digits are read at indices next .. next+3, i.e. end-4 .. end-1. When end == len those indices are len-4 .. len-1 — all in bounds — so the escape is complete and valid; the input is simply unterminated. The check should be > rather than >=: only end > len means the escape itself runs past the end of the input. At end == len control should fall through, the forward scan finds no closing quote, and the existing raise_errmsg("Unterminated string starting at", ...) fires, matching the pure-Python decoder exactly.
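The index arithmetic can be checked with a small Python model. This is not the C code, just the same comparison transcribed by hand; the function names are invented for this illustration.

```python
def digits_in_bounds(next_idx, length):
    """True when the hex digits at next_idx .. next_idx+3 all lie inside the input."""
    return next_idx + 3 <= length - 1


def rejects_current(next_idx, length):
    end = next_idx + 4
    return end >= length  # the check quoted above


def rejects_proposed(next_idx, length):
    end = next_idx + 4
    return end > length  # the proposed check


complete = r'"\u0041'   # 7 characters, first hex digit at index 3
truncated = r'"\u004'   # 6 characters, only three hex digits

for text in (complete, truncated):
    n = len(text)
    print(text, digits_in_bounds(3, n), rejects_current(3, n), rejects_proposed(3, n))
```

For the complete escape the digits are in bounds, yet the `>=` form rejects it; the `>` form rejects only the truncated input, where the digits really do run past the end.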

Why this matters

This is a C-accelerator vs pure-Python parity / wrong-diagnostic bug: both decoders raise JSONDecodeError for the same input, but with different messages and positions, and the C one points the user at the wrong problem (a non-existent invalid escape rather than the missing closing quote). The json C and pure-Python error messages were deliberately synchronized in bpo-5067, so divergence here regresses that contract.

This is distinct from gh-125660 (which is about the pure-Python decoder accepting invalid unicode escapes); here both decoders correctly reject genuinely truncated escapes, and the only divergence is the misclassification of a complete escape at end-of-input.

I have a one-character fix (>= changed to >) plus regression tests covering both the C and pure-Python decoders ready to submit.
