Implement regexp literal syntax checking #2026
…elines for out-of-range escapes - Check parsed hex code point in regexp escape sequences and emit error if > 0x10FFFF - Add test baselines for unicodeExtendedEscapesInRegularExpressions07 and 12 (ES5/ES6) showing TS1198 for out-of-range values - Remove obsolete .diff baseline artifacts
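The out-of-range check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `parseExtendedEscape`, `hexValue`, and the error values are hypothetical names, and the error wording is made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// maxCodePoint is the largest valid Unicode code point.
const maxCodePoint = 0x10FFFF

// Hypothetical error values; the real checker reports diagnostics instead.
var (
	errNoDigits   = errors.New("hexadecimal digit expected")
	errOutOfRange = errors.New("extended Unicode escape value out of range")
)

// hexValue converts one ASCII hex digit to its numeric value.
func hexValue(b byte) (rune, bool) {
	switch {
	case b >= '0' && b <= '9':
		return rune(b - '0'), true
	case b >= 'a' && b <= 'f':
		return rune(b-'a') + 10, true
	case b >= 'A' && b <= 'F':
		return rune(b-'A') + 10, true
	}
	return 0, false
}

// parseExtendedEscape takes the digits found between the braces of a
// \u{...} escape and returns the code point, or errOutOfRange if the
// value exceeds 0x10FFFF. Accumulation stops once the limit is passed,
// so arbitrarily long digit runs cannot overflow.
func parseExtendedEscape(hexDigits string) (rune, error) {
	if hexDigits == "" {
		return 0, errNoDigits
	}
	var v rune
	overflow := false
	for i := 0; i < len(hexDigits); i++ {
		d, ok := hexValue(hexDigits[i])
		if !ok {
			return 0, errNoDigits
		}
		if !overflow {
			v = v*16 + d
			overflow = v > maxCodePoint
		}
	}
	if overflow {
		return 0, errOutOfRange
	}
	return v, nil
}

func main() {
	for _, digits := range []string{"1D608", "10FFFF", "110000"} {
		r, err := parseExtendedEscape(digits)
		fmt.Printf("\\u{%s}: U+%X, err=%v\n", digits, r, err)
	}
}
```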
When reporting "Range out of order in character class", extend the diagnostic length to include a pending low surrogate (using surrogateState.utf8Size). This ensures the error highlight covers the full surrogate pair in non-Unicode mode. Update baselines to reflect the corrected highlight spans.
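As background for why surrogates matter here: without the `u` or `v` flag, a pattern is matched as UTF-16 code units, so a range between two astral characters effectively runs from the low surrogate of the first to the high surrogate of the second. The sketch below illustrates that comparison only; `rangeOutOfOrder` is a hypothetical helper and does not reproduce the PR's `surrogateState` handling.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// rangeOutOfOrder reports whether the class range lo-hi is out of order.
// In Unicode mode whole code points are compared. Otherwise each character
// is viewed as its UTF-16 code units, and the range spans from the last
// unit of lo to the first unit of hi.
func rangeOutOfOrder(lo, hi rune, unicodeMode bool) bool {
	if unicodeMode {
		return lo > hi
	}
	loUnits := utf16.Encode([]rune{lo})
	hiUnits := utf16.Encode([]rune{hi})
	return loUnits[len(loUnits)-1] > hiUnits[0]
}

func main() {
	const a, z = 0x1D608, 0x1D621 // the two characters used in the test case

	fmt.Println("[a-z] non-Unicode:", rangeOutOfOrder(a, z, false))
	fmt.Println("[z-a] non-Unicode:", rangeOutOfOrder(z, a, false))
	fmt.Println("[a-z] Unicode:    ", rangeOutOfOrder(a, z, true))
	fmt.Println("[z-a] Unicode:    ", rangeOutOfOrder(z, a, true))
}
```

Under this model both classes in `/[𝘈-𝘡][𝘡-𝘈]/` are out of order without a Unicode flag, while only the second is with `u` or `v`, which is consistent with the number of TS1517 errors in the baseline.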
Pull Request Overview
This PR updates baseline/reference files for regular expression test cases in the compiler. The changes involve removing .diff files (which previously showed differences between expected and actual outputs) and populating corresponding .errors.txt files with the expected error output. This indicates that the Go port's regular expression error handling has converged with the TypeScript reference implementation.
Key Changes
- Regular expression validation errors are now being correctly reported, matching the TypeScript reference behavior
- Unicode escape sequence validation in regex patterns
- Character class range ordering validation
- Regex flag validation (including unicode flags, duplicate flags, and unknown flags)
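For illustration, flag validation of the kind listed above might look like the following. This is a sketch under assumptions: `checkFlags` and its message strings are hypothetical, and the real checker reports positioned diagnostics rather than strings.

```go
package main

import (
	"fmt"
	"strings"
)

// knownFlags lists the RegExp flags defined by ECMAScript.
const knownFlags = "dgimsuvy"

// checkFlags returns one message per problem found in the flag string:
// unknown flags, duplicate flags, and the invalid u+v combination.
// Iterating with range walks code points, so a non-BMP "flag" is
// reported once rather than once per byte.
func checkFlags(flags string) []string {
	var problems []string
	seen := map[rune]bool{}
	for _, f := range flags {
		switch {
		case !strings.ContainsRune(knownFlags, f):
			problems = append(problems, fmt.Sprintf("unknown flag %q", f))
		case seen[f]:
			problems = append(problems, fmt.Sprintf("duplicate flag %q", f))
		}
		seen[f] = true
	}
	if seen['u'] && seen['v'] {
		problems = append(problems, "flags 'u' and 'v' cannot be combined")
	}
	return problems
}

func main() {
	for _, flags := range []string{"gi", "gg", "x", "uv", "\U0001D608"} {
		fmt.Printf("%q: %v\n", flags, checkFlags(flags))
	}
}
```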
Reviewed Changes
Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Multiple .errors.txt.diff files | Removed diff files showing previous discrepancies in error reporting |
| Multiple .errors.txt files | Added expected error outputs that now match TypeScript's behavior |
| Files cover various regex features | Unicode escapes, character classes, flags, backreferences, quantifiers, and more |
Based on the little differences I've noticed, the testing of this feature is not entirely complete...
@graphemecluster I'm not sure if you are around and willing to look at this, but since you are looking at regex stuff again maybe you're willing to look at this again. Not 100% confident in it, and I don't like the duplication due to the fact that we need some weird scanning stuff to deal with things the new scanner does not much care about...
// Table 67: Binary Unicode property aliases and their canonical property names
// https://tc39.es/ecma262/#table-binary-unicode-properties
var binaryUnicodeProperties = collections.NewSetFromItems(
It's annoying that we can't just use unicode.Categories for this because of differences like Bidi_Control and Bidi_C existing.
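To make the alias problem concrete, here is a small sketch of a hand-maintained table that accepts both the short and the canonical spelling. The map holds only two hand-picked entries from the ECMAScript table, and `binaryPropertyAliases` and `canonicalBinaryProperty` are hypothetical names rather than the PR's `binaryUnicodeProperties` set. The `unicode.Properties` lookup is included only to show what Go's standard tables report for each spelling.

```go
package main

import (
	"fmt"
	"unicode"
)

// binaryPropertyAliases maps every accepted spelling to its canonical
// property name. Illustrative excerpt only; the full table is much larger.
var binaryPropertyAliases = map[string]string{
	"ASCII_Hex_Digit": "ASCII_Hex_Digit",
	"AHex":            "ASCII_Hex_Digit",
	"Bidi_Control":    "Bidi_Control",
	"Bidi_C":          "Bidi_Control",
}

// canonicalBinaryProperty resolves a name written inside \p{...} to its
// canonical form, reporting false for names the table does not know.
func canonicalBinaryProperty(name string) (string, bool) {
	canonical, ok := binaryPropertyAliases[name]
	return canonical, ok
}

func main() {
	for _, name := range []string{"Bidi_C", "Bidi_Control", "AHex", "NotAProperty"} {
		canonical, ok := canonicalBinaryProperty(name)
		_, inStdlib := unicode.Properties[name]
		fmt.Printf("%-14s canonical=%q known=%v in unicode.Properties=%v\n",
			name, canonical, ok, inStdlib)
	}
}
```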
This PR implements regexp literal syntax checking. The main difference is that it is no longer a scanning step, as that would make it a parser error, which would mean we would have to go back to having separate ASTs depending on the language version being used (something we worked so hard to stop doing).
Instead, the regex checks are grammar checks that happen during checking. Additionally, the code doesn't live in the scanner package at all, as it is entirely self-contained (much to the original code's credit).
This is a really complicated piece of code, largely due to the sheer scale of these checks, but also because of annoying edge cases: the old compiler relies on TS already being written in JS and therefore gets identical string behavior when decoding. This required just a load of fiddling to get working.
The bulk of this is copilot ported, with a bunch of cleanups and fixes. The result is that every regexp related test matches, except for regularExpressionCharacterClassRangeOrder.errors.txt.diff and regularExpressionWithNonBMPFlags.errors.txt.diff, which I believe are improving because of better handling of positioning in weird characters, e.g.:

    // See https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
    const regexes: RegExp[] = [
    /[𝘈-𝘡][𝘡-𝘈]/,
    - ~~~
    + ~~~
    !!! error TS1517: Range out of order in character class.
    - ~~~
    + ~~~
    !!! error TS1517: Range out of order in character class.
    /[𝘈-𝘡][𝘡-𝘈]/u,
    - ~~~~~
    + ~~~
    !!! error TS1517: Range out of order in character class.
    /[𝘈-𝘡][𝘡-𝘈]/v,
    - ~~~~~
    + ~~~
    !!! error TS1517: Range out of order in character class.
    /[\u{1D608}-\u{1D621}][\u{1D621}-\u{1D608}]/,

But probably aren't really, it's just that the offsets are UTF-8 now and these are weird chars. tsserver in the editor shows similar ranges so this was probably just a bug in the old baseline squiggler.
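The offset difference is easy to see by measuring the same range in each unit. The snippet below is only an illustration of the sizes involved (`widths` is a hypothetical helper); it does not reproduce how the baseline squiggler computes spans.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// widths measures s in UTF-8 bytes, UTF-16 code units, and code points.
func widths(s string) (utf8Bytes, utf16Units, codePoints int) {
	utf8Bytes = len(s)
	utf16Units = len(utf16.Encode([]rune(s)))
	codePoints = utf8.RuneCountInString(s)
	return
}

func main() {
	// The range from the test case, written with escapes: U+1D608 - U+1D621.
	s := "\U0001D608-\U0001D621"
	b, u, c := widths(s)
	fmt.Printf("UTF-8 bytes: %d, UTF-16 units: %d, code points: %d\n", b, u, c)
}
```

A span that was computed as UTF-16 offsets and one computed as UTF-8 offsets will therefore disagree for any text containing these characters, even when both point at the same source range.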