Implement regexp literal syntax checking #2026

jakebailey · 2025-11-05T21:10:26Z

This PR implements regexp literal syntax checking. The main difference is that it is no longer a scanning step, as that would make it a parser error, which would mean we would have to go back to having separate ASTs depending on the language version being used (something we worked so hard to stop doing).

Instead, the regex checks are grammar checks that happen during checking. Additionally, the code doesn't live in the scanner package at all, as it is entirely self-contained (much to the original code's credit).

This is a really complicated piece of code, largely because of the sheer scale of these checks, but also because of annoying edge cases: the old compiler relied on already running in JS and therefore got identical string behavior when decoding for free. Matching that here required a lot of fiddling.

The bulk of this was ported by Copilot, followed by a bunch of cleanups and fixes. The result is that every regexp-related test matches, except for regularExpressionCharacterClassRangeOrder.errors.txt.diff and regularExpressionWithNonBMPFlags.errors.txt.diff, which I believe are improvements due to better handling of positions around unusual characters, e.g.:

     // See https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
     const regexes: RegExp[] = [
     	/[𝘈-𝘡][𝘡-𝘈]/,
-    	   ~~~
+    	  ~~~
 !!! error TS1517: Range out of order in character class.
-    	          ~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/u,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     	/[𝘈-𝘡][𝘡-𝘈]/v,
-    	         ~~~~~
+    	       ~~~
 !!! error TS1517: Range out of order in character class.
     
     	/[\u{1D608}-\u{1D621}][\u{1D621}-\u{1D608}]/,

But they probably aren't really improvements; it's just that the offsets are UTF-8 now and these are unusual characters. tsserver in the editor shows similar ranges, so this was probably just a bug in the old baseline squiggler.
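To illustrate why the squiggle columns move, here is a minimal sketch (not code from this PR) that measures the same regexp prefix in UTF-8 bytes and in UTF-16 code units. The helper name `utf16Len` is made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf16Len is a hypothetical helper: it returns how many UTF-16 code units
// s occupies, which is how a JavaScript string would measure it.
func utf16Len(s string) int {
	return len(utf16.Encode([]rune(s)))
}

func main() {
	// The text preceding the second character class in /[𝘈-𝘡][𝘡-𝘈]/,
	// written with escapes for U+1D608 and U+1D621.
	prefix := "/[\U0001D608-\U0001D621]["

	fmt.Println("UTF-8 bytes:      ", len(prefix))
	fmt.Println("UTF-16 code units:", utf16Len(prefix))
	fmt.Println("code points:      ", utf8.RuneCountInString(prefix))
}
```

Each of the two mathematical letters takes 4 bytes in UTF-8 but 2 code units in UTF-16, so any offset computed after them differs between the two encodings.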

…elines for out-of-range escapes

- Check parsed hex code point in regexp escape sequences and emit error if > 0x10FFFF
- Add test baselines for unicodeExtendedEscapesInRegularExpressions07 and 12 (ES5/ES6) showing TS1198 for out-of-range values
- Remove obsolete .diff baseline artifacts
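A rough sketch of the range check described in that commit message, assuming the hex digits between the braces of `\u{...}` have already been extracted. The function name `checkExtendedEscape` and the error text are invented for illustration; the actual checker reports a diagnostic rather than returning an `error`.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
)

// checkExtendedEscape is a hypothetical helper: it parses the hex digits
// found inside \u{...} and rejects values above the largest Unicode code
// point, 0x10FFFF (unicode.MaxRune).
func checkExtendedEscape(hexDigits string) (rune, error) {
	v, err := strconv.ParseUint(hexDigits, 16, 64)
	if err != nil {
		// Covers non-hex input and values too large for 64 bits.
		return 0, fmt.Errorf("invalid hex digits %q: %w", hexDigits, err)
	}
	if v > unicode.MaxRune {
		return 0, fmt.Errorf("code point 0x%X is out of range (max 0x10FFFF)", v)
	}
	return rune(v), nil
}

func main() {
	for _, digits := range []string{"1D608", "10FFFF", "110000"} {
		r, err := checkExtendedEscape(digits)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ok: U+%04X\n", r)
	}
}
```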

When reporting "Range out of order in character class", extend the diagnostic length to include a pending low surrogate (using surrogateState.utf8Size). This ensures the error highlight covers the full surrogate pair in non-Unicode mode. Update baselines to reflect the corrected highlight spans.
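For context on why surrogates matter here: without the `u` or `v` flag, a regexp sees an astral character as two UTF-16 code units, so the range written as `𝘈-𝘡` actually runs from the low surrogate of `𝘈` to the high surrogate of `𝘡`. The sketch below is not from this PR, and `nonUnicodeRangeEndpoints` is a made-up name. It shows why even the first class in `/[𝘈-𝘡][𝘡-𝘈]/` is reported as out of order.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// nonUnicodeRangeEndpoints is a hypothetical helper. Given the two astral
// code points written around the '-' in a character class, it returns the
// endpoints a non-Unicode-mode regexp really compares: the low surrogate of
// the first and the high surrogate of the second.
// It assumes both inputs are above U+FFFF; utf16.EncodeRune returns
// U+FFFD, U+FFFD for anything that does not need a surrogate pair.
func nonUnicodeRangeEndpoints(first, second rune) (start, end rune) {
	_, firstLow := utf16.EncodeRune(first)
	secondHigh, _ := utf16.EncodeRune(second)
	return firstLow, secondHigh
}

func main() {
	// U+1D608 (𝘈) and U+1D621 (𝘡) are in order as code points...
	start, end := nonUnicodeRangeEndpoints(0x1D608, 0x1D621)
	// ...but the code units that form the range are not.
	fmt.Printf("range is %04X-%04X, out of order: %v\n", start, end, start > end)
}
```

Because the range begins at a low surrogate, a highlight that covers only that code unit stops in the middle of a character, which is what extending the span over the full pair avoids.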

Copilot

Pull Request Overview

This PR updates baseline/reference files for regular expression test cases in the compiler. The changes involve removing .diff files (which previously showed differences between expected and actual outputs) and populating corresponding .errors.txt files with the expected error output. This indicates that the Go port's regular expression error handling has converged with the TypeScript reference implementation.

Key Changes

Regular expression validation errors are now being correctly reported, matching the TypeScript reference behavior
Unicode escape sequence validation in regex patterns
Character class range ordering validation
Regex flag validation (including unicode flags, duplicate flags, and unknown flags)

Reviewed Changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

File	Description
Multiple `.errors.txt.diff` files	Removed diff files showing previous discrepancies in error reporting
Multiple `.errors.txt` files	Added expected error outputs that now match TypeScript's behavior
Files cover various regex features	Unicode escapes, character classes, flags, backreferences, quantifiers, and more

jakebailey · 2025-11-05T22:58:42Z

Based on the little differences I've noticed, the testing of this feature is not entirely complete...

jakebailey · 2025-11-06T22:27:36Z

@graphemecluster I'm not sure if you are around, but since you are looking at regex stuff again, maybe you'd be willing to take a look at this.

Not 100% confident in it, and I don't like the duplication due to the fact that we need some weird scanning stuff to deal with things the new scanner does not much care about...

DanielRosenwasser · 2025-11-06T23:09:19Z

internal/regexpchecker/tables.go

+
+// Table 67: Binary Unicode property aliases and their canonical property names
+// https://tc39.es/ecma262/#table-binary-unicode-properties
+var binaryUnicodeProperties = collections.NewSetFromItems(


It's annoying that we can't just use unicode.Categories for this because of differences like Bidi_Control and Bidi_C existing.

jakebailey added 30 commits November 4, 2025 19:10

wip

40ee30b

wip

1af4d7c

wip

1cef046

wip

3110ac6

wip

b648805

wip

0470fb4

wip

899ad1d

wip

a5a1597

move code

9a4c582

wip

918571e

wip

9c50cd8

Simplify

6b788d2

Simplify

e1c3215

Simplify

1831e7b

Delete temp test

e944385

Remove helper

2d4159c

refactor

d6a8a5c

refactor

b36d027

Simplify

d878c54

more refactors

24e7ea1

move code around

6e230fb

use a set

1ab893a

Small simplifications

fd90e1b

Rename func

0c95bf0

cleanups

3450dae

Move utf16 junk out

fdb5b71

Defer to utf16 package where possible

cb08daf

Remove unused method

b63d61b

Copilot AI review requested due to automatic review settings November 5, 2025 21:10

Copilot AI reviewed Nov 5, 2025


jakebailey added 3 commits November 5, 2025 14:09

Fix named capture group stuff

0604d23

Fix porting difference

082cbcc

fix escaping

320e435

DanielRosenwasser reviewed Nov 6, 2025

3 participants
