Optimize matching to match on scalar values when possible #525

rctcwyvrn · 2022-06-30T00:34:56Z

Emit matchScalar instructions when emitting ascii quoted literals, ascii characters, and any scalar, as well as when in unicode scalars mode. matchScalar stores the scalar value inline instead of in a register like the character for match, making it faster and less susceptible to ARC
Add matchScalarBitset to use the optimization from Use a bitset for ascii-only character classes #511 when in unicode scalars mode

Future work: We currently on match on ASCII because we may be given non normalized strings. Two options to deal with this are to A. normalize inputs B. compile two different programs, one optimized for if the input string is normalized and the other not making any assumptions

Large wins in the DNA benchmarks from #515 since they have long quoted literal sequences in their patterns

=== Improvements =====================================================================
- DnaEndsMatchAll                         200ms	288ms	-88.8ms		-30.8%
- DnaMatchAll                             112ms	158ms	-45.9ms		-29.0%
- DnaEndsMatchFirst                       38.2ms	55.3ms	-17.1ms		-30.9%
- LiteralSearchAll                        70.6ms	85.4ms	-14.8ms		-17.3%
- EmailRFCNoMatchesAll                    98.3ms	108ms	-9.45ms		-8.8%
- EmailRFCAll                             42.2ms	51.2ms	-8.98ms		-17.5%
- DnaMatchFirst                           19.5ms	27.9ms	-8.38ms		-30.0%
- EmailLookaheadAll                       74.9ms	82.5ms	-7.63ms		-9.2%
- EmailLookaheadNoMatchesAll              53.8ms	60.5ms	-6.74ms		-11.1%
- ReluctantQuantWithTerminalWhole         8.6ms	10.5ms	-1.87ms		-17.9%
- HtmlAll                                 11.5ms	13.4ms	-1.85ms		-13.8%
- HangulSyllableAll                       6.69ms	8.25ms	-1.56ms		-18.9%
- TaggedEmojisAll                         16.6ms	18.1ms	-1.5ms		-8.3%
- InvertedCCC                             27.8ms	29.1ms	-1.35ms		-4.6%
- CaseInsensitiveCCC                      11.1ms	12ms	-933µs		-7.8%
- HangulSyllableFirst                     3.1ms	3.99ms	-891µs		-22.3%
- CssAll                                  4.05ms	4.85ms	-794µs		-16.4%
- BasicRangeCCC                           10.4ms	11.2ms	-785µs		-7.0%
- BasicCCC                                9.68ms	10.4ms	-742µs		-7.1%
- NotFoundAll                             6.41ms	7.06ms	-654µs		-9.3%

This reverts commit df8919e, reversing changes made to 3a2b324.

milseman · 2022-06-30T16:08:37Z

Sources/_StringProcessing/ByteCodeGen.swift

+    if options.semanticLevel == .graphemeCluster {
+      if options.isCaseInsensitive {
+        // future work: if all ascii, emit matchBitset instructions with
+        // case insensitive bitsets


Other ideas are to have instructions that say to do case-insensitive matching, that is we have variants of match-character/scalar and similarly for match-sequence.

A special case for ASCII could be a dedicated instruction for the parts of ASCII that are case-sensitive. It could store both scalars to match, or just store the lowercase and the engine would do the arithmetic when matching.

I.e. we'll want to get all of this off of using consumer interfaces at some point, but doesn't have to be this PR.

Sources/_StringProcessing/ByteCodeGen.swift

milseman · 2022-06-30T16:23:48Z

Sources/_StringProcessing/ByteCodeGen.swift

+    if optimizationsEnabled && c.isASCII {
+      for scalar in c.unicodeScalars {
+        let boundaryCheck = scalar == c.unicodeScalars.last!
+        builder.buildMatchScalar(scalar, boundaryCheck: boundaryCheck)


Another idea for the builder is to have a convenience function that adds a grapheme boundary check at the current position. It could either update the most recent instruction if that instruction is capable of supporting it or insert the anchor otherwise.

Sources/_StringProcessing/ConsumerInterface.swift

Sources/_StringProcessing/Engine/Processor.swift

milseman · 2022-06-30T16:33:25Z

Sources/_StringProcessing/Engine/Processor.swift

+  mutating func matchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool) -> Bool {
+    guard let curScalar = loadScalar(),
+          curScalar == s,
+          let idx = nextScalarIndex(offsetBy: 1, boundaryCheck: boundaryCheck)


It might make more sense to drop the helper function and just do this directly:

guard s == loadScalar(), !boundaryCheck || isOnGrapheme... else { ... }

Longer term we will want to use a lower-level String interface that returns both the current scalar and the next index. E.g. a nicer version of https://github.com/apple/swift/blob/main/stdlib/public/core/UnicodeHelpers.swift#L107.

Or you could consider adding that now if you like, something like the below and we can micro-optimize it later:

extension String { func _consumeScalar(in bounds: Range<String.Index>) -> (Unicode.Scalar, Index)? { assert(indices.contains(bounds)) ... } }

Sources/_StringProcessing/Regex/DSLTree.swift

rctcwyvrn · 2022-07-04T22:48:49Z

While implementing case insensitive matching instructions I opted for adding new instructions for each new case, however after doing that I benchmarked it and noticed pretty significant regressions

Comparing against benchmark result file before_case_insens_instrs.json
=== Regressions ======================================================================
- HangulSyllableFirst                     3.22ms	3.02ms	195µs		6.4%
- CssAll                                  4.34ms	4.05ms	289µs		7.1%
- HangulSyllableAll                       7.09ms	6.69ms	394µs		5.9%
- EagarQuantWithTerminalWhole             7ms	6.5ms	505µs		7.8%
- NotFoundAll                             6.95ms	6.42ms	535µs		8.3%
- GraphemeBreakNoCapAll                   9.8ms	9.22ms	578µs		6.3%
- BasicRangeCCC                           11.2ms	10.4ms	830µs		8.0%
- TaggedEmojisAll                         17.3ms	16.5ms	833µs		5.0%
- BasicCCC                                10.5ms	9.69ms	852µs		8.8%
- ReluctantQuantWithTerminalWhole         9.47ms	8.6ms	869µs		10.1%
- HtmlAll                                 12.4ms	11.5ms	873µs		7.6%
- IntersectionCCC                         16.2ms	15.3ms	901µs		5.9%
- SubtractionCCC                          15.7ms	14.8ms	936µs		6.3%
- CaseInsensitiveCCC                      12.1ms	11.1ms	1.03ms		9.3%
- symDiffCCC                              39.5ms	38.4ms	1.04ms		2.7%
- AnchoredNotFoundAll                     16ms	14.9ms	1.1ms		7.4%
- ReluctantQuantWhole                     15.2ms	13.9ms	1.27ms		9.1%
- InvertedCCC                             28.7ms	27.3ms	1.36ms		5.0%
- DnaMatchFirst                           21.4ms	19.6ms	1.79ms		9.2%
- EmailRFCAll                             45.5ms	42.2ms	3.28ms		7.8%
- DnaEndsMatchFirst                       41.9ms	38.1ms	3.8ms		10.0%
- EmailRFCNoMatchesAll                    102ms	98.3ms	3.99ms		4.1%
- LiteralSearchNotFoundAll                67.7ms	63.1ms	4.57ms		7.2%
- EmailLookaheadNoMatchesAll              58.8ms	53.8ms	4.94ms		9.2%
- LiteralSearchAll                        75.7ms	70.7ms	5.04ms		7.1%
- EmojiRegexAll                           68.1ms	61.7ms	6.38ms		10.3%
- EmailLookaheadAll                       81.3ms	74.8ms	6.47ms		8.7%
- DnaMatchAll                             123ms	112ms	10.6ms		9.4%
- DnaEndsMatchAll                         222ms	200ms	21.9ms		11.0%

After adjusting the scalar matching instructions to use one instruction with a more complex payload instead of 3 instructions the regression was much less significant

Comparing against benchmark result file before_case_insens_instrs.json
=== Regressions ======================================================================
- CssAll                                  4.08ms	4.05ms	29.7µs		0.7%
- EagarQuantWithTerminalWhole             6.56ms	6.5ms	61.7µs		0.9%
- NotFoundAll                             6.5ms	6.42ms	77.9µs		1.2%
- GraphemeBreakNoCapAll                   9.31ms	9.22ms	94.8µs		1.0%
- HtmlAll                                 11.6ms	11.5ms	101µs		0.9%
- ReluctantQuantWithTerminalWhole         8.75ms	8.6ms	150µs		1.7%
- InvertedCCC                             27.5ms	27.3ms	153µs		0.6%
- ReluctantQuantWhole                     14.1ms	13.9ms	164µs		1.2%
- DnaMatchFirst                           19.7ms	19.6ms	168µs		0.9%
- TaggedEmojisAll                         16.7ms	16.5ms	172µs		1.0%
- IntersectionCCC                         15.5ms	15.3ms	181µs		1.2%
- SubtractionCCC                          15ms	14.8ms	188µs		1.3%
- DnaEndsMatchFirst                       38.4ms	38.1ms	217µs		0.6%
- LiteralSearchAll                        71.4ms	70.7ms	729µs		1.0%
- EmailRFCAll                             43ms	42.2ms	787µs		1.9%
- DnaMatchAll                             114ms	112ms	1.16ms		1.0%
- symDiffCCC                              39.9ms	38.4ms	1.49ms		3.9%
- DnaEndsMatchAll                         202ms	200ms	2.46ms		1.2%
- EmailRFCNoMatchesAll                    102ms	98.3ms	3.96ms		4.0%
=== Improvements =====================================================================
- AnchoredNotFoundAll                     14.7ms	14.9ms	-143µs		-1.0%
- BasicRangeCCC                           10.3ms	10.4ms	-121µs		-1.2%
- BasicCCC                                9.58ms	9.69ms	-116µs		-1.2%
- CaseInsensitiveCCC                      11ms	11.1ms	-99.6µs		-0.9%

In the future we could probably continue to squeeze performance out by creating specialized instructions only in the hottest paths but for now I'll implement everything by adding bits into the payload

Got pretty large improvements from removing 6 instructions (matchScalar unchecked, case insensitive, case insensitive unchecked, match case insensitive, match seq, match scalar bitset). I'll have to rework how CompileTests work though, since it'll have to look in the payloads to differentiate the instructions now

Comparing against benchmark result file matchSeq_and_many_case_insens_instrs.json
=== Improvements =====================================================================
- DnaEndsMatchAll                         205ms	220ms	-15.4ms		-7.0%
- DnaMatchAll                             116ms	123ms	-7.73ms		-6.3%
- EmailLookaheadAll                       75.1ms	81.4ms	-6.26ms		-7.7%
- EmailRFCAll                             42.4ms	48.3ms	-5.91ms		-12.2%
- EmojiRegexAll                           62.4ms	68ms	-5.58ms		-8.2%
- EmailRFCNoMatchesAll                    98.8ms	104ms	-5.49ms		-5.3%

Add case-insensitive match instructions

rctcwyvrn · 2022-07-04T23:25:39Z

Currently CompileTests are broken, I'll fix them tomorrow

rctcwyvrn · 2022-07-07T21:20:59Z

@swift-ci test

Sources/_StringProcessing/ByteCodeGen.swift

milseman · 2022-07-07T23:55:44Z

Sources/_StringProcessing/Engine/InstPayload.swift

+  init(scalar: Unicode.Scalar, caseInsensitive: Bool, boundaryCheck: Bool) {
+    let raw = UInt64(scalar.value)
+      + (caseInsensitive ? 1 << 55: 0)
+      + (boundaryCheck ? 1 << 54 : 0)


Future: we'll be redoing instruction encoding and can clean this up alongside other changes.

milseman · 2022-07-08T00:46:06Z

Tests/RegexTests/CompileTests.swift

+  case captureValue
+  case builtinAssertion
+  case builtinCharacterClass
+}


Future: this will just be replaced with the encoding/decoding infrastructure

Previously we would allow consuming a single scalar in grapheme semantic mode for a DSL `.scalar`. Change this behavior such that it is treated the same as a character with a single scalar. That is: - In grapheme mode it must match an entire grapheme. - In scalar semantic mode, it may match a single scalar.

Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.

rctcwyvrn · 2022-07-11T18:25:16Z

@swift-ci test

milseman · 2022-07-11T21:40:25Z

@swift-ci please test

rctcwyvrn · 2022-07-11T23:23:08Z

@swift-ci test

rctcwyvrn · 2022-07-11T23:32:21Z

@swift-ci test

milseman · 2022-07-12T17:19:48Z

Sources/_StringProcessing/ByteCodeGen.swift

+        emitCharacter(Character(s))
+      } else {
+        emitMatchScalar(s)
+      }


This should fail tests, and if it's not we may need to write the tests and XFAIL them. E.g. /a\u{301}/, IIUC, as written would try to make a proper Character from the combining scalar.

rctcwyvrn · 2022-07-12T17:57:16Z

@swift-ci test

rctcwyvrn · 2022-07-12T17:58:13Z

@swift-ci test

rctcwyvrn · 2022-07-12T18:10:36Z

@swift-ci test

) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)

rctcwyvrn and others added 15 commits June 23, 2022 16:28

Add unicode and dna benchmarks

e518bdf

Fixup naming, turn off firstmatch, add benchmark filtering by regex

eb64036

Make --exclude-ns actually work correctly

1237f9a

Add usage comment for generateFasta

fdf2c23

First ver

eeb38e9

Merge branch 'main' into match-scalar

6f76f36

Remove matchseq entirely

3a2b324

Merge branch 'just-one-more-benchmark-suite' into temp

df8919e

Finish up matchScalar

c2ee8cc

Factor out nextScalarIndex for matchBitsetScalar

809b085

Add scalar mode support for matching bitsets + fix bug

06c77c7

Emit matchScalar in quotedLiteral when in unicode scalar mode

e3d7ad7

Add tests

3e1e088

Cleanup

b4c7c8c

Revert "Merge branch 'just-one-more-benchmark-suite' into temp"

5359e31

This reverts commit df8919e, reversing changes made to 3a2b324.

rctcwyvrn requested a review from milseman June 30, 2022 00:40

milseman reviewed Jun 30, 2022

View reviewed changes

rctcwyvrn added 3 commits July 4, 2022 16:18

Cleanup

6e4c2bd

Add case-insensitive match instructions

1a359b4

Add case-insensitive match instructions

Remove extra instructions and use payload bits instead

097ffeb

rctcwyvrn added 3 commits July 4, 2022 16:26

Comment out compiletests for now

76667a2

Fix compile tests

6a1f6e9

Merge branch 'main' into scalar-optimizations-clean

fce8f9a

milseman approved these changes Jul 8, 2022

View reviewed changes

hamishknight and others added 3 commits July 8, 2022 14:57

Preserve scalar syntax in DSL conversion

c6bc811

Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.

Change scalar semantics to match swiftlang#565

bca1d2b

rctcwyvrn mentioned this pull request Jul 11, 2022

Fix scalar matching in grapheme semantic mode #565

Closed

rctcwyvrn added 2 commits July 11, 2022 11:37

Merge branch 'not-to-scale' into scalar-optimizations-clean

79e60ac

Add edge case test

22cc9d5

milseman approved these changes Jul 11, 2022

View reviewed changes

Always match .scalar under grapheme semantics

7a9923d

rctcwyvrn added 2 commits July 11, 2022 16:30

Merge branch 'main' into scalar-optimizations-clean

47f8e66

Add new instructions to compile tests

7aa98d1

milseman reviewed Jul 12, 2022

View reviewed changes

Add an XFAIL test for scalar coalescing

113cfe3

Fix XCTExpectFailure for linux

9949b8e

rctcwyvrn merged commit 7752047 into swiftlang:main Jul 12, 2022

rctcwyvrn mentioned this pull request Jul 12, 2022

[5.7] Fix infinite loop and scalar matching in grapheme mode + scalar matching optimizations #569

Merged

hamishknight mentioned this pull request Jul 12, 2022

[\n] should not match \r\n #568

Closed

Optimize matching to match on scalar values when possible #525

Optimize matching to match on scalar values when possible #525

Uh oh!

Conversation

rctcwyvrn commented Jun 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rctcwyvrn commented Jul 4, 2022

Uh oh!

rctcwyvrn commented Jul 4, 2022

Uh oh!

rctcwyvrn commented Jul 7, 2022

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn commented Jul 11, 2022

Uh oh!

milseman commented Jul 11, 2022

Uh oh!

rctcwyvrn commented Jul 11, 2022

Uh oh!

rctcwyvrn commented Jul 11, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn commented Jul 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rctcwyvrn commented Jul 12, 2022

Uh oh!

rctcwyvrn commented Jul 12, 2022

Uh oh!

Uh oh!

rctcwyvrn commented Jun 30, 2022 •

edited

Loading

rctcwyvrn commented Jul 12, 2022 •

edited

Loading