
Commit 836f327

milseman, natecook1000, finagolfin, ole, and amartini51 authored
Update swift/main with recent changes (#651)
* Atomically load the lowered program (#610)

  Since we're atomically initializing the compiled program in `Regex.Program`, we need to pair that with an atomic load. Resolves #609.

* Add tests for line start/end word boundary diffs (#616)

  The `default` and `simple` word boundaries have different behaviors at the start and end of strings/lines. These tests validate that we have the correct behavior implemented. Related to issue #613.

* Add tweaks for Android

* Fix documentation typo (#615)

* Fix abstract for Regex.dotMatchesNewlines(_:). (#614)

  The old version looks like it was accidentally duplicated from anchorsMatchLineEndings(_:) just below it.

* Remove `RegexConsumer` and fix its dependencies (#617)

  * Remove `RegexConsumer` and fix its dependencies

    This eliminates the RegexConsumer type and rewrites its users to call through to other, existing functionality on Regex or in the Algorithms implementations. RegexConsumer doesn't take account of the dual subranges required for matching, so it can produce results that are inconsistent with matches(of:) and ranges(of:), which were rewritten earlier. rdar://102841216

  * Remove remaining from-end algorithm methods

    This removes methods that are left over from when we were considering from-end algorithms. These aren't tested and may not have the correct semantics, so it's safer to remove them entirely.

* Improve StringProcessing and RegexBuilder documentation (#611)

  This includes documentation improvements for core types/methods, RegexBuilder types along with their generated variadic initializers, and adds some curation. It also includes tests of the documentation code samples.

* Set availability for inverted character class test (#621)

  This feature depends on running with a Swift 5.7 stdlib, and fails when that isn't available.

* Add type annotations in RegexBuilder tests

  These changes work around a change to the way result builders are compiled that removes the ability for result builder closure outputs to affect the overload resolution elsewhere in an expression. Workarounds for rdar://104881395 and rdar://104645543

* Workaround for fileprivate array issue

  A recent compiler change results in fileprivate arrays sometimes not keeping their buffers around long enough. This change avoids that issue by removing the fileprivate annotations from the affected type.

* Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>

* Stop at end of search string in TwoWaySearcher (#631)

  When searching for a substring that doesn't exist, it was possible for TwoWaySearcher to advance beyond the end of the search string, causing a crash. This change adds a `limitedBy:` parameter to that index movement, avoiding the invalid movement. Fixes rdar://105154010

* Correct misspelling in DSL renderer (#627)

  vertial -> vertical

  rdar://104602317

* Fix output type mismatch with RegexBuilder (#626)

  Some regex literals (and presumably other `Regex` instances) lose their output type information when used in a RegexBuilder closure due to the way the concatenating builder calls are overloaded. In particular, any output type with labeled tuples or where the sum of tuple components in the accumulated and new output types is greater than 10 will be ignored.

  Regex internals don't make this distinction, however, so there ends up being a mismatch between what a `Regex.Match` instance tries to produce and the output type of the outermost regex. For example, this code results in a crash, because `regex` is a `Regex<Substring>` but the match tries to produce a `(Substring, number: Substring)`:

      let regex = Regex {
          ZeroOrMore(.whitespace)
          /:(?<number>\d+):/
          ZeroOrMore(.whitespace)
      }
      let match = try regex.wholeMatch(in: " :21: ")
      print(match!.output)

  To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node to mark situations where the output type is discarded. This status is propagated through the capture list into the match's storage, which lets us produce the correct output type. Note that we can't just drop the capture groups when building the compiled program because (1) different parts of the regex might reference the capture group and (2) all capture groups are available if a developer converts the output to `AnyRegexOutput`.

      let anyOutput = AnyRegexOutput(match)
      // anyOutput[1] == "21"
      // anyOutput["number"] == Optional("21")

  Fixes #625. rdar://104823356

  Note: Linux seems to crash on different tests when the two customTest overloads have `internal` visibility or are called. Switching one of the functions to be generic over a RegexComponent works around the issue.

* Revert "Merge pull request #628 from apple/result_builder_changes_workaround"

  This reverts commit 7e059b7, reversing changes made to 3ca8b13.

* Use `some` syntax in variadics

  This supports a type checker fix after the change in how result builder closure parameters are type-checked.

* Type checker workaround: adjust test

* Further refactor to work around type checker regression

* Align availability macro with OS versions (#641)

* Speed up general character class matching (#642)

  Short-circuit Character.isASCII checks inside built in character class matching. Also, make benchmark try a few more times before giving up.

* Test for \s matching CRLF when scalar matching (#648)

* General ascii fast paths for character classes (#644)

  General ASCII fast-paths for builtin character classes

* Remove the unsupported `anyScalar` case (#650)

  We decided not to support the `anyScalar` character class, which would match a single Unicode scalar regardless of matching mode. However, its representation was still included in the various character class types in the regex engine, leading to unreachable code and unclear requirements when changing or adding new code. This change removes that representation where possible.

  The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it is marked `@_spi(RegexBuilder) public`. Any use of that enum case is handled with a `fatalError("Unsupported")`, and it isn't produced on any code path.

---------

Co-authored-by: Nate Cook <[email protected]>
Co-authored-by: Butta <[email protected]>
Co-authored-by: Ole Begemann <[email protected]>
Co-authored-by: Alex Martini <[email protected]>
Co-authored-by: Alejandro Alonso <[email protected]>
Co-authored-by: David Ewing <[email protected]>
Co-authored-by: Dave Ewing <[email protected]>
1 parent 4121708 commit 836f327

19 files changed: +432 -218 lines changed

Diff for: Documentation/ProgrammersManual.md

+30
@@ -0,0 +1,30 @@
+# Programmer's Manual
+
+## Programming patterns
+
+### Engine quick checks and fast paths
+
+In the engine nomenclature, a quick-check results in a yes/no/maybe while a thorough check always results in a definite answer.
+
+The nature of quick checks and fast paths is that they bifurcate testing coverage. One easy way to prevent this in simple cases is to assert that a definite quick result matches the thorough result.
+
+One example of this pattern is matching against a builtin character class. The engine has a `_matchBuiltinCC`
+
+```swift
+func _matchBuiltinCC(...) -> Input.Index? {
+  // Calls _quickMatchBuiltinCC, if that gives a definite result
+  // asserts that it is the same as the result of
+  // _thoroughMatchBuiltinCC and returns it. Otherwise returns the
+  // result of _thoroughMatchBuiltinCC
+}
+
+@inline(__always)
+func _quickMatchBuiltinCC(...) -> QuickResult<Input.Index?>
+
+@inline(never)
+func _thoroughMatchBuiltinCC(...) -> Input.Index?
+```
+
+The thorough check is never inlined, as it is a lot of cold code. Note that quick and thorough functions should be pure, that is they shouldn't update processor state.
+
+
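
Reviewer's sketch of the pattern the manual describes, reduced to a standalone digit check. The `QuickResult` enum here is a hypothetical stand-in (only its `.definite` and `.unknown` case names appear in this commit), and `quickIsDigit`, `thoroughIsDigit`, and `isDigit` are illustrative names, not engine functions. The ASCII-only quick path is an assumption made for the example.

```swift
// Hypothetical stand-in for the engine's result type. Only the case names
// `.definite` and `.unknown` are taken from this commit.
enum QuickResult<R> {
  case definite(R)
  case unknown
}

// Quick check (assumption: an ASCII-only fast path). It answers yes/no for
// ASCII characters and "maybe" (`.unknown`) for everything else.
@inline(__always)
func quickIsDigit(_ c: Character) -> QuickResult<Bool> {
  guard let ascii = c.asciiValue else { return .unknown }
  return .definite(ascii >= UInt8(ascii: "0") && ascii <= UInt8(ascii: "9"))
}

// Thorough check: always produces a definite answer and is kept out of line.
@inline(never)
func thoroughIsDigit(_ c: Character) -> Bool {
  c.isNumber
}

// Entry point: trust a definite quick result, but assert in debug builds
// that it agrees with the thorough result.
func isDigit(_ c: Character) -> Bool {
  if case .definite(let result) = quickIsDigit(c) {
    assert(result == thoroughIsDigit(c))
    return result
  }
  return thoroughIsDigit(c)
}
```

Because both helpers are pure, the assertion can call the thorough check a second time without side effects, which is the property the manual asks quick and thorough functions to keep.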

Diff for: Package.swift

+1 -1
@@ -7,7 +7,7 @@ let availabilityDefinition = PackageDescription.SwiftSetting.unsafeFlags([
   "-Xfrontend",
   "-define-availability",
   "-Xfrontend",
-  "SwiftStdlib 5.7:macOS 9999, iOS 9999, watchOS 9999, tvOS 9999",
+  "SwiftStdlib 5.7:macOS 13.0, iOS 16.0, watchOS 9.0, tvOS 16.0",
   "-Xfrontend",
   "-define-availability",
   "-Xfrontend",

Diff for: Sources/RegexBenchmark/BenchmarkRunner.swift

+13 -5
@@ -1,6 +1,9 @@
 import Foundation
 @_spi(RegexBenchmark) import _StringProcessing
 
+/// The number of times to re-run the benchmark if results are too varying
+private var rerunCount: Int { 3 }
+
 struct BenchmarkRunner {
   let suiteName: String
   var suite: [any RegexBenchmark] = []
@@ -82,11 +85,16 @@ struct BenchmarkRunner {
     for b in suite {
       var result = measure(benchmark: b, samples: samples)
       if result.runtimeIsTooVariant {
-        print("Warning: Standard deviation > \(Stats.maxAllowedStdev*100)% for \(b.name)")
-        print(result.runtime)
-        print("Rerunning \(b.name)")
-        result = measure(benchmark: b, samples: result.runtime.samples*2)
-        print(result.runtime)
+        for _ in 0..<rerunCount {
+          print("Warning: Standard deviation > \(Stats.maxAllowedStdev*100)% for \(b.name)")
+          print(result.runtime)
+          print("Rerunning \(b.name)")
+          result = measure(benchmark: b, samples: result.runtime.samples*2)
+          print(result.runtime)
+          if !result.runtimeIsTooVariant {
+            break
+          }
+        }
         if result.runtimeIsTooVariant {
           fatalError("Benchmark \(b.name) is too variant")
         }
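
Reviewer's sketch of the rerun loop above in isolation: measure once, then rerun with doubled samples up to a fixed number of times, stopping early once the result is stable. `Measurement`, `measureWithReruns`, and the `measure` closure are hypothetical names for illustration; only the rerun-and-break shape is taken from the diff.

```swift
// Hypothetical result type; the real runner's result carries more data.
struct Measurement {
  var samples: Int
  var isTooVariant: Bool
}

// Measures once, then reruns with doubled samples at most `rerunCount`
// times, stopping as soon as a run is no longer too variant.
func measureWithReruns(
  initialSamples: Int,
  rerunCount: Int,
  measure: (Int) -> Measurement
) -> Measurement {
  var result = measure(initialSamples)
  if result.isTooVariant {
    for _ in 0..<rerunCount {
      result = measure(result.samples * 2)
      if !result.isTooVariant {
        break
      }
    }
  }
  return result
}
```

The caller still has to decide what to do when the final result remains too variant; the diff treats that case as a fatal error.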

Diff for: Sources/TestSupport/TestSupport.swift

+1 -1
@@ -15,7 +15,7 @@ import XCTest
 // *without* `-disable-availability-checking` to ensure the #available check is
 // not compiled into a no-op.
 
-#if os(Linux)
+#if os(Linux) || os(Android)
 public func XCTExpectFailure(
   _ message: String? = nil, body: () throws -> Void
 ) rethrows {}

Diff for: Sources/VariadicsGenerator/VariadicsGenerator.swift

+1 -1
@@ -14,7 +14,7 @@
 import ArgumentParser
 #if os(macOS)
 import Darwin
-#elseif os(Linux)
+#elseif canImport(Glibc)
 import Glibc
 #elseif os(Windows)
 import CRT

Diff for: Sources/_StringProcessing/ByteCodeGen.swift

-3
@@ -702,9 +702,6 @@ fileprivate extension Compiler.ByteCodeGen {
     case .characterClass(let cc):
       // Custom character class that consumes a single grapheme
       let model = cc.asRuntimeModel(options)
-      guard model.consumesSingleGrapheme else {
-        return false
-      }
       builder.buildQuantify(
         model: model,
         kind,

Diff for: Sources/_StringProcessing/CMakeLists.txt

+1
@@ -47,6 +47,7 @@ add_library(_StringProcessing
   Regex/DSLTree.swift
   Regex/Match.swift
   Regex/Options.swift
+  Unicode/ASCII.swift
   Unicode/CaseConversion.swift
   Unicode/CharacterProps.swift
   Unicode/Comparison.swift

Diff for: Sources/_StringProcessing/Engine/MEBuiltins.swift

+152 -99
@@ -9,114 +9,26 @@ extension Character {
 }
 
 extension Processor {
-  mutating func matchBuiltin(
+  mutating func matchBuiltinCC(
     _ cc: _CharacterClassModel.Representation,
-    _ isInverted: Bool,
-    _ isStrictASCII: Bool,
-    _ isScalarSemantics: Bool
+    isInverted: Bool,
+    isStrictASCII: Bool,
+    isScalarSemantics: Bool
   ) -> Bool {
-    guard let next = _doMatchBuiltin(
+    guard let next = input._matchBuiltinCC(
       cc,
-      isInverted,
-      isStrictASCII,
-      isScalarSemantics
+      at: currentPosition,
+      isInverted: isInverted,
+      isStrictASCII: isStrictASCII,
+      isScalarSemantics: isScalarSemantics
     ) else {
       signalFailure()
       return false
     }
     currentPosition = next
     return true
   }
-
-  func _doMatchBuiltin(
-    _ cc: _CharacterClassModel.Representation,
-    _ isInverted: Bool,
-    _ isStrictASCII: Bool,
-    _ isScalarSemantics: Bool
-  ) -> Input.Index? {
-    guard let char = load(), let scalar = loadScalar() else {
-      return nil
-    }
 
-    let asciiCheck = (char.isASCII && !isScalarSemantics)
-      || (scalar.isASCII && isScalarSemantics)
-      || !isStrictASCII
-
-    var matched: Bool
-    var next: Input.Index
-    switch (isScalarSemantics, cc) {
-    case (_, .anyGrapheme):
-      next = input.index(after: currentPosition)
-    case (_, .anyScalar):
-      next = input.unicodeScalars.index(after: currentPosition)
-    case (true, _):
-      next = input.unicodeScalars.index(after: currentPosition)
-    case (false, _):
-      next = input.index(after: currentPosition)
-    }
-
-    switch cc {
-    case .any, .anyGrapheme:
-      matched = true
-    case .anyScalar:
-      if isScalarSemantics {
-        matched = true
-      } else {
-        matched = input.isOnGraphemeClusterBoundary(next)
-      }
-    case .digit:
-      if isScalarSemantics {
-        matched = scalar.properties.numericType != nil && asciiCheck
-      } else {
-        matched = char.isNumber && asciiCheck
-      }
-    case .horizontalWhitespace:
-      if isScalarSemantics {
-        matched = scalar.isHorizontalWhitespace && asciiCheck
-      } else {
-        matched = char._isHorizontalWhitespace && asciiCheck
-      }
-    case .verticalWhitespace:
-      if isScalarSemantics {
-        matched = scalar.isNewline && asciiCheck
-      } else {
-        matched = char._isNewline && asciiCheck
-      }
-    case .newlineSequence:
-      if isScalarSemantics {
-        matched = scalar.isNewline && asciiCheck
-        if matched && scalar == "\r"
-            && next != input.endIndex && input.unicodeScalars[next] == "\n" {
-          // Match a full CR-LF sequence even in scalar semantics
-          input.unicodeScalars.formIndex(after: &next)
-        }
-      } else {
-        matched = char._isNewline && asciiCheck
-      }
-    case .whitespace:
-      if isScalarSemantics {
-        matched = scalar.properties.isWhitespace && asciiCheck
-      } else {
-        matched = char.isWhitespace && asciiCheck
-      }
-    case .word:
-      if isScalarSemantics {
-        matched = scalar.properties.isAlphabetic && asciiCheck
-      } else {
-        matched = char.isWordCharacter && asciiCheck
-      }
-    }
-
-    if isInverted {
-      matched.toggle()
-    }
-
-    guard matched else {
-      return nil
-    }
-    return next
-  }
-
   func isAtStartOfLine(_ payload: AssertionPayload) -> Bool {
     if currentPosition == subjectBounds.lowerBound { return true }
     switch payload.semanticLevel {
@@ -126,7 +38,7 @@ extension Processor {
       return input.unicodeScalars[input.unicodeScalars.index(before: currentPosition)].isNewline
     }
   }
-
+
   func isAtEndOfLine(_ payload: AssertionPayload) -> Bool {
     if currentPosition == subjectBounds.upperBound { return true }
     switch payload.semanticLevel {
@@ -169,7 +81,7 @@ extension Processor {
       return isAtStartOfLine(payload)
     case .endOfLine:
       return isAtEndOfLine(payload)
-
+
     case .caretAnchor:
       if payload.anchorsMatchNewlines {
         return isAtStartOfLine(payload)
@@ -202,3 +114,144 @@ extension Processor {
     }
   }
 }
+
+// MARK: Built-in character class matching
+
+extension String {
+
+  // Mentioned in ProgrammersManual.md, update docs if redesigned
+  func _matchBuiltinCC(
+    _ cc: _CharacterClassModel.Representation,
+    at currentPosition: String.Index,
+    isInverted: Bool,
+    isStrictASCII: Bool,
+    isScalarSemantics: Bool
+  ) -> String.Index? {
+    guard currentPosition < endIndex else {
+      return nil
+    }
+    if case .definite(let result) = _quickMatchBuiltinCC(
+      cc,
+      at: currentPosition,
+      isInverted: isInverted,
+      isStrictASCII: isStrictASCII,
+      isScalarSemantics: isScalarSemantics
+    ) {
+      assert(result == _thoroughMatchBuiltinCC(
+        cc,
+        at: currentPosition,
+        isInverted: isInverted,
+        isStrictASCII: isStrictASCII,
+        isScalarSemantics: isScalarSemantics))
+      return result
+    }
+    return _thoroughMatchBuiltinCC(
+      cc,
+      at: currentPosition,
+      isInverted: isInverted,
+      isStrictASCII: isStrictASCII,
+      isScalarSemantics: isScalarSemantics)
+  }
+
+  // Mentioned in ProgrammersManual.md, update docs if redesigned
+  @inline(__always)
+  func _quickMatchBuiltinCC(
+    _ cc: _CharacterClassModel.Representation,
+    at currentPosition: String.Index,
+    isInverted: Bool,
+    isStrictASCII: Bool,
+    isScalarSemantics: Bool
+  ) -> QuickResult<String.Index?> {
+    assert(currentPosition < endIndex)
+    guard let (next, result) = _quickMatch(
+      cc, at: currentPosition, isScalarSemantics: isScalarSemantics
+    ) else {
+      return .unknown
+    }
+    return .definite(result == isInverted ? nil : next)
+  }
+
+  // Mentioned in ProgrammersManual.md, update docs if redesigned
+  @inline(never)
+  func _thoroughMatchBuiltinCC(
+    _ cc: _CharacterClassModel.Representation,
+    at currentPosition: String.Index,
+    isInverted: Bool,
+    isStrictASCII: Bool,
+    isScalarSemantics: Bool
+  ) -> String.Index? {
+    assert(currentPosition < endIndex)
+    let char = self[currentPosition]
+    let scalar = unicodeScalars[currentPosition]
+
+    let asciiCheck = !isStrictASCII
+      || (scalar.isASCII && isScalarSemantics)
+      || char.isASCII
+
+    var matched: Bool
+    var next: String.Index
+    switch (isScalarSemantics, cc) {
+    case (_, .anyGrapheme):
+      next = index(after: currentPosition)
+    case (true, _):
+      next = unicodeScalars.index(after: currentPosition)
+    case (false, _):
+      next = index(after: currentPosition)
+    }
+
+    switch cc {
+    case .any, .anyGrapheme:
+      matched = true
+    case .digit:
+      if isScalarSemantics {
+        matched = scalar.properties.numericType != nil && asciiCheck
+      } else {
+        matched = char.isNumber && asciiCheck
+      }
+    case .horizontalWhitespace:
+      if isScalarSemantics {
+        matched = scalar.isHorizontalWhitespace && asciiCheck
+      } else {
+        matched = char._isHorizontalWhitespace && asciiCheck
+      }
+    case .verticalWhitespace:
+      if isScalarSemantics {
+        matched = scalar.isNewline && asciiCheck
+      } else {
+        matched = char._isNewline && asciiCheck
+      }
+    case .newlineSequence:
+      if isScalarSemantics {
+        matched = scalar.isNewline && asciiCheck
+        if matched && scalar == "\r"
+            && next != endIndex && unicodeScalars[next] == "\n" {
+          // Match a full CR-LF sequence even in scalar semantics
+          unicodeScalars.formIndex(after: &next)
+        }
+      } else {
+        matched = char._isNewline && asciiCheck
+      }
+    case .whitespace:
+      if isScalarSemantics {
+        matched = scalar.properties.isWhitespace && asciiCheck
+      } else {
+        matched = char.isWhitespace && asciiCheck
+      }
+    case .word:
+      if isScalarSemantics {
+        matched = scalar.properties.isAlphabetic && asciiCheck
+      } else {
+        matched = char.isWordCharacter && asciiCheck
+      }
+    }
+
+    if isInverted {
+      matched.toggle()
+    }
+
+    guard matched else {
+      return nil
+    }
+    return next
+  }
+}
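
Reviewer's note on the two ways this diff applies `isInverted`. The quick path folds inversion into one expression (`result == isInverted ? nil : next`), while the thorough path toggles `matched` and then guards on it. The sketch below uses hypothetical function names and `Int` in place of `String.Index` to show the two forms side by side.

```swift
// Quick-path style: one expression. `nil` means "no match at this position".
// `==` binds tighter than `?:`, so this reads as (matched == isInverted) ? nil : next.
func quickStyle(matched: Bool, isInverted: Bool, next: Int) -> Int? {
  matched == isInverted ? nil : next
}

// Thorough-path style: toggle the flag, then guard on it.
func thoroughStyle(matched: Bool, isInverted: Bool, next: Int) -> Int? {
  var matched = matched
  if isInverted {
    matched.toggle()
  }
  guard matched else {
    return nil
  }
  return next
}
```

Both return `next` exactly when `matched` and `isInverted` differ, so a definite quick result can be compared directly against the thorough result in the assertion.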
