Optimize matching to match on scalar values when possible #525
if options.semanticLevel == .graphemeCluster {
  if options.isCaseInsensitive {
    // future work: if all ascii, emit matchBitset instructions with
    // case insensitive bitsets
Other ideas are to have instructions that say to do case-insensitive matching, that is we have variants of match-character/scalar and similarly for match-sequence.
A special case for ASCII could be a dedicated instruction for the cased parts of ASCII (the letters). It could store both scalars to match, or just store the lowercase one and have the engine do the arithmetic when matching.
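A minimal sketch of the "store the lowercase and do the arithmetic" idea, assuming the stored scalar is always one of `a`...`z`. The function name `matchesASCIICaseInsensitive` is hypothetical and is not part of this PR or the engine.

```swift
/// Hypothetical helper (not from the PR): compares an input scalar against a
/// stored lowercase ASCII letter, ignoring case.
func matchesASCIICaseInsensitive(
  _ input: Unicode.Scalar,
  storedLowercase: Unicode.Scalar
) -> Bool {
  // Assumption: the compiler only emits this for the cased part of ASCII.
  precondition((0x61...0x7A).contains(storedLowercase.value))
  // ASCII upper- and lowercase letters differ only in bit 0x20, so setting
  // that bit on the input folds "A"..."Z" onto "a"..."z".
  return (input.value | 0x20) == storedLowercase.value
}
```

Because the stored value is restricted to `a`...`z`, the only inputs that can compare equal after folding are the letter itself and its uppercase form.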
I.e. we'll want to move all of this off the consumer interfaces at some point, but that doesn't have to happen in this PR.
if optimizationsEnabled && c.isASCII {
  for scalar in c.unicodeScalars {
    let boundaryCheck = scalar == c.unicodeScalars.last!
    builder.buildMatchScalar(scalar, boundaryCheck: boundaryCheck)
Another idea for the builder is to have a convenience function that adds a grapheme boundary check at the current position. It could either update the most recent instruction if that instruction is capable of supporting it or insert the anchor otherwise.
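The convenience function suggested here could look roughly like the sketch below. `ToyInstruction` and `ToyBuilder` are stand-ins invented for illustration and do not mirror the real builder's types. The sketch only shows the "update the most recent instruction if it can carry the check, otherwise insert an anchor" logic.

```swift
/// Toy instruction set for illustration only.
enum ToyInstruction: Equatable {
  case matchScalar(Unicode.Scalar, boundaryCheck: Bool)
  case assertGraphemeBoundary
}

/// Toy builder for illustration only.
struct ToyBuilder {
  var instructions: [ToyInstruction] = []

  mutating func buildMatchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool = false) {
    instructions.append(.matchScalar(s, boundaryCheck: boundaryCheck))
  }

  /// Adds a grapheme boundary check at the current position.
  mutating func buildGraphemeBoundaryCheck() {
    if case .matchScalar(let s, _)? = instructions.last {
      // The previous instruction can carry the check in its payload.
      instructions[instructions.count - 1] = .matchScalar(s, boundaryCheck: true)
    } else {
      // Otherwise fall back to a standalone anchor instruction.
      instructions.append(.assertGraphemeBoundary)
    }
  }
}
```

With something like this, the compiler loop would not need to compute `boundaryCheck` per scalar; it could emit every scalar unchecked and call the convenience function once at the end.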
mutating func matchScalar(_ s: Unicode.Scalar, boundaryCheck: Bool) -> Bool {
  guard let curScalar = loadScalar(),
        curScalar == s,
        let idx = nextScalarIndex(offsetBy: 1, boundaryCheck: boundaryCheck)
It might make more sense to drop the helper function and just do this directly:
guard s == loadScalar(), !boundaryCheck || isOnGrapheme... else { ... }
Longer term we will want to use a lower-level String interface that returns both the current scalar and the next index. E.g. a nicer version of https://github.com/apple/swift/blob/main/stdlib/public/core/UnicodeHelpers.swift#L107.
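A standalone sketch of the "do it directly" shape, written against public `String` API rather than the engine's internals. `matchScalarSketch` and `isOnGraphemeBoundary` are hypothetical names, and the boundary test here is a deliberately naive linear walk from the start of the string, not what the engine would use.

```swift
/// Naive boundary test (illustration only): walks Character indices from the
/// start and checks whether one of them lands exactly on `i`.
func isOnGraphemeBoundary(_ i: String.Index, in s: String) -> Bool {
  var j = s.startIndex
  while j < i { s.formIndex(after: &j) }
  return j == i
}

/// Returns the index after the matched scalar, or nil if the match fails.
func matchScalarSketch(
  _ s: Unicode.Scalar,
  at pos: String.Index,
  in input: String,
  boundaryCheck: Bool
) -> String.Index? {
  let scalars = input.unicodeScalars
  // Load the current scalar and compare it directly.
  guard pos < scalars.endIndex, scalars[pos] == s else { return nil }
  let next = scalars.index(after: pos)
  // Only pay for the boundary test when the instruction asks for it.
  guard !boundaryCheck || isOnGraphemeBoundary(next, in: input) else { return nil }
  return next
}
```

For the input `"e\u{301}x"`, matching `e` at the start succeeds without the boundary check but fails with it, because the position after `e` falls inside the grapheme cluster.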
Or you could consider adding that now if you like, something like the below and we can micro-optimize it later:
extension String {
  func _consumeScalar(in bounds: Range<String.Index>) -> (Unicode.Scalar, Index)? {
    // `bounds` must lie within the string.
    assert(bounds.lowerBound >= startIndex && bounds.upperBound <= endIndex)
    ...
  }
}
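One possible way to fill in such a helper using only the public `unicodeScalars` view is sketched below. The name `consumeScalarSketch` is hypothetical, the sketch assumes `bounds.lowerBound` is scalar-aligned, and it makes no attempt at the micro-optimizations mentioned above.

```swift
extension String {
  /// Hypothetical, unoptimized sketch: returns the scalar at the start of
  /// `bounds` together with the index of the following scalar, or nil when
  /// `bounds` is empty.
  func consumeScalarSketch(
    in bounds: Range<String.Index>
  ) -> (Unicode.Scalar, String.Index)? {
    precondition(bounds.lowerBound >= startIndex && bounds.upperBound <= endIndex)
    guard bounds.lowerBound < bounds.upperBound else { return nil }
    let scalars = unicodeScalars
    // Assumption: lowerBound is scalar-aligned.
    let scalar = scalars[bounds.lowerBound]
    return (scalar, scalars.index(after: bounds.lowerBound))
  }
}
```

Returning the scalar and the next index together lets a caller such as a scalar-matching instruction compare and advance without going through the scalar view twice.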
While implementing case-insensitive matching instructions I opted to add a new instruction for each new case. However, after benchmarking that approach I noticed pretty significant regressions.
After adjusting the scalar matching instructions to use one instruction with a more complex payload instead of 3 instructions, the regression was much less significant.
In the future we could probably continue to squeeze out performance by creating specialized instructions only in the hottest paths, but for now I'll implement everything by adding bits into the payload.
Got pretty large improvements from removing 6 instructions (matchScalar unchecked, case insensitive, case insensitive unchecked, match case insensitive, match seq, match scalar bitset). I'll have to rework how
Add case-insensitive match instructions
Currently
@swift-ci test
init(scalar: Unicode.Scalar, caseInsensitive: Bool, boundaryCheck: Bool) {
  let raw = UInt64(scalar.value)
    + (caseInsensitive ? 1 << 55 : 0)
    + (boundaryCheck ? 1 << 54 : 0)
Future: we'll be redoing instruction encoding and can clean this up alongside other changes.
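For readers unfamiliar with the layout, here is a self-contained sketch of packing a scalar and the two flags into a 64-bit payload and reading them back. The bit positions 55 and 54 come from the diff above. The type `ScalarPayloadSketch`, its accessors, and the 21-bit mask are assumptions for illustration and do not reflect the engine's actual payload type.

```swift
/// Illustrative payload layout: scalar value in the low bits, flags at
/// bits 55 and 54 (positions taken from the diff; everything else assumed).
struct ScalarPayloadSketch {
  static let caseInsensitiveBit: UInt64 = 1 << 55
  static let boundaryCheckBit: UInt64 = 1 << 54
  // Unicode scalar values fit in 21 bits (maximum 0x10FFFF).
  static let scalarMask: UInt64 = 0x1F_FFFF

  let rawValue: UInt64

  init(scalar: Unicode.Scalar, caseInsensitive: Bool, boundaryCheck: Bool) {
    var raw = UInt64(scalar.value)
    if caseInsensitive { raw |= Self.caseInsensitiveBit }
    if boundaryCheck { raw |= Self.boundaryCheckBit }
    rawValue = raw
  }

  var scalar: Unicode.Scalar {
    // Safe to unwrap: the low bits were written from a valid scalar.
    Unicode.Scalar(UInt32(rawValue & Self.scalarMask))!
  }
  var isCaseInsensitive: Bool { rawValue & Self.caseInsensitiveBit != 0 }
  var boundaryCheck: Bool { rawValue & Self.boundaryCheckBit != 0 }
}
```

Since the flag bits sit far above the 21 bits a scalar can occupy, adding them (as in the diff) and OR-ing them (as here) produce the same raw value.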
  case captureValue
  case builtinAssertion
  case builtinCharacterClass
}
Future: this will just be replaced with the encoding/decoding infrastructure
Previously we would allow consuming a single scalar in grapheme semantic mode for a DSL `.scalar`. Change this behavior such that it is treated the same as a character with a single scalar. That is:
- In grapheme mode it must match an entire grapheme.
- In scalar semantic mode, it may match a single scalar.
Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.
@swift-ci test
  emitCharacter(Character(s))
} else {
  emitMatchScalar(s)
}
This should fail tests, and if it doesn't we may need to write the tests and XFAIL them. E.g. /a\u{301}/, IIUC, as written would try to make a proper Character from the combining scalar.
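The concern with `a\u{301}` can be seen with plain `String` API, independent of the regex engine. The helper names below are hypothetical. The sketch shows that the input forms one grapheme from two scalars, whereas building a `Character` per scalar yields two separate characters that no longer line up with how the input segments.

```swift
/// Hypothetical helper: counts graphemes and scalars in a string.
func graphemeAndScalarCounts(_ s: String) -> (characters: Int, scalars: Int) {
  (s.count, s.unicodeScalars.count)
}

/// Hypothetical helper: what a per-scalar `Character(s)` conversion produces.
func perScalarCharacters(_ s: String) -> [Character] {
  // Each scalar, including a lone combining mark, becomes its own Character.
  s.unicodeScalars.map { Character($0) }
}

let accented = "a\u{301}"
let counts = graphemeAndScalarCounts(accented)
let pieces = perScalarCharacters(accented)
// The input segments as one Character, but the per-scalar conversion yields
// two, so a character-by-character match would look for "a" as a whole
// grapheme and never find it.
let inputCharacters = Array(accented)
```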
@swift-ci test
- Adds new instructions for matching characters and scalars case insensitively
- Compiles ASCII character matches into the faster scalar match instructions even in grapheme semantic mode
- Optimizes out unnecessary runtime grapheme boundary checks for all ASCII strings
- Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)
- Uses `matchScalar` instructions when emitting ASCII quoted literals, ASCII characters, and any scalar, as well as when in unicode scalars mode.
- `matchScalar` stores the scalar value inline instead of in a register like the character for `match`, making it faster and less susceptible to ARC.
- `matchScalarBitset`, to use the optimization from "Use a bitset for ascii-only character classes" (#511) when in unicode scalars mode.

Future work: We currently only match on ASCII because we may be given non-normalized strings. Two options to deal with this are to
A. normalize inputs
B. compile two different programs, one optimized for the case where the input string is normalized and the other not making any assumptions
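The normalization concern can be illustrated with plain `String` API. Swift compares strings by canonical equivalence, so a precomposed and a decomposed "é" are equal as strings even though their scalar sequences differ, which is why a scalar-by-scalar match on non-ASCII content could miss equivalent input. The helper names below are hypothetical.

```swift
/// Hypothetical helper: true when two strings have identical scalar sequences.
func scalarsMatchExactly(_ a: String, _ b: String) -> Bool {
  a.unicodeScalars.elementsEqual(b.unicodeScalars)
}

/// Hypothetical helper: true when every scalar in the string is ASCII.
func isAllASCII(_ s: String) -> Bool {
  s.unicodeScalars.allSatisfy { $0.isASCII }
}

let precomposed = "\u{E9}"    // é as a single scalar
let decomposed = "e\u{301}"   // e followed by a combining acute accent

let equalAsStrings = precomposed == decomposed
let equalAsScalars = scalarsMatchExactly(precomposed, decomposed)
```

ASCII-only patterns avoid this mismatch because each ASCII scalar has only one canonical representation; the remaining risk, a following combining mark in the input, is what the boundary check covers.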
Large wins in the DNA benchmarks from #515 since they have long quoted literal sequences in their patterns