Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Extend support for static string matching on regex_replace #12023

Open
revans2 opened this issue Jan 24, 2025 · 1 comment
Open

[FEA] Extend support for static string matching on regex_replace #12023

revans2 opened this issue Jan 24, 2025 · 1 comment
Labels
feature request New feature or request performance A performance related task/issue

Comments

@revans2
Copy link
Collaborator

revans2 commented Jan 24, 2025

Is your feature request related to a problem? Please describe.
When we convert a regex_replace call to run on the GPU we parse the regular expression and transpile it to be able to run on the CUDF regular expression engine. We also include a performance optimization where we can look to see if we could extract out a list of static strings to match instead, because this tends to be much more efficient.

The code for this is very restrictive because it relies on converting part of the regular expression back to a string and checking it against a list of known bad character sequences.

def isSupportedStringReplacePattern(strLit: String): Boolean = {
// check for regex special characters, except for \u0000, \n, \r, and \t which we can support
val supported = Seq("\u0000", "\n", "\r", "\t")
!regexList.filterNot(supported.contains(_)).exists(pattern => strLit.contains(pattern))
}

private[this] lazy val regexList: Seq[String] = Seq("\\", "\u0000", "\\x", "\t", "\n", "\r",
"\f", "\\a", "\\e", "\\cx", "[", "]", "^", "&", ".", "*", "\\d", "\\D", "\\h", "\\H", "\\s",
"\\S", "\\v", "\\V", "\\w", "\\w", "\\p", "$", "\\b", "\\B", "\\A", "\\G", "\\Z", "\\z", "\\R",
"?", "|", "(", ")", "{", "}", "\\k", "\\Q", "\\E", ":", "!", "<=", ">")
val regexMetaChars = ".$^[]\\|?*+(){}"

This is correct, but also too restrictive.

Like if I wanted to match the '{' character it would not work because '{' is in the disallow list, even if I escaped it first.

regexp_replace(something, '\\{', '!!OPEN_BRACKET!!')

I don't understand why we are doing this when we have the tokenized regex already that is used to do the checks.

def getChoicesFromRegex(regex: RegexAST): Option[Seq[String]] = {
regex match {
case RegexGroup(_, t, None) =>
getChoicesFromRegex(t)
case RegexChoice(a, b) =>
getChoicesFromRegex(a) match {
case Some(la) =>
getChoicesFromRegex(b) match {
case Some(lb) => Some(la ++ lb)
case _ => None
}
case _ => None
}
case RegexSequence(parts) =>
if (GpuOverrides.isSupportedStringReplacePattern(regex.toRegexString)) {
Some(Seq(regex.toRegexString))
} else {
parts.foldLeft(Some(Seq[String]()): Option[Seq[String]]) { (m: Option[Seq[String]], r) =>
getChoicesFromRegex(r) match {
case Some(l) => m.map(_ ++ l)
case _ => None
}
}
}
case _ =>
if (GpuOverrides.isSupportedStringReplacePattern(regex.toRegexString)) {
Some(Seq(regex.toRegexString))
} else {
None
}
}
}

If we know all of the types a RegexAST can be, then we don't need to convert them back to a string. We can just do pattern matching and know for sure that this will work or not. We don't have to be complete about this in the first go at it. We could use this to augment/extend the existing code until we have a robust solution.

Here is my initial guess at fairly complete syntax that we could/should support (in a YACC like pseudo code pattern).

// A single static sting that is not a regular expression at all.
STATIC_STRING = RegexChar | // the static string value would be char in here. Note that with the current parser it looks like '\r', '\n', '\t', '\f', and '\e' should be covered by this in some form.
    RegexHexDigit | RegexOctalChar | // the static string value would be the code point in it converted to a character
    RegexEscaped([' | ']' | '\\' | '^' | '$' | '.' | '|' | '?' | '*' | '+' | '(' | ')' | '{' | '}' | '!' | '"' | '#' | '%' | '&' | '\'' | ',' | '-' | '/' | ':' | ';' | '<' | '=' | '>' | '@' | '_' | '`' | '~') | // teh static string value would be the char in here unchanged, no escape in front of it needed.
    RegexSequence(STATIC_STRING+) | // sequence of static strings, the static string value would be the static string values from the children concatenated together
    RegexGroup(capture_ignored, STATIC_STRING, None) | // group with no lookahead, The static string value of this is the same as the static string value of the child. The grouping can be ignored
    RegexCharacterClass(false, [STATIC_STRING]) // single character in a character class (possibly escaped). The static string value of this is the same as the static string value of the child.

// A list of one or more static strings that can be used in a multi-match, we could dedupe them if we though it was helpful
CHOICE_LIST = STATIC_STRING | // the result list is a single value list with the static string value in it.
   RegexChoice(CHOICE_LIST, CHOICE_LIST) | // the result list is the children combined concatenated together
   RegexCharacterClass(false, [STATIC_STRING+]) // the result list are the static string values of the child 

We could try to get even fancier and deal with a RegexCharacterClass with only a handful of choices in it in the STATIC_STRING section, where we could enumerate all of them, in the context of a sequence, but that feels like to much for a first go at this.

@revans2 revans2 added ? - Needs Triage Need team to review and classify feature request New feature or request performance A performance related task/issue labels Jan 24, 2025
@revans2
Copy link
Collaborator Author

revans2 commented Jan 24, 2025

Please note that we should update the code for the single replacement case too, not just the multi-replacement case. We probably want to use the same code as I outlined here, but just pick a different operation depending on if the list has a length of 1 or 2+.

We might want to look into other places that might benefit from similar analysis, like with rlike, that can do a gpu-multi-contains. But that code is a little different and uses the following code for it.

/**
* Matches the given regex ast to a regex optimization type for regex rewrite
* optimization.
*
* @param ast Abstract Syntax Tree parsed from a regex pattern.
* @return The `RegexOptimizationType` for the given pattern.
*/
def matchSimplePattern(ast: RegexAST): RegexOptimizationType = {
val astLs = ast match {
case RegexSequence(_) => ast.children()
case _ => Seq(ast)
}
val noTailingWildcards = stripTailingWildcards(astLs)
if (noTailingWildcards.headOption.exists(
ast => ast == RegexChar('^') || ast == RegexEscaped('A'))) {
val possibleLiteral = noTailingWildcards.drop(1)
if (isLiteralString(possibleLiteral)) {
return RegexOptimizationType.StartsWith(RegexCharsToString(possibleLiteral))
}
}
val noStartsWithAst = removeBrackets(stripLeadingWildcards(noTailingWildcards))
// Check if the pattern is a contains literal pattern
if (isLiteralString(noStartsWithAst)) {
// literal or .*(literal).* => contains literal
return RegexOptimizationType.Contains(RegexCharsToString(noStartsWithAst))
}
// Check if the pattern is a multiple contains literal pattern (e.g. "abc|def|ghi")
if (noStartsWithAst.length == 1) {
val containsLiterals = getMultipleContainsLiterals(noStartsWithAst.head)
if (!containsLiterals.isEmpty) {
return RegexOptimizationType.MultipleContains(containsLiterals)
}
}
// Check if the pattern is a prefix range pattern (e.g. "abc[a-z]{3}")
val prefixRangeInfo = getPrefixRangePattern(noStartsWithAst)
if (prefixRangeInfo.isDefined) {
val (prefix, length, start, end) = prefixRangeInfo.get
// (literal[a-b]{x,y}) => prefix range pattern
return RegexOptimizationType.PrefixRange(prefix, length, start, end)
}
// return NoOptimization if the pattern is not a simple pattern and use cuDF
RegexOptimizationType.NoOptimization
}
}

So it would not match perfectly, but the concepts are similar.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Feb 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request performance A performance related task/issue
Projects
None yet
Development

No branches or pull requests

2 participants