-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Extend support for static string matching on regex_replace #12023
Comments
Please note that we should update the code for the single replacement case too, not just the multi-replacement case. We probably want to use the same code as I outlined here, but just pick a different operation depending on if the list has a length of 1 or 2+. We might want to look into other places that might benefit from similar analysis, like with rlike, that can do a gpu-multi-contains. But that code is a little different and uses the following code for it. spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala Lines 2108 to 2156 in c82a2ea
So it would not match perfectly, but the concepts are similar. |
Is your feature request related to a problem? Please describe.
When we convert a
regex_replace
call to run on the GPU we parse the regular expression and transpile it to be able to run on the CUDF regular expression engine. We also include a performance optimization where we can look to see if we could extract out a list of static strings to match instead, because this tends to be much more efficient.The code for this is very restrictive because it relies on converting part of the regular expression back to a string and checking it against a list of known bad character sequences.
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Lines 597 to 601 in c82a2ea
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Lines 449 to 453 in c82a2ea
This is correct, but also too restrictive.
Like if I wanted to match the '{' character it would not work because '{' is in the disallow list, even if I escaped it first.
I don't understand why we are doing this when we have the tokenized regex already that is used to do the checks.
spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala
Lines 1127 to 1158 in c82a2ea
If we know all of the types a
RegexAST
can be, then we don't need to convert them back to a string. We can just do pattern matching and know for sure that this will work or not. We don't have to be complete about this in the first go at it. We could use this to augment/extend the existing code until we have a robust solution.Here is my initial guess at fairly complete syntax that we could/should support (in a YACC like pseudo code pattern).
We could try to get even fancier and deal with a RegexCharacterClass with only a handful of choices in it in the STATIC_STRING section, where we could enumerate all of them, in the context of a sequence, but that feels like to much for a first go at this.
The text was updated successfully, but these errors were encountered: