-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Project RecordBatch before evaluating case
#18055
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3026576
to
bf3d012
Compare
I hadn't considered physical expression serialisation in my implementation. Perhaps it would be better to hide the projection logic inside |
apache/arrow-rs#8591 should help mitigate the overhead of the |
bf3d012
to
1a1a148
Compare
@alamb this one is the next episode in my quest to squeeze more performance out of case. The TL;DR is that we project away unneeded columns from the record batch to avoid work during filtering. I thought this would be most elegant as a decorator expression, but that's immediately visible externally and as a consequence would need serialisation support. The alternative is that I pull this into CaseExpr, perhaps as a distinct (or wrapping) evaluation mode? Any opinion on the direction in which to proceed? |
LOVE IT! I will continue to help support you / review these PRs. I love a good set of performance optimizations I am also going to start tracking the collection here as a larger theme |
There is similar code for filtering here (namely that evaluates the filter expression first, and then only calles datafusion/datafusion/physical-plan/src/filter.rs Lines 644 to 668 in a8925f3
|
This touches on one of the things I was struggling with a bit working in the I already need the schema anyway in order to decide if it makes sense to project or not. One simple solution is to just keep a reference to that one. But things get a bit weird when a |
At somepoint in the past we discussed some sort of "precompile" step for PhysicalExprs -- which was invoked right before the plan started executing. The usecase as I recall was to compile regexp once (rather than per batch). Maybe it is time to do that now 🤔 There is more background here if you are interested |
Which issue does this PR close?
SELECT *, CASE ... END
#18056Rationale for this change
When
CaseExpr
callsPhysicalExpr::evaluate_selection
that function will filter the entire inputRecordBatch
before callingPhysicalExpr::evaluate
. This filtering filters all columns of theRecordBatch
, even ones that will not be accessed by thePhysicalExpr
. For wide record batches and narrow expressions it can be beneficial to project the record batch first to reduce the amount of wasted filtering work.What changes are included in this PR?
ProjectedExpr
type which projects incoming record batches and then evaluates a project version of the originalPhysicalExpr
Are these changes tested?
case
has been extendedAre there any user-facing changes?
No