Refactor InListExpr to store arrays and support structs #18449
base: main
// TODO: serialize the inner ArrayRef directly to avoid materialization into literals
// by extending the protobuf definition to support both representations and adding a public
// accessor method to InListExpr to get the inner ArrayRef
I'll create a followup issue once we merge this
  05)--------ProjectionExec: expr=[]
  06)----------CoalesceBatchesExec: target_batch_size=8192
- 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c])
+ 07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN (SET) ([7f4b18de3cfeb9b4ac78c381ee2ad278, a, b, c])
This is because we now support Utf8View for building the sets 😄
let random_state = RandomState::with_seed(0);
let mut hashes_buf = vec![0u64; array.len()];
let Ok(hashes_buf) = create_hashes_from_arrays(
    &[array.as_ref()],
    &random_state,
    &mut hashes_buf,
) else {
    unreachable!("Failed to create hashes for InList array. This shouldn't happen because make_set should have succeeded earlier.");
};
hashes_buf.hash(state);
We could pre-compute and store a hash: u64, which would both be more performant when Hash is called and avoid this error path, but it would add complexity and some overhead when building the InListExpr.
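To make the trade-off in this comment concrete, here is a minimal sketch of the pre-computed-hash alternative. It is not the PR's code: `StaticList`, its fields, and the `hash_of` helper are hypothetical, a `Vec<i64>` stands in for the Arrow array, and the standard library's `DefaultHasher` stands in for the row-hashing utilities.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the array-backed list inside an IN-list expression.
struct StaticList {
    values: Vec<i64>,
    /// Hash of `values`, computed once when the list is built.
    precomputed_hash: u64,
}

impl StaticList {
    fn new(values: Vec<i64>) -> Self {
        // The hashing cost is paid here, at construction time.
        let mut hasher = DefaultHasher::new();
        values.hash(&mut hasher);
        let precomputed_hash = hasher.finish();
        Self { values, precomputed_hash }
    }

    fn values(&self) -> &[i64] {
        &self.values
    }
}

impl Hash for StaticList {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // No per-call work and no fallible step: feed the cached value.
        state.write_u64(self.precomputed_hash);
    }
}

/// Helper that hashes any value with a fresh, deterministic hasher.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

In this sketch, two lists with equal contents hash identically, and `Hash::hash` can no longer fail. The cost is an extra field and a full pass over the values every time a list is constructed, which is the overhead the comment mentions.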
Change create_hashes and related functions to work with &dyn Array references instead of requiring ArrayRef (Arc-wrapped arrays). This avoids unnecessary Arc::clone() calls and enables callers that only have an &dyn Array to use the hashing utilities.

Changes:
- Add create_hashes_from_arrays(&[&dyn Array]) function
- Refactor hash_dictionary, hash_list_array, hash_fixed_list_array to use references instead of cloning
- Extract hash_single_array() helper for common logic
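The shape of this refactor can be illustrated without Arrow. In the sketch below, the `Column` trait, `Int64Column`, and both function names are hypothetical stand-ins for `Array`, a concrete array type, and the hashing entry points. The point is only that the reference-based function is the primitive, and the `Arc`-based one borrows from its inputs instead of cloning them.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for Arrow's `Array` trait.
trait Column {
    fn num_rows(&self) -> usize;
    fn hash_row(&self, row: usize) -> u64;
}

struct Int64Column(Vec<i64>);

impl Column for Int64Column {
    fn num_rows(&self) -> usize {
        self.0.len()
    }
    fn hash_row(&self, row: usize) -> u64 {
        // Toy per-row hash; a real implementation would use a seeded hasher.
        self.0[row] as u64
    }
}

/// Reference-based primitive: works for any caller holding `&dyn Column`.
fn hash_rows_from_refs(columns: &[&dyn Column], hashes: &mut [u64]) {
    for col in columns {
        for (row, h) in hashes.iter_mut().enumerate() {
            // Combine this column's row hash into the running hash.
            *h = h.wrapping_mul(31).wrapping_add(col.hash_row(row));
        }
    }
}

/// Arc-based wrapper: borrows each column, so no `Arc::clone()` is needed.
fn hash_rows(columns: &[Arc<dyn Column>], hashes: &mut [u64]) {
    let refs: Vec<&dyn Column> = columns.iter().map(|c| c.as_ref()).collect();
    hash_rows_from_refs(&refs, hashes);
}
```

A caller that owns only a borrowed column can now call `hash_rows_from_refs` directly, while existing `Arc`-based callers keep their signature.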
Changes:
- Enhance InListExpr to efficiently store homogeneous lists as arrays and avoid a conversion to Vec<PhysicalExpr> by adding an internal InListStorage enum with Array and Exprs variants
- Re-use existing hashing and comparison utilities to support Struct arrays and other complex types
- Add public function `in_list_from_array(expr, list_array, negated)` for creating InList from arrays
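A rough sketch of the two-variant storage idea follows. Only the names `InListStorage`, `Array`, and `Exprs` come from the description above. Everything else (the `Expr` type, the fields, the `i64`-only values, the `in_list` helper, and the absence of null handling) is a simplifying assumption made for illustration, not the PR's implementation.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified expression type standing in for physical expressions.
enum Expr {
    Literal(i64),
    Column(usize),
}

impl Expr {
    fn eval(&self, row: &[i64]) -> i64 {
        match self {
            Expr::Literal(v) => *v,
            Expr::Column(i) => row[*i],
        }
    }
}

enum InListStorage {
    /// Homogeneous constant list kept in columnar form (a `Vec` stands in for
    /// an Arrow array), plus a set built once for fast membership checks.
    Array { values: Vec<i64>, set: HashSet<i64> },
    /// General list whose members must be evaluated per row.
    Exprs(Vec<Expr>),
}

impl InListStorage {
    /// Build directly from values, with no detour through expressions.
    fn from_array(values: Vec<i64>) -> Self {
        let set = values.iter().copied().collect();
        InListStorage::Array { values, set }
    }

    /// Use the array form when every member is a literal, else keep expressions.
    fn new(list: Vec<Expr>) -> Self {
        let literals: Option<Vec<i64>> = list
            .iter()
            .map(|e| match e {
                Expr::Literal(v) => Some(*v),
                Expr::Column(_) => None,
            })
            .collect();
        match literals {
            Some(values) => Self::from_array(values),
            None => InListStorage::Exprs(list),
        }
    }

    fn len(&self) -> usize {
        match self {
            InListStorage::Array { values, .. } => values.len(),
            InListStorage::Exprs(exprs) => exprs.len(),
        }
    }

    fn contains(&self, value: i64, row: &[i64]) -> bool {
        match self {
            InListStorage::Array { set, .. } => set.contains(&value),
            InListStorage::Exprs(exprs) => exprs.iter().any(|e| e.eval(row) == value),
        }
    }
}

/// `negated` flips the result, mirroring NOT IN (nulls are ignored here).
fn in_list(storage: &InListStorage, value: i64, row: &[i64], negated: bool) -> bool {
    storage.contains(value, row) != negated
}
```

With this layout, a caller that already holds the values can go through `from_array` and skip building one expression per element, which is the conversion the first bullet says the PR avoids.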
Background
This PR is part of an EPIC to push down hash table references from HashJoinExec into scans. The EPIC is tracked in #17171.
A "target state" is tracked in #18393.
There is a series of PRs to get us to this target state in smaller, more reviewable changes that are still valuable on their own:
- … HashJoinExec and use CASE expressions for more precise filters #18451

Changes in this PR
- … by adding an internal InListStorage enum with Array and Exprs variants
- … in_list_from_array(expr, list_array, negated) for creating InList from arrays

Although the diff looks large, most of it is tests and docs. I think the actual code change is a negative LOC change, or at least a reduction in complexity (it eliminates a trait, a macro, and matching on data types).