Slice Search Benchmark Analysis #14
Open
geoffreyclaude wants to merge 1 commit into
Benchmarks membership testing strategies (linear, binary search, hashset, branchless) across data types to find optimal algorithm cutoffs for SQL IN (list) processing. Key insight: a const-generic branchless approach using bitwise OR folding enables the compiler to fully unroll loops and auto-vectorize with SIMD, beating both linear and binary search for small numeric slices. The optimal cutoff correlates inversely with type size (i8: 128, i16: 64, i32: 32, i64: 16) because smaller types pack more elements per SIMD register. Surprising finding: binary search is never optimal in batch scenarios — branchless wins at small sizes via SIMD, hashset wins at large sizes with O(1) vs O(log n). This suggests DataFusion should use branchless for small numeric IN lists and hashset otherwise, skipping binary search entirely.
This document analyzes the performance of four search strategies for membership testing in sorted slices, typical of SQL IN (list) processing.
Benchmark Results
Results are stored per-CPU in subfolders. See:
- Apple_M1_Max/slice_search.png: visual comparison
- Apple_M1_Max/CUTOFFS.md: recommended algorithm cutoffs
Benchmark Configuration
Data Types Tested
i8, i16, i32, i64, i128, str
Search Methods
- Linear (slice.contains()): O(n) scan
- Binary search (slice.binary_search()): O(log n)
- HashSet (HashSet::contains()): O(1) amortized
- Branchless: const-generic bitwise OR fold
Key Findings
1. Branchless Dominates for Small Numeric Slices
The const-generic branchless implementation wins decisively for small numeric types.
Why? The compiler knows the exact array size at compile time, enabling full loop unrolling and SIMD auto-vectorization.
2. HashSet Wins for Larger Slices
Once slice size exceeds the branchless threshold, HashSet's O(1) lookup dominates:
3. Binary Search is Never Optimal
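Surprising finding: Binary search is never the best choice in batch scenarios. Branchless wins at small sizes via SIMD, and HashSet wins at large sizes with O(1) versus O(log n) lookups.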
Binary search only makes sense for single lookups where HashSet construction isn't amortized.
4. Strings Behave Differently
Without branchless (requires the Copy trait), strings show:
Recommended Algorithm Selection
Simplified Decision Tree
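A minimal sketch of the selection logic, assuming the Apple M1 Max cutoffs quoted in the PR description (i8: 128, i16: 64, i32: 32, i64: 16). The names `Strategy` and `choose_strategy` are hypothetical, the inclusive comparison is an assumption, and types without a stated cutoff (i128, strings) fall back to HashSet here purely for illustration:

```rust
/// Strategy labels used only in this sketch (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    Branchless,
    HashSetLookup,
}

/// Pick a strategy from the element width in bits and the IN-list length.
/// Cutoffs are the M1 Max values quoted in this document; other CPUs may differ.
fn choose_strategy(type_bits: u32, len: usize) -> Strategy {
    // Assumption: widths without a quoted cutoff get no branchless path.
    let cutoff: Option<usize> = match type_bits {
        8 => Some(128),
        16 => Some(64),
        32 => Some(32),
        64 => Some(16),
        _ => None,
    };
    match cutoff {
        // Assumption: the cutoff is treated as inclusive.
        Some(c) if len <= c => Strategy::Branchless,
        _ => Strategy::HashSetLookup,
    }
}
```

Binary search does not appear as a branch, in line with the finding that it is never optimal in batch scenarios.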
Why These Cutoffs?
SIMD Register Capacity
The branchless cutoff correlates inversely with type size (i8: 128, i16: 64, i32: 32, i64: 16).
Smaller types pack more elements per SIMD register, extending the branchless advantage.
Empirical Formula
The M1 Max data reveals a consistent relationship:
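The relationship can be reconstructed from the cutoffs quoted in the PR description (i8: 128, i16: 64, i32: 32, i64: 16) and the 8× multiplier discussed in this section. The symbols are chosen here for illustration, with both widths measured in bits:

```latex
\text{cutoff} \approx 8 \times \frac{W_{\text{SIMD}}}{W_{\text{type}}}
```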
For example, with NEON's 128-bit registers and i32 (32 bits), one register holds 128 / 32 = 4 elements, and 4 × 8 = 32, which matches the measured i32 cutoff.
The multiplier of 8 represents approximately how many SIMD operations can execute before HashSet's O(1) lookup (with its hashing and memory indirection overhead) becomes faster.
Predicted Cutoffs by Platform
Applying the formula to different SIMD widths:
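As a rough sketch, the rule of thumb (cutoff ≈ 8 × register bits / element bits, reconstructed from the M1 Max cutoffs) can be applied to other register widths. The function names are hypothetical, and the outputs for 256-bit (AVX2) and 512-bit (AVX-512) registers are predictions, not measurements:

```rust
/// Reconstructed rule of thumb: cutoff = 8 * (SIMD register bits / element bits).
/// The multiplier of 8 comes from the M1 Max data and is unverified elsewhere.
fn predicted_cutoff(simd_bits: u32, type_bits: u32) -> u32 {
    const MULTIPLIER: u32 = 8;
    (simd_bits / type_bits) * MULTIPLIER
}

/// Predicted cutoffs for i8, i16, i32 and i64 (in that order)
/// at a given SIMD register width.
fn predicted_row(simd_bits: u32) -> [u32; 4] {
    [8, 16, 32, 64].map(|type_bits| predicted_cutoff(simd_bits, type_bits))
}
```

Calling `predicted_row(128)` reproduces the M1 Max cutoffs; rows for wider registers should be treated as hypotheses until benchmarked.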
Note: Intel/AMD CPUs with AVX2 or AVX-512 have not yet been benchmarked. Running this benchmark suite on x86 hardware would validate the empirical formula and confirm whether the 8× multiplier holds across architectures. To benchmark with wider SIMD enabled:
HashSet Overhead
HashSet has fixed per-lookup overhead, including hashing and memory indirection.
This overhead is amortized at larger sizes but dominates at small sizes.
Implementation Notes
Branchless Check
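A minimal sketch of a const-generic branchless check using a bitwise OR fold, as described in the PR summary. This is an illustration rather than the PR's actual code, and the name `branchless_contains` is hypothetical:

```rust
/// Branchless membership test over a fixed-size array.
/// Every element is compared and the results are OR-ed together,
/// so there is no early exit and no data-dependent branch.
#[inline]
fn branchless_contains<T: Copy + PartialEq, const N: usize>(
    haystack: &[T; N],
    needle: T,
) -> bool {
    haystack
        .iter()
        // `|` on bool is a non-short-circuiting OR, unlike `||`.
        .fold(false, |found, &value| found | (value == needle))
}
```

The `Copy` bound mirrors the document's remark that the branchless path is unavailable for strings.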
The const N parameter enables the compiler to fully unroll and vectorize.
Batch Processing
All benchmarks process 8,192 lookups per iteration, matching typical Arrow array batch sizes. This amortizes one-time setup costs such as HashSet construction.
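The batch shape can be sketched as follows for the HashSet strategy: the set is built once and then probed for every value in the batch. The function name `batch_contains` and the use of plain slices instead of Arrow arrays are assumptions made to keep the example self-contained:

```rust
use std::collections::HashSet;

/// Build the set once, then test every value in the batch against it.
/// The construction cost is paid a single time per batch, not per lookup.
fn batch_contains(in_list: &[i32], batch: &[i32]) -> Vec<bool> {
    let set: HashSet<i32> = in_list.iter().copied().collect();
    batch.iter().map(|value| set.contains(value)).collect()
}
```

With 8,192 probes per batch, the one-time construction is spread over many lookups, which is why the single-lookup case is treated separately in the findings.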
Reproducing Results
Results are automatically saved to a CPU-specific subfolder (e.g., results/Apple_M1_Max/).
Conclusion
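For DataFusion's IN LIST processing, the results suggest using branchless for small numeric lists and HashSet otherwise, skipping binary search entirely.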
The branchless approach provides 2–10× speedup over alternatives for small numeric slices, making it the clear winner for the common case of IN lists with few elements.