feat: parallel sort-preserving merge (PSRS) for >2x merge speedup (demo)#23124
feat: parallel sort-preserving merge (PSRS) for >2x merge speedup (demo)#23124Dandandan wants to merge 1 commit into
Conversation
Add `ParallelSortPreservingMergeExec` (`sorts/parallel_merge.rs`), a parallel order-preserving merge of sorted partitions using Parallel Sorting by Regular Sampling (PSRS): materialize each locally-sorted input, pick `P-1` pivots by regular sampling, cut every run by the same pivots (binary search on byte-comparable row encodings), then merge the `P` key-range buckets concurrently (one `SpawnedTask` each, reusing the loser-tree `StreamingMergeBuilder`) and concatenate the buckets in order. Output is a single globally-sorted partition — same contract as `SortPreservingMergeExec`, computed with `target_partitions`-way parallelism. A new `JoinSelection`-style physical-optimizer rule `ParallelSortMerge` (`parallel_sort_merge.rs`, run after `PushdownSort`) replaces an eligible `SortPreservingMergeExec` with the parallel exec, gated by `datafusion.optimizer.enable_parallel_sort_merge` (default true). Eligibility: no fetch/limit, bounded input, >1 input partition, and a *known* row count >= `batch_size * target_partitions` (unknown-size inputs keep the serial merge to avoid regressing small data). It is pipeline-breaking (materializes all input, no early-stop), so it never honors a pushed-down limit and is restricted to bounded inputs. Measured: ~2.36x on the `sort_preserving_merge` bench (1M rows x 3 partitions, u64); ~2-2.5x on sort_tpch (Q1 2.54x, Q2 2.31x, Q10 1.93x). A `--parallel-merge` A/B flag is added to the sort_tpch benchmark. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
run benchmark sort_tpch |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/parallel-sort-preserving-merge (611030b) to dc92bb8 (merge-base) diff using: sort_tpch File an issue against this benchmark runner |
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagesort_tpch — base (merge-base)
sort_tpch — branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
SortPreservingMergeExecperforms a single-threaded k-way merge of all inputpartitions on one thread. For merge-bound sort queries (e.g.
ORDER BYover manysorted partitions) this leaves most cores idle while one thread does all the work.
This PR adds a parallel order-preserving merge that splits the merge across
target_partitionsthreads, giving a large speedup on merge-bound queries:sort_preserving_mergemicrobenchmark (1M rows × 3partitions, single u64 key): 33.7 ms → 14.3 ms.
What changes are included in this PR?
ParallelSortPreservingMergeExec(sorts/parallel_merge.rs) — a parallelorder-preserving merge using Parallel Sorting by Regular Sampling (PSRS):
sort keys encoded on demand with a single shared
RowConverterso all keysare byte-comparable).
P-1pivots.lower_bound).Pkey-range buckets concurrently — oneSpawnedTaskeach,reusing the existing optimized loser-tree
StreamingMergeBuilder.output contract as
SortPreservingMergeExec.Correctness does not depend on balanced pivots; regular sampling only affects
load balance. Output partitioning is
UnknownPartitioning(1).ParallelSortMergephysical-optimizer rule (parallel_sort_merge.rs, runsafter
PushdownSort) that replaces an eligibleSortPreservingMergeExecwiththe parallel exec. Gated by
datafusion.optimizer.enable_parallel_sort_merge(default
true). Eligible iff: no fetch/limit, bounded input,> 1inputpartition, and a known row count
>= batch_size * target_partitions(unknown-size inputs keep the serial merge to avoid regressing small data).
A
--parallel-mergeA/B flag on thesort_tpchbenchmark, a parallel variantin the
sort_preserving_mergecriterion bench, and regeneratedconfigs.md.Limitations
It is pipeline-breaking: it materializes all of its input and emits nothing
until every bucket is merged, so it never honors a pushed-down fetch/limit and is
restricted to bounded inputs. Peak memory ≈ full input + output (tracked via
MemoryReservations, not currently spillable). Queries needing early terminationkeep using
SortPreservingMergeExec. Marked draft to gather feedback on thedefault and the memory trade-off.
Are these changes tested?
Yes:
sorts::parallel_merge): output matchesSortPreservingMergeExecfor unique keys, heavy low-cardinality ties, descending + nulls, uneven/empty
partitions, more-buckets-than-rows, and single-partition passthrough.
on by default; EXPLAIN plans in a few files now show
ParallelSortPreservingMergeExec(results are unchanged — the output isidentical to the serial merge).
Are there any user-facing changes?
datafusion.optimizer.enable_parallel_sort_merge(defaulttrue).EXPLAINshowsParallelSortPreservingMergeExecwhere the rule applies, andsuch plans avoid the single-threaded merge bottleneck (at the cost of
materializing the input). No public API breakage.
🤖 Generated with Claude Code