Skip to content

Commit 73cb7e8

Browse files
committed
refactor(physical-plan): extract make_group_column factory + tighten Time variants + dhat harness
Preparation PR for #22682 / #22715 (full GroupValuesColumn type coverage). No new builders; this PR only refactors the dispatch so subsequent PRs that add FixedSizeList / Struct / List support can plug into one well-defined factory instead of growing the inline match in `GroupValuesColumn::intern`. Includes two adjacent fixes adriangb asked for in the upstream PR thread. ## What changed 1. **Factory extraction.** The inline match that maps each schema field to a `Box<dyn GroupColumn>` builder moves out of `GroupValuesColumn::intern` into a free function `make_group_column(field: &Field) -> Result<Box<dyn GroupColumn>>`. `intern` is now a one-liner loop that delegates per-field. Future nested-type specializations recursively call this factory for child field construction without enumerating every combination inline. 2. **Time32 / Time64 supported_type alignment.** Previously `supported_type` matched `Time32(_)` (admitting the invalid Microsecond / Nanosecond combinations) and did not match `Time64(_)` at all, while the dispatcher accepted `Time32(Second / Millisecond)` and `Time64(Microsecond / Nanosecond)`. Tighten `supported_type` to the exact set the dispatcher constructs. The dispatcher's wildcard arms for invalid Time variants now return `not_impl_err` instead of silently doing nothing, so a schema that somehow reaches the dispatcher fails loudly rather than producing an empty builder vector. 3. **supported_type ↔ make_group_column consistency fuzz.** New unit test `supported_type_and_make_group_column_stay_in_sync` iterates a representative set of 20 supported and 6 unsupported `DataType`s and asserts the biconditional. Pins the alignment so future contributors who add a type to one side without the other trip a unit test immediately. 4. **dhat heap regression harness behind `dhat-heap` feature.** Adds an optional `dhat` dependency, a new `dhat-heap` feature in `physical-plan/Cargo.toml`, a feature-gated global allocator in `lib.rs`, and a `dhat_tests` submodule that wraps a representative `GroupValuesColumn::intern` workload in `dhat::Profiler` and asserts peak heap stays under a budget. Default `cargo test` does not compile or run dhat. Subsequent type-support PRs should lower the budget to reflect their savings. ## Testing - `cargo test -p datafusion-physical-plan --lib aggregates::group_values` passes 30 tests (27 existing + 3 new: the consistency fuzz, the mixed-schema rejection, and the NotImpl propagation). - `cargo test -p datafusion-physical-plan --lib aggregates` passes 105 tests (no regression in the broader aggregate suite). - `cargo test -p datafusion-physical-plan --features dhat-heap aggregates::group_values::multi_group_by::dhat_tests` passes 1 test with peak heap well under the 1 MiB budget on a 1k-row workload. - `cargo clippy -p datafusion-physical-plan --lib --tests --features dhat-heap -- -D warnings` is clean. ## What this PR is NOT doing - No new `GroupColumn` builders. `FixedSizeList`, `Struct`, `List`, `LargeList` come in subsequent PRs of the #22682 sequence. - No new types in `supported_type`'s allow-list (the Time tightening removes invalid combinations rather than adding new ones). - No behavior change for any schema that already used `GroupValuesColumn` on main: the factory is the same dispatch logic pulled into a function, and the Time changes only affect schemas that are semantically invalid Arrow types anyway. This is PR 1 of the 5-PR split discussed at #22706 (comment).
1 parent f033199 commit 73cb7e8

4 files changed

Lines changed: 456 additions & 174 deletions

File tree

Cargo.lock

Lines changed: 74 additions & 3 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

datafusion/physical-plan/Cargo.toml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,11 @@ force_hash_collisions = []
4242
test_utils = ["arrow/test_utils"]
4343
tokio_coop = []
4444
tokio_coop_fallback = []
45+
# Enables the dhat-based heap regression tests under
46+
# `aggregates::group_values::multi_group_by`. Off by default so the
47+
# normal test suite does not pay the profiler overhead. Run with
48+
# `cargo test -p datafusion-physical-plan --features dhat-heap`.
49+
dhat-heap = ["dep:dhat"]
4550
# Enables `PhysicalExpr::try_to_proto` / `try_from_proto` hooks on the
4651
# physical expressions defined in this crate (e.g. `HashExpr`). Off by
4752
# default so consumers that never serialize plans pay nothing.
@@ -76,6 +81,9 @@ datafusion-physical-expr = { workspace = true, default-features = true }
7681
datafusion-physical-expr-common = { workspace = true }
7782
datafusion-proto-common = { workspace = true, optional = true }
7883
datafusion-proto-models = { workspace = true, optional = true }
84+
# Optional heap profiler used by the `dhat-heap` feature for memory
85+
# regression tests on `GroupValuesColumn`.
86+
dhat = { version = "0.3", optional = true }
7987
futures = { workspace = true }
8088
half = { workspace = true }
8189
hashbrown = { workspace = true }

0 commit comments

Comments
 (0)