Skip to content

Commit fb3beba

Browse files
authored
refactor(physical-plan): extract make_group_column factory + eager init at try_new + tighten Time variants (#22751)
# Which issue does this PR close? PR 1 of 5 from the split agreed on #22706 (comment). Related to #22682 / #22715 (full `GroupValuesColumn` type coverage). Lays the dispatcher foundation; closes nothing on its own. # Rationale for this change Preparation for the nested-type `GroupColumn` work. No new builders here. The goal is to refactor the per-field builder dispatch in `GroupValuesColumn` so subsequent PRs that add `FixedSizeList` / `Struct` / `List` / `LargeList` support can plug into one well-defined factory instead of growing the inline match in `GroupValuesColumn::intern`. Also includes two adjacent correctness fixes around the `Time32` / `Time64` variants that came up in the upstream thread. # What changes are included in this PR 1. **Factory extraction.** The inline match that maps each schema field to a `Box<dyn GroupColumn>` builder moves out of `GroupValuesColumn::intern` into a free function `make_group_column(field: &Field) -> Result<Box<dyn GroupColumn>>`. Subsequent nested-type specializations can recursively call this factory for child field construction without enumerating every combination inline. 2. **Eager construction at `try_new`.** Per @2010YOUY01's review, the per-field builder vector is now built in the constructor via a private `build_group_columns` helper. `emit(EmitTo::All)` uses `mem::replace` to swap in a fresh vector after draining the old one; `clear_shrink` rebuilds the same way. The post-condition `self.group_values.len() == self.schema.fields().len()` holds across the aggregator's lifetime, so `intern` no longer carries a lazy-init branch. Unsupported schemas now fail fast at `try_new` rather than at the first `intern` call. In production this changes nothing because `new_group_values` in `aggregates/group_values/mod.rs` only calls `GroupValuesColumn::try_new` after `multi_group_by::supported_schema` returns true. 3. **`Time32` / `Time64` `supported_type` alignment.** Previously `supported_type` matched `Time32(_)` (admitting the invalid Microsecond / Nanosecond combinations) and did not match `Time64(_)` at all, while the dispatcher accepted `Time32(Second / Millisecond)` and `Time64(Microsecond / Nanosecond)`. Tighten `supported_type` to the exact set the dispatcher constructs. The dispatcher's wildcard arms for invalid `Time` variants now return `not_impl_err` instead of silently producing an empty builder vector. 4. **`supported_type` ↔ `make_group_column` consistency fuzz.** New unit test `supported_type_and_make_group_column_stay_in_sync` iterates a representative set of 20 supported and 6 unsupported `DataType` values and asserts the biconditional. Pins the alignment so future contributors who add a type to one side without the other trip a unit test immediately. # What this PR is NOT doing - No new `GroupColumn` builders. `FixedSizeList`, `Struct`, `List`, `LargeList` come in subsequent PRs of the #22682 sequence. - No new types in `supported_type`'s allow-list (the `Time` tightening removes invalid combinations rather than adding new ones). - No external dependencies. Per @alamb's review the earlier `dhat-heap` feature and the `dhat` dependency were dropped; memory savings in follow-up PRs will be measured via `GroupColumn::size()` head-to-head against `GroupValuesRows`, matching the existing `column_path_uses_less_memory_than_rows_for_*` tests pattern in #22706. # Are these changes tested - `cargo test -p datafusion-physical-plan --lib aggregates::group_values`: 30 tests pass (27 existing + 3 new: the consistency fuzz, the mixed-schema rejection, and the `try_new`-time NotImpl propagation). - `cargo test -p datafusion-physical-plan --lib aggregates`: 105 tests pass (no regression in the broader aggregate suite). - `cargo clippy -p datafusion-physical-plan --lib --tests -- -D warnings`: clean. - `cargo fmt --check`: clean. # Are there any user-facing changes No semantic change for any schema that already used `GroupValuesColumn` on main. The factory is the same dispatch logic pulled into a function; the `Time` changes only affect schemas that are semantically invalid Arrow types anyway; the eager construction surfaces `not_impl_err` from a different call site for those defensive paths.
1 parent d23321d commit fb3beba

1 file changed

Lines changed: 331 additions & 202 deletions

File tree

  • datafusion/physical-plan/src/aggregates/group_values/multi_group_by

0 commit comments

Comments
 (0)