Commit fb3beba
authored
refactor(physical-plan): extract make_group_column factory + eager init at try_new + tighten Time variants (#22751)
# Which issue does this PR close?
PR 1 of 5 from the split agreed on
#22706 (comment).
Related to #22682 / #22715 (full `GroupValuesColumn` type coverage).
Lays the dispatcher foundation; closes nothing on its own.
# Rationale for this change
Preparation for the nested-type `GroupColumn` work. No new builders
here. The goal is to refactor the per-field builder dispatch in
`GroupValuesColumn` so subsequent PRs that add `FixedSizeList` /
`Struct` / `List` / `LargeList` support can plug into one well-defined
factory instead of growing the inline match in
`GroupValuesColumn::intern`. Also includes two adjacent correctness
fixes around the `Time32` / `Time64` variants that came up in the
upstream thread.
# What changes are included in this PR
1. **Factory extraction.** The inline match that maps each schema field
to a `Box<dyn GroupColumn>` builder moves out of
`GroupValuesColumn::intern` into a free function
`make_group_column(field: &Field) -> Result<Box<dyn GroupColumn>>`.
Subsequent nested-type specializations can recursively call this factory
for child field construction without enumerating every combination
inline.
2. **Eager construction at `try_new`.** Per @2010YOUY01's review, the
per-field builder vector is now built in the constructor via a private
`build_group_columns` helper. `emit(EmitTo::All)` uses `mem::replace` to
swap in a fresh vector after draining the old one; `clear_shrink`
rebuilds the same way. The post-condition `self.group_values.len() ==
self.schema.fields().len()` holds across the aggregator's lifetime, so
`intern` no longer carries a lazy-init branch. Unsupported schemas now
fail fast at `try_new` rather than at the first `intern` call. In
production this changes nothing because `new_group_values` in
`aggregates/group_values/mod.rs` only calls `GroupValuesColumn::try_new`
after `multi_group_by::supported_schema` returns true.
3. **`Time32` / `Time64` `supported_type` alignment.** Previously
`supported_type` matched `Time32(_)` (admitting the invalid Microsecond
/ Nanosecond combinations) and did not match `Time64(_)` at all, while
the dispatcher accepted `Time32(Second / Millisecond)` and
`Time64(Microsecond / Nanosecond)`. Tighten `supported_type` to the
exact set the dispatcher constructs. The dispatcher's wildcard arms for
invalid `Time` variants now return `not_impl_err` instead of silently
producing an empty builder vector.
4. **`supported_type` ↔ `make_group_column` consistency fuzz.** New unit
test `supported_type_and_make_group_column_stay_in_sync` iterates a
representative set of 20 supported and 6 unsupported `DataType` values
and asserts the biconditional. Pins the alignment so future contributors
who add a type to one side without the other trip a unit test
immediately.
# What this PR is NOT doing
- No new `GroupColumn` builders. `FixedSizeList`, `Struct`, `List`,
`LargeList` come in subsequent PRs of the #22682 sequence.
- No new types in `supported_type`'s allow-list (the `Time` tightening
removes invalid combinations rather than adding new ones).
- No external dependencies. Per @alamb's review the earlier `dhat-heap`
feature and the `dhat` dependency were dropped; memory savings in
follow-up PRs will be measured via `GroupColumn::size()` head-to-head
against `GroupValuesRows`, matching the existing
`column_path_uses_less_memory_than_rows_for_*` tests pattern in #22706.
# Are these changes tested
- `cargo test -p datafusion-physical-plan --lib
aggregates::group_values`: 30 tests pass (27 existing + 3 new: the
consistency fuzz, the mixed-schema rejection, and the `try_new`-time
NotImpl propagation).
- `cargo test -p datafusion-physical-plan --lib aggregates`: 105 tests
pass (no regression in the broader aggregate suite).
- `cargo clippy -p datafusion-physical-plan --lib --tests -- -D
warnings`: clean.
- `cargo fmt --check`: clean.
# Are there any user-facing changes
No semantic change for any schema that already used `GroupValuesColumn`
on main. The factory is the same dispatch logic pulled into a function;
the `Time` changes only affect schemas that are semantically invalid
Arrow types anyway; the eager construction surfaces `not_impl_err` from
a different call site for those defensive paths.1 parent d23321d commit fb3beba
1 file changed
Lines changed: 331 additions & 202 deletions
0 commit comments