refactor: Split hash aggregation logic into separated streams #22729
2010YOUY01 wants to merge 1 commit into
    impl Stream for PartialFinalHashAggregateStream {
        type Item = Result<RecordBatch>;

        fn poll_next(
The state machines are identical for now, but in follow-up work, such as skipping partial aggregation for high-cardinality inputs, their control flows will diverge. I think separating them improves clarity, as discussed in #22710.
Some duplication is inevitable, but that is the trade-off.
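To make the point about diverging control flows concrete, here is a minimal sketch of two separate state machines. Everything in it is hypothetical (the `PartialState`, `FinalState`, and `Event` types, the `SkippingAggregation` state, and the `SKIP_RATIO` threshold); it is not the code in this PR, only an illustration of why a shared state machine would become awkward once the partial stream gains a skip path.

```rust
/// Hypothetical events driving either stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// A batch was consumed; `groups` of its `rows` rows formed distinct groups.
    Batch { rows: usize, groups: usize },
    InputExhausted,
    OutputDrained,
}

/// Hypothetical states of a partial-aggregation stream.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PartialState {
    ReadingInput,
    /// Assumed follow-up state: forward rows without grouping them.
    SkippingAggregation,
    ProducingOutput,
    Done,
}

/// Hypothetical states of a final-aggregation stream (no skip path).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinalState {
    ReadingInput,
    ProducingOutput,
    Done,
}

/// Assumed threshold: skip when more than 80% of rows are distinct groups.
const SKIP_RATIO: f64 = 0.8;

impl PartialState {
    pub fn next(self, event: Event) -> Self {
        match (self, event) {
            // High-cardinality batch: stop aggregating in the partial stage.
            (Self::ReadingInput, Event::Batch { rows, groups })
                if rows > 0 && (groups as f64) / (rows as f64) > SKIP_RATIO =>
            {
                Self::SkippingAggregation
            }
            (Self::ReadingInput, Event::InputExhausted)
            | (Self::SkippingAggregation, Event::InputExhausted) => Self::ProducingOutput,
            (Self::ProducingOutput, Event::OutputDrained) => Self::Done,
            // Every other combination leaves the state unchanged.
            (state, _) => state,
        }
    }
}

impl FinalState {
    pub fn next(self, event: Event) -> Self {
        match (self, event) {
            (Self::ReadingInput, Event::InputExhausted) => Self::ProducingOutput,
            (Self::ProducingOutput, Event::OutputDrained) => Self::Done,
            // The final stage must always aggregate, so batches never change state.
            (state, _) => state,
        }
    }
}
```

With separate types, the skip transition lives only in `PartialState::next`, and the final stream cannot reach a skipping state by construction.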
cc @Dandandan @ariel-miculas @alamb, who have expressed interest before.
I'm curious about the high-level vision: is the plan to close #15591 in favor of this new approach? I would like the redesign of hash aggregation to take into account the memory constraints imposed by the finite memory pool, i.e. how the implementation performs under OOM conditions.
Otherwise we'll end up with the same issues that exist now. E.g. EmitTo::First(n) wasn't designed for emitting a large portion of the existing groups, so it over-allocated when it was used to emit early in the partial aggregation OOM case.
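As a rough illustration of the cost being described, the sketch below assumes group state lives in a single `Vec` and that emitting the first `n` entries is done with `split_off` plus a swap. The `emit_first` helper is hypothetical and simplified; it is a model of the pattern, not a quote of the existing implementation.

```rust
/// Emit the first `n` values from a contiguous buffer and keep the rest.
/// Simplified model; `emit_first` is a hypothetical helper.
pub fn emit_first<T>(values: &mut Vec<T>, n: usize) -> Vec<T> {
    // `split_off` allocates a brand-new Vec for the tail `[n, len)`,
    // while the original buffer keeps its full capacity.
    let mut rest = values.split_off(n);
    // After the swap, `values` owns the freshly allocated tail and
    // `rest` owns the original buffer, truncated to the first `n` values.
    std::mem::swap(values, &mut rest);
    rest
}
```

In this model the emitted batch still owns the original full-capacity buffer while a second allocation holds the remaining groups, so memory use goes up at exactly the moment the operator is trying to free memory.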
Yes, the goal is to support blocked state management. The existing challenge is that the current implementation is hard to extend and review. I want to clean things up through this refactor first, and then apply the actual change.
All of these issues are symptoms of managing state in a large contiguous |
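For contrast with a single contiguous buffer, here is a minimal sketch of what blocked state could look like. The `BlockedState` type and its methods are hypothetical and are not taken from this PR or from #15591; the sketch only shows the general idea that state is stored in fixed-size blocks, so a whole block can be handed off without copying or reallocating the remaining groups.

```rust
use std::collections::VecDeque;

/// Hypothetical blocked container: values are stored in fixed-size blocks.
pub struct BlockedState<T> {
    block_size: usize,
    blocks: VecDeque<Vec<T>>,
}

impl<T> BlockedState<T> {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, blocks: VecDeque::new() }
    }

    /// Append a value, starting a new block when the last one is full.
    pub fn push(&mut self, value: T) {
        let needs_block = self
            .blocks
            .back()
            .map_or(true, |block| block.len() == self.block_size);
        if needs_block {
            self.blocks.push_back(Vec::with_capacity(self.block_size));
        }
        if let Some(block) = self.blocks.back_mut() {
            block.push(value);
        }
    }

    /// Hand off the oldest block; the remaining blocks are left untouched.
    pub fn emit_block(&mut self) -> Option<Vec<T>> {
        self.blocks.pop_front()
    }

    /// Total number of values still held.
    pub fn len(&self) -> usize {
        self.blocks.iter().map(Vec::len).sum()
    }
}
```

Emitting here moves one block out of the queue, so the peak allocation during an emit is bounded by the block size rather than by the total number of groups.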
Which issue does this PR close?
Rationale for this change
See issues.
This PR splits out the partial and final aggregate streams from
GroupsHashAggregateStream.
To fully migrate hash aggregation, we have to
I think they should be left to follow-up PRs.
Todo in this PR:
enable_migration_aggregate to turn off this path
Since it would be a regression if the above features are not added, this also helps to prevent potential regressions from the migration of other aggregate streams.
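A switch like this could be sketched as a simple dispatch. Only the name `enable_migration_aggregate` comes from the description above; the `AggMode` and `StreamKind` enums and the `choose_stream` function are hypothetical stand-ins, and how the real option is read or wired in is not shown here.

```rust
/// Hypothetical aggregation modes covered by the split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggMode {
    Partial,
    Final,
}

/// Hypothetical stream choices: the existing stream or one of the split ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StreamKind {
    Legacy,
    SplitPartial,
    SplitFinal,
}

/// Pick a stream implementation; with the flag off, always use the legacy path.
pub fn choose_stream(enable_migration_aggregate: bool, mode: AggMode) -> StreamKind {
    if !enable_migration_aggregate {
        return StreamKind::Legacy;
    }
    match mode {
        AggMode::Partial => StreamKind::SplitPartial,
        AggMode::Final => StreamKind::SplitFinal,
    }
}
```

Keeping the fallback as the first branch means a single flag restores the old behaviour for every mode.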
What changes are included in this PR?
Split out the streams from GroupsHashAggregateStream.
Are these changes tested?
Are there any user-facing changes?