Add `PrimitiveDistinctCountGroupsAccumulator` #15985

Dandandan · 2025-05-07T20:55:49Z

Which issue does this PR close?

Closes #.

Rationale for this change

Speed up queries that combine `GROUP BY` with a distinct count on primitive types.

The original code is taken from @waynexia, plus a change to use `HashTable`.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?


Dandandan · 2025-05-08T14:11:39Z

This gets a small performance boost on ClickBench query 9 (~9% on my end).

I am actually wondering if we can go further. I think we could store something like `HashSet<(T::Native, usize)>` (unique value + group id) instead of `Vec<HashSet<T::Native>>` (one hash set per group), and delay counting the values until the end by iterating over all the entries (instead of calling `.len()`).

The "obvious" advantage is that we avoid creating many hash sets in high-cardinality cases, which hurts both performance and memory usage.

However, it seems kind of tricky to integrate this into the current `GroupsAccumulator` setup 🤔
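To make the two layouts concrete, here is a minimal standalone sketch using only the Rust standard library. It is not the code in this PR and does not touch the `GroupsAccumulator` trait: the function names, the fixed `i64` value type (standing in for `T::Native`), and the plain slices for values and group ids are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Layout 1 (hypothetical helper): one hash set per group,
/// i.e. `Vec<HashSet<T::Native>>` with `i64` standing in for `T::Native`.
/// The distinct count of a group is simply the size of its set.
fn count_distinct_per_group_sets(
    values: &[i64],
    group_ids: &[usize],
    num_groups: usize,
) -> Vec<i64> {
    let mut sets: Vec<HashSet<i64>> = vec![HashSet::new(); num_groups];
    for (&value, &group) in values.iter().zip(group_ids) {
        sets[group].insert(value);
    }
    sets.iter().map(|set| set.len() as i64).collect()
}

/// Layout 2 (hypothetical helper): a single set of `(value, group id)` pairs,
/// i.e. `HashSet<(T::Native, usize)>`. Counting is deferred to the end and
/// done by iterating over every stored pair.
fn count_distinct_flat_set(
    values: &[i64],
    group_ids: &[usize],
    num_groups: usize,
) -> Vec<i64> {
    let mut seen: HashSet<(i64, usize)> = HashSet::new();
    for (&value, &group) in values.iter().zip(group_ids) {
        seen.insert((value, group));
    }
    let mut counts = vec![0i64; num_groups];
    for &(_, group) in &seen {
        counts[group] += 1;
    }
    counts
}
```

With values `[1, 1, 2, 3, 3, 3]` and group ids `[0, 0, 0, 1, 1, 2]`, both functions return `[2, 1, 1]`. The flat layout allocates one table regardless of the number of groups, at the cost of a final pass over all entries; how that would map onto incremental `update_batch`/`evaluate` calls is exactly the open question raised above.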

Dandandan added 2 commits May 7, 2025 22:52

Add PrimitiveDistinctCountGroupsAccumulator

ba897a7

Merge branch 'main' into implement_primitive_distinct_groups_accumulator

dad561d

github-actions bot added the functions Changes to functions implementation label May 7, 2025

Dandandan added 2 commits May 7, 2025 22:59

Add PrimitiveDistinctCountGroupsAccumulator

66a0b63

:werge branch 'implement_primitive_distinct_groups_accumulator' of gi…

a86e804

…thub.com:Dandandan/arrow-datafusion into implement_primitive_distinct_groups_accumulator

Dandandan changed the title ~~Add PrimitiveDistinctCountGroupsAccumulator~~ Add PrimitiveDistinctCountGroupsAccumulator, speed up PrimitiveDistinctCountAccumulator May 7, 2025

Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from 1738b0b to a86e804 Compare May 7, 2025 21:44

Dandandan changed the title ~~Add PrimitiveDistinctCountGroupsAccumulator, speed up PrimitiveDistinctCountAccumulator~~ Add PrimitiveDistinctCountGroupsAccumulator May 7, 2025

Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from cb9cc84 to a86e804 Compare May 7, 2025 22:15

Cleanup

ee69484

Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from 5e14c73 to ee69484 Compare May 7, 2025 22:36

Dandandan added 4 commits May 8, 2025 00:55

Cleanup

580e262

Cleanup

81d0408

Cleanup

050e585

Cleanup

148c487
