Skip to content

Add PrimitiveDistinctCountGroupsAccumulator #15985

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented May 7, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

Speed up queries with group by + distinct count, for primitives.

The original code is taken from @waynexia + the change to use HashTable.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label May 7, 2025
Dandandan added 2 commits May 7, 2025 22:59
…thub.com:Dandandan/arrow-datafusion into implement_primitive_distinct_groups_accumulator
@Dandandan Dandandan changed the title Add PrimitiveDistinctCountGroupsAccumulator Add PrimitiveDistinctCountGroupsAccumulator, speed up PrimitiveDistinctCountAccumulator May 7, 2025
@Dandandan Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from 1738b0b to a86e804 Compare May 7, 2025 21:44
@Dandandan Dandandan changed the title Add PrimitiveDistinctCountGroupsAccumulator, speed up PrimitiveDistinctCountAccumulator Add PrimitiveDistinctCountGroupsAccumulator May 7, 2025
@Dandandan Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from cb9cc84 to a86e804 Compare May 7, 2025 22:15
@Dandandan Dandandan force-pushed the implement_primitive_distinct_groups_accumulator branch from 5e14c73 to ee69484 Compare May 7, 2025 22:36
@Dandandan
Copy link
Contributor Author

Dandandan commented May 8, 2025

This gets a small performance boost on clickbench query 9 (~9% on my end).

I am actually wondering if we can do further. I think we could store something like HashSet<(T::Native, usize)> (unique value + group id) instead of Vec<HashSet<T::Native>> (hashset per group) and delaying counting the values until the end by iterating all the values (instead of .len()).

"Obvious" advantage is that we avoid creating many hashsets for high cardinality cases which makes performance and memory usage bad.

However it seems kind of tricky of how to integrate it in the current groupsaccumulator setup 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions Changes to functions implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant