Add approx_distinct_count #20735
Conversation
    hll.add(hash_values.begin(), hash_values.end(), cuda::stream_ref{stream.value()});
    return static_cast<cudf::size_type>(hll.estimate(cuda::stream_ref{stream.value()}));
  }
Thinking about multi-GPU approx distinct count: I believe that two sketches can be combined by some binary operator, and that this merge commutes with the estimate function, i.e. (hll(A) + hll(B)).estimate() == hll(A + B).estimate().
To produce a global approx distinct count from the GPU-local ones, I need to do this merge.
Can you provide an interface to return the hll.sketch_bytes() as an object that I can then combine with another sketch that was constructed using the same hashing scheme and approximation size?
Perhaps, spitballing:
std::unique_ptr<rmm::device_buffer> approx_distinct_count_sketch(args_as_for_approx_distinct_count);

std::unique_ptr<rmm::device_buffer> merge_sketches(std::span<rmm::device_buffer> sketches);

?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have merge APIs in HLL that address this need: https://github.com/NVIDIA/cuCollections/blob/d36905c69ce02d74abdd31dc864ce3e1ffc5a7db/include/cuco/hyperloglog.cuh#L159-L221. The question is really about how to surface this capability in libcudf. One idea I had is to expose a class like approx_estimator in libcudf so users can perform custom operations such as merge. However, that class would essentially just wrap cuco::hyperloglog, meaning that for multi-GPU scenarios users could simply use cuco::hyperloglog directly without needing any cudf abstraction. Does that sound reasonable, or is there something I’m overlooking?
On second thought, exposing an object-oriented estimator instead of the current free function is likely the better approach. It offers significantly more flexibility, and given the complexity of row operations and null/NaN handling, relying on users to manage those aspects themselves would be error-prone.
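One possible shape for such an estimator (a non-runnable interface sketch; every name here is hypothetical, and the real design would wrap cuco::hyperloglog while keeping row hashing and null/NaN handling inside libcudf):

```cpp
// Hypothetical interface sketch, not an actual libcudf API.
class approx_distinct_count_estimator {
 public:
  explicit approx_distinct_count_estimator(std::size_t sketch_size_kb);

  // Hashes rows of `input` (handling nulls/NaNs) and adds them to the sketch.
  void add(cudf::table_view const& input, rmm::cuda_stream_view stream);

  // Merges another sketch built with the same hash scheme and sketch size.
  void merge(approx_distinct_count_estimator const& other,
             rmm::cuda_stream_view stream);

  cudf::size_type estimate(rmm::cuda_stream_view stream) const;

  // Raw sketch bytes, e.g. for exchanging sketches across GPUs/processes.
  cuda::std::span<std::byte const> sketch_bytes() const;
};
```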
Yeah, I think NaN/null handling should be provided by us rather than the end user. I've also not yet looked at all the row_hasher APIs; do we expose those in the public interface?
That's a very good point. The row operators reside in the detail namespace.