Add approx_distinct_count #20735
Conversation
    hll.add(hash_values.begin(), hash_values.end(), cuda::stream_ref{stream.value()});
    return static_cast<cudf::size_type>(hll.estimate(cuda::stream_ref{stream.value()}));
  }
Thinking about multi-GPU approx distinct count: I believe that two sketches can be combined by some binary operator, and that this merge commutes with the estimate function, i.e. (hll(A) + hll(B)).estimate() == hll(A + B).estimate().
To produce a global approx distinct count from the GPU-local ones, I need to do this merge.
Can you provide an interface to return the hll.sketch_bytes() as an object that I can then combine with another sketch that was constructed using the same hashing scheme and approximation size?
Perhaps, spitballing:
std::unique_ptr<rmm::device_buffer> approx_distinct_count_sketch(args_as_for_approx_distinct_count);

std::unique_ptr<rmm::device_buffer> merge_sketches(std::span<rmm::device_buffer> sketches);

?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have merge APIs in HLL that address this need: https://github.com/NVIDIA/cuCollections/blob/d36905c69ce02d74abdd31dc864ce3e1ffc5a7db/include/cuco/hyperloglog.cuh#L159-L221. The question is really about how to surface this capability in libcudf. One idea I had is to expose a class like approx_estimator in libcudf so users can perform custom operations such as merge. However, that class would essentially just wrap cuco::hyperloglog, meaning that for multi-GPU scenarios users could simply use cuco::hyperloglog directly without needing any cudf abstraction. Does that sound reasonable, or is there something I’m overlooking?
On second thought, exposing an object-oriented estimator instead of the current free function is likely the better approach. It offers significantly more flexibility, and given the complexity of row operations and null/NaN handling, relying on users to manage those aspects themselves would be error-prone.
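One possible shape for such an estimator (a non-runnable interface sketch; every name here is hypothetical, and the real design would wrap cuco::hyperloglog while keeping row hashing and null/NaN handling inside libcudf):

```cpp
// Hypothetical interface sketch, not an actual libcudf API.
class approx_distinct_count_estimator {
 public:
  explicit approx_distinct_count_estimator(std::size_t sketch_size_kb);

  // Hashes rows of `input` (handling nulls/NaNs) and adds them to the sketch.
  void add(cudf::table_view const& input, rmm::cuda_stream_view stream);

  // Merges another sketch built with the same hash scheme and sketch size.
  void merge(approx_distinct_count_estimator const& other,
             rmm::cuda_stream_view stream);

  cudf::size_type estimate(rmm::cuda_stream_view stream) const;

  // Raw sketch bytes, e.g. for exchanging sketches across GPUs/processes.
  cuda::std::span<std::byte const> sketch_bytes() const;
};
```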
Yeah, I think NaN/null handling should be provided by us rather than the end user. I've also not yet looked at all the row_hasher APIs; do we expose those in the public interface?
That's a very good point. The row operators reside in the detail namespace.