@ttnghia ttnghia commented Nov 26, 2025

While computing compound aggregations in hash-based groupby, we may perform unnecessary work or repeat the same computation multiple times. For example, we may output nulls for aggregation results and launch a kernel to count those nulls, even though the results are computed only as intermediate data for other compound aggregations and their null masks and null counts are never accessed.

When there are many aggregations (hundreds to thousands or more), such repeated or unnecessary work accumulates into significant runtime overhead, so we want to reduce it as much as possible. This PR does so by modifying the existing hash-based groupby aggregation framework to allow ignoring nulls in the aggregation outputs. As a result, the aggregations requested by users can still have nulls as before (no change), while aggregations generated only as intermediate data for computing compound aggregations are always non-nullable.
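
The idea can be sketched outside of libcudf as follows. This is a minimal illustration, not the code in this PR: the `Kind` enum, `expand`, and `compute_force_non_nullable` are hypothetical names, and the decomposition of MEAN into SUM and COUNT_VALID is an assumption made only for the example.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for cudf::aggregation::Kind; not the real enum.
enum class Kind { SUM, COUNT_VALID, MEAN };

// Expand one requested aggregation into the simple aggregations needed to compute it.
// Assumption for illustration: MEAN is derived from SUM and COUNT_VALID.
inline std::vector<Kind> expand(Kind k)
{
  if (k == Kind::MEAN) { return {Kind::SUM, Kind::COUNT_VALID}; }
  return {k};
}

// Collect the simple aggregations to run (each one only once) and, for each of them,
// store 1 if its output can be forced non-nullable (the user did not request it), else 0.
inline std::vector<std::int8_t> compute_force_non_nullable(std::vector<Kind> const& requested,
                                                           std::vector<Kind>& simple_out)
{
  std::set<Kind> const requested_set(requested.begin(), requested.end());
  std::set<Kind> seen;
  std::vector<std::int8_t> flags;
  for (auto const k : requested) {
    for (auto const s : expand(k)) {
      if (!seen.insert(s).second) { continue; }  // already scheduled, skip the repeat
      simple_out.push_back(s);
      flags.push_back(requested_set.count(s) != 0 ? 0 : 1);
    }
  }
  return flags;
}
```

With a request of MEAN and SUM, this sketch schedules SUM and COUNT_VALID once each. SUM keeps its nullable output because the user asked for it, while COUNT_VALID is flagged as non-nullable because it only feeds MEAN.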

Closes #20734.

@ttnghia ttnghia self-assigned this Nov 26, 2025
@ttnghia ttnghia requested a review from a team as a code owner November 26, 2025 05:53
@ttnghia ttnghia added this to libcudf Nov 26, 2025
@ttnghia ttnghia added 3 - Ready for Review (Ready for review by team), libcudf (Affects libcudf (C++/CUDA) code.), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change) labels Nov 26, 2025
Comment on lines -96 to -105
std::vector<std::unique_ptr<aggregation>> visit(
  data_type, cudf::detail::correlation_aggregation const&) override
{
  std::vector<std::unique_ptr<aggregation>> aggs;
  aggs.push_back(make_sum_aggregation());
  // COUNT_VALID
  aggs.push_back(make_count_aggregation());

  return aggs;
}
Contributor Author

Unused code: the correlation aggregation is not yet supported in the hash-based pipeline.

// If it is not an input aggregation, we can force its output to be non-nullable
// (by storing `1` value in the `force_non_nullable` vector).
auto const is_input_agg = input_agg_kinds_set.contains(agg->kind);
force_non_nullable.push_back(is_input_agg ? 0 : 1);
Contributor

Suggested change
- force_non_nullable.push_back(is_input_agg ? 0 : 1);
+ force_non_nullable.push_back(not is_input_agg);

Using int (4 bytes) for this vector seems wasteful; perhaps use int8_t instead. Unfortunately, bool will not work with spans.
I also recommend using std::span instead of cudf::host_span if possible.
We can revisit the other host_span usages here in a later PR. Reference: #20539

Contributor Author

I've updated the code to use int8_t.
I also tried using cuda::std::span, but it failed to compile with:

error: no suitable user-defined conversion from "std::tuple_element<1UL, const std::tuple<cudf::table_view, cudf::detail::host_vector<cudf::aggregation::Kind>, std::vector<std::unique_ptr<cudf::aggregation, std::default_delete<cudf::aggregation>>, std::allocator<std::unique_ptr<cudf::aggregation, std::default_delete<cudf::aggregation>>>>, std::vector<int8_t, std::allocator<int8_t>>, bool>>::type" 
(aka "const std::__tuple_element_t<1UL, std::tuple<cudf::table_view, cudf::detail::host_vector<cudf::aggregation::Kind>, std::vector<std::unique_ptr<cudf::aggregation, std::default_delete<cudf::aggregation>>, std::allocator<std::unique_ptr<cudf::aggregation, std::default_delete<cudf::aggregation>>>>, std::vector<signed char, std::allocator<signed char>>, bool>>") 
to "cuda::std::__4::span<const cudf::aggregation::Kind, 18446744073709551615UL>" exists

Contributor

Since this is host-only, std::span should be enough; there is no need for cuda::std::span.
I'm not sure whether that will help here, though. I can look at this more closely next week.

Contributor Author

Ah, I found that the issue is the implicit conversion from cudf::detail::host_vector to (cuda::)std::span. That conversion is not implemented yet, so we should add it soon.
When I tried using std::span in place of host_span, the change cascaded into nearly a dozen unrelated files, so I would rather have the std::span adoption done in a separate PR.

Contributor

Ok, I tried std::span locally and only had to change 6 files in this PR.

Contributor

@davidwendt davidwendt Nov 26, 2025

The host_vector should not come into play. Are you trying to change all of the host_span usages?
I recommend changing only the host_span usage for this new int8 variable; we'll change the other host_span usages in a follow-up PR.

Contributor Author

Yes, I tried to change host_span for this new variable and one other, since they are passed to the same function:

std::unique_ptr<table> create_results_table(size_type output_size,
                                            table_view const& values,
                                            host_span<aggregation::Kind const> agg_kinds,
                                            host_span<int8_t const> force_non_nullable,

so it makes more sense to change both together.

Contributor

I understand, but that would require too many changes, whereas changing just the one for this PR is enough for it to pass.
Please just change the one variable. We can change the other one(s) in a later PR.

Contributor Author

Done. std::span looks great now. Looking forward to the full adoption 👍

ttnghia commented Nov 26, 2025

Benchmark for M2 with multiple aggregation requests:

# groupby_m2_var_std

## [0] Quadro RTX 6000

|  T  |  U  |  value_key_ratio  |  num_rows  |  null_probability  |  num_aggs  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-----|-----|-------------------|------------|--------------------|------------|------------|-------------|------------|-------------|---------------|---------|----------|
| I32 | 11  |        20         |   100000   |        0.5         |     1      | 201.283 us |      15.03% | 173.175 us |      12.42% |    -28.108 us | -13.96% |   FAST   |
| I32 | 11  |        100        |   100000   |        0.5         |     1      | 188.973 us |      13.28% | 164.053 us |      10.28% |    -24.920 us | -13.19% |   FAST   |
| I32 | 11  |        20         |  10000000  |        0.5         |     1      |   3.472 ms |       2.65% |   3.306 ms |       1.98% |   -166.058 us |  -4.78% |   FAST   |
| I32 | 11  |        100        |  10000000  |        0.5         |     1      |   2.374 ms |       3.10% |   2.281 ms |       1.90% |    -93.329 us |  -3.93% |   FAST   |
| I32 | 11  |        20         |   100000   |        0.5         |     10     | 964.160 us |       5.69% | 757.818 us |       3.66% |   -206.341 us | -21.40% |   FAST   |
| I32 | 11  |        100        |   100000   |        0.5         |     10     | 978.234 us |       5.45% | 772.746 us |       3.65% |   -205.488 us | -21.01% |   FAST   |
| I32 | 11  |        20         |  10000000  |        0.5         |     10     |  13.937 ms |       2.04% |  12.358 ms |       1.79% |  -1579.058 us | -11.33% |   FAST   |
| I32 | 11  |        100        |  10000000  |        0.5         |     10     |  10.689 ms |       2.41% |   9.813 ms |       2.01% |   -875.552 us |  -8.19% |   FAST   |
| I32 | 11  |        20         |   100000   |        0.5         |     50     |   4.500 ms |       3.35% |   3.508 ms |       3.72% |   -992.328 us | -22.05% |   FAST   |
| I32 | 11  |        100        |   100000   |        0.5         |     50     |   4.495 ms |       3.45% |   3.491 ms |       3.37% |  -1004.745 us | -22.35% |   FAST   |
| I32 | 11  |        20         |  10000000  |        0.5         |     50     |  62.091 ms |       1.07% |  53.808 ms |       1.20% |  -8283.342 us | -13.34% |   FAST   |
| I32 | 11  |        100        |  10000000  |        0.5         |     50     |  49.054 ms |       1.21% |  44.513 ms |       1.31% |  -4540.697 us |  -9.26% |   FAST   |
| I32 | 11  |        20         |   100000   |        0.5         |    100     |   9.550 ms |       2.51% |   7.594 ms |       2.89% |  -1956.117 us | -20.48% |   FAST   |
| I32 | 11  |        100        |   100000   |        0.5         |    100     |   9.728 ms |       3.06% |   7.761 ms |       3.45% |  -1966.969 us | -20.22% |   FAST   |
| I32 | 11  |        20         |  10000000  |        0.5         |    100     | 123.197 ms |       0.72% | 106.395 ms |       0.69% | -16802.717 us | -13.64% |   FAST   |
| I32 | 11  |        100        |  10000000  |        0.5         |    100     |  97.738 ms |       0.77% |  88.312 ms |       0.79% |  -9426.127 us |  -9.64% |   FAST   |
| F64 | 11  |        20         |   100000   |        0.5         |     1      | 194.605 us |      17.00% | 167.257 us |       6.73% |    -27.348 us | -14.05% |   FAST   |
| F64 | 11  |        100        |   100000   |        0.5         |     1      | 191.900 us |       9.32% | 166.871 us |       8.15% |    -25.028 us | -13.04% |   FAST   |
| F64 | 11  |        20         |  10000000  |        0.5         |     1      |   3.637 ms |       2.74% |   3.464 ms |       2.22% |   -173.288 us |  -4.76% |   FAST   |
| F64 | 11  |        100        |  10000000  |        0.5         |     1      |   2.442 ms |       2.52% |   2.349 ms |       2.57% |    -92.728 us |  -3.80% |   FAST   |
| F64 | 11  |        20         |   100000   |        0.5         |     10     |   1.011 ms |       4.37% | 810.755 us |       2.59% |   -200.113 us | -19.80% |   FAST   |
| F64 | 11  |        100        |   100000   |        0.5         |     10     |   1.018 ms |       5.26% | 816.647 us |       3.62% |   -201.221 us | -19.77% |   FAST   |
| F64 | 11  |        20         |  10000000  |        0.5         |     10     |  15.227 ms |       2.05% |  13.553 ms |       1.99% |  -1674.619 us | -11.00% |   FAST   |
| F64 | 11  |        100        |  10000000  |        0.5         |     10     |  11.207 ms |       2.31% |  10.308 ms |       2.42% |   -898.704 us |  -8.02% |   FAST   |
| F64 | 11  |        20         |   100000   |        0.5         |     50     |   4.757 ms |       3.66% |   3.755 ms |       3.18% |  -1001.763 us | -21.06% |   FAST   |
| F64 | 11  |        100        |   100000   |        0.5         |     50     |   4.685 ms |       3.73% |   3.692 ms |       3.57% |   -992.994 us | -21.19% |   FAST   |
| F64 | 11  |        20         |  10000000  |        0.5         |     50     |  67.446 ms |       0.93% |  58.894 ms |       1.06% |  -8551.881 us | -12.68% |   FAST   |
| F64 | 11  |        100        |  10000000  |        0.5         |     50     |  50.583 ms |       1.28% |  46.099 ms |       1.24% |  -4484.616 us |  -8.87% |   FAST   |
| F64 | 11  |        20         |   100000   |        0.5         |    100     |   9.934 ms |       2.77% |   7.993 ms |       3.10% |  -1941.621 us | -19.54% |   FAST   |
| F64 | 11  |        100        |   100000   |        0.5         |    100     |   9.903 ms |       3.57% |   7.926 ms |       3.88% |  -1976.553 us | -19.96% |   FAST   |
| F64 | 11  |        20         |  10000000  |        0.5         |    100     | 133.089 ms |       0.57% | 115.708 ms |       0.65% | -17380.691 us | -13.06% |   FAST   |
| F64 | 11  |        100        |  10000000  |        0.5         |    100     | 100.097 ms |       0.68% |  90.714 ms |       0.78% |  -9382.765 us |  -9.37% |   FAST   |

For the MEAN aggregation, the benchmark shows much smaller changes: every difference is within 10%, and almost all cases are reported as SAME:

## [0] Quadro RTX 6000

|  T  |  U  |  value_key_ratio  |  num_rows  |  null_probability  |  num_aggs  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-----|-----|-------------------|------------|--------------------|------------|------------|-------------|------------|-------------|-------------|---------|----------|
| I32 | 11  |        20         |   100000   |        0.5         |     1      | 176.381 us |      45.99% | 161.916 us |      14.81% |  -14.465 us |  -8.20% |   SAME   |
| I32 | 11  |        100        |   100000   |        0.5         |     1      | 165.863 us |      35.52% | 162.881 us |      37.47% |   -2.981 us |  -1.80% |   SAME   |
| I32 | 11  |        20         |  10000000  |        0.5         |     1      |   3.682 ms |       6.62% |   3.872 ms |       7.68% |  189.568 us |   5.15% |   SAME   |
| I32 | 11  |        100        |  10000000  |        0.5         |     1      |   2.941 ms |      10.32% |   2.842 ms |       8.32% |  -98.556 us |  -3.35% |   SAME   |
| I32 | 11  |        20         |   100000   |        0.5         |     10     | 756.120 us |      17.72% | 701.317 us |       7.39% |  -54.804 us |  -7.25% |   SAME   |
| I32 | 11  |        100        |   100000   |        0.5         |     10     | 668.265 us |       7.67% | 710.291 us |      14.59% |   42.026 us |   6.29% |   SAME   |
| I32 | 11  |        20         |  10000000  |        0.5         |     10     |   9.150 ms |       4.02% |   9.262 ms |       4.18% |  111.320 us |   1.22% |   SAME   |
| I32 | 11  |        100        |  10000000  |        0.5         |     10     |   7.130 ms |       4.82% |   7.176 ms |       4.95% |   45.476 us |   0.64% |   SAME   |
| I32 | 11  |        20         |   100000   |        0.5         |     50     |   3.262 ms |       6.30% |   3.252 ms |       5.51% |  -10.407 us |  -0.32% |   SAME   |
| I32 | 11  |        100        |   100000   |        0.5         |     50     |   3.097 ms |       6.51% |   3.138 ms |       5.74% |   40.987 us |   1.32% |   SAME   |
| I32 | 11  |        20         |  10000000  |        0.5         |     50     |  37.531 ms |       4.51% |  37.754 ms |       4.74% |  223.082 us |   0.59% |   SAME   |
| I32 | 11  |        100        |  10000000  |        0.5         |     50     |  30.869 ms |       4.64% |  31.073 ms |       5.08% |  204.568 us |   0.66% |   SAME   |
| I32 | 11  |        20         |   100000   |        0.5         |    100     |   6.595 ms |       5.15% |   6.640 ms |       6.10% |   44.763 us |   0.68% |   SAME   |
| I32 | 11  |        100        |   100000   |        0.5         |    100     |   6.507 ms |       3.29% |   6.219 ms |       5.90% | -287.596 us |  -4.42% |   FAST   |
| I32 | 11  |        20         |  10000000  |        0.5         |    100     |  74.641 ms |       4.89% |  74.342 ms |       4.56% | -298.932 us |  -0.40% |   SAME   |
| I32 | 11  |        100        |  10000000  |        0.5         |    100     |  61.335 ms |       4.96% |  60.726 ms |       4.10% | -609.406 us |  -0.99% |   SAME   |
| F64 | 11  |        20         |   100000   |        0.5         |     1      | 167.726 us |      39.44% | 165.287 us |      30.44% |   -2.438 us |  -1.45% |   SAME   |
| F64 | 11  |        100        |   100000   |        0.5         |     1      | 171.336 us |      48.40% | 155.570 us |      10.20% |  -15.766 us |  -9.20% |   SAME   |
| F64 | 11  |        20         |  10000000  |        0.5         |     1      |   3.766 ms |       7.98% |   3.773 ms |       7.11% |    7.286 us |   0.19% |   SAME   |
| F64 | 11  |        100        |  10000000  |        0.5         |     1      |   3.048 ms |      11.82% |   2.936 ms |       8.59% | -112.381 us |  -3.69% |   SAME   |
| F64 | 11  |        20         |   100000   |        0.5         |     10     | 772.546 us |      20.00% | 749.889 us |      12.22% |  -22.657 us |  -2.93% |   SAME   |
| F64 | 11  |        100        |   100000   |        0.5         |     10     | 741.875 us |      17.05% | 729.755 us |      11.68% |  -12.120 us |  -1.63% |   SAME   |
| F64 | 11  |        20         |  10000000  |        0.5         |     10     |   9.796 ms |       4.69% |   9.909 ms |       4.53% |  113.200 us |   1.16% |   SAME   |
| F64 | 11  |        100        |  10000000  |        0.5         |     10     |   7.509 ms |       4.69% |   7.528 ms |       4.89% |   19.545 us |   0.26% |   SAME   |
| F64 | 11  |        20         |   100000   |        0.5         |     50     |   3.424 ms |       7.12% |   3.356 ms |       5.42% |  -68.060 us |  -1.99% |   SAME   |
| F64 | 11  |        100        |   100000   |        0.5         |     50     |   3.311 ms |       6.95% |   3.313 ms |       5.69% |    1.903 us |   0.06% |   SAME   |
| F64 | 11  |        20         |  10000000  |        0.5         |     50     |  40.095 ms |       4.40% |  40.201 ms |       4.70% |  106.728 us |   0.27% |   SAME   |
| F64 | 11  |        100        |  10000000  |        0.5         |     50     |  31.914 ms |       4.66% |  32.028 ms |       5.00% |  113.314 us |   0.36% |   SAME   |
| F64 | 11  |        20         |   100000   |        0.5         |    100     |   7.033 ms |       3.82% |   6.742 ms |       5.42% | -290.792 us |  -4.13% |   FAST   |
| F64 | 11  |        100        |   100000   |        0.5         |    100     |   6.586 ms |       5.28% |   6.603 ms |       5.18% |   17.229 us |   0.26% |   SAME   |
| F64 | 11  |        20         |  10000000  |        0.5         |    100     |  78.342 ms |       4.80% |  78.520 ms |       5.05% |  177.969 us |   0.23% |   SAME   |
| F64 | 11  |        100        |  10000000  |        0.5         |    100     |  62.989 ms |       5.05% |  62.435 ms |       4.36% | -554.394 us |  -0.88% |   SAME   |

Signed-off-by: Nghia Truong <[email protected]>
Development

Successfully merging this pull request may close these issues.

[FEA] Reduce overhead when computing compound aggregations in hash-based groupby

2 participants