Skip to content

Add rare_terms bucket aggregation #9826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
May 20, 2025

Conversation

dwelsch-esi
Copy link
Contributor

Description

Add rare_terms bucket aggregation.

Issues Resolved

Version

Frontend features

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

github-actions bot commented May 6, 2025

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@kolchfa-aws
Copy link
Collaborator

@sandeshkr419 Could you review this PR? Thanks!

@@ -59,6 +59,9 @@ GET opensearch_dashboards_sample_data_logs/_search
The values are returned with the key `key`.
`doc_count` specifies the number of documents in each bucket. By default, the buckets are sorted in descending order of `doc-count`.

It is possible to use `terms` to search for infrequent values by ordering returned values by ascending count ( `"order": {"count": "asc")` ). We strongly discourage this practice since doing so can cause large unknown errors if multiple shards are involved. We recommend using `rare_terms` instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some more clarification here:

The issue arises because a term that is globally infrequent might not appear as infrequent on every individual shard, or it might be entirely absent from the top (or, in this case, bottom) results returned by some shards. Conversely, a term might seem infrequent on one shard but be quite common on another. In both the scenarios, the rare term might be missed in individual shard results, which can lead to incorrect overall results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this clarification, thanks!


A sibling aggregation must be a multi-bucket aggregation (have multiple grouped values for a certain field) and the metric must be a numeric value.

`min_bucket`, `max_bucket`, `sum_bucket`, and `avg_bucket` are common sibling aggregations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think OpenSearch has documentation to these sibling aggregations. Might be good to list the available sibling aggregations out in a tabular format with small description.

Same applies for parent aggregations as well.

This folder wil probably have all pipeline aggregators: https://github.com/opensearch-project/OpenSearch/tree/main/server/src/main/java/org/opensearch/search/aggregations/pipeline

For each aggregator builder classes, look for the NAME property, which should tell you that the corresponding name is one of the pipeline aggregation. Example: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/aggregations/pipeline/AvgBucketPipelineAggregationBuilder.java#L48 tells that avg_bucket is one of the pipeline aggregation.

If you think there is a chance to improve java docs on the same, please open up an issue in core listing out the aggregations, where you think we need to improve on java docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the plan is to include all parent and all sibling aggregations in detail like #9796, then probably a table of links might be a good follow-up instead of a tabular format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do in a separate PR. There's a PR up for these aggregations #9599. Once it's merged, I'll add links to the corresponding pages.


The following parameters are optional:

- `gap_policy`: Real-world data can contain gaps or null values. You can specify the policy to deal with such missing data with the `gap_policy` property. You can either set the `gap_policy` property to `skip` to skip the missing data and continue from the next available value, or `insert_zeros` to replace the missing values with zero and continue running.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just say this is used to handle data gaps and then link it to bottom section of the page where you talk about data gaps in detail.

It feels too much detail here when it is explained in more detail in another section.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the old version of the index page. Just merged the new version, where this is implemented.

@sandeshkr419
Copy link
Contributor

I see pipeline aggregations is also rewritten (which looks very neat now, kudos) in the same PR.
A minor suggestion is to make independent changes in independent PRs to avoid confusion, plus quicker to review small PRs.

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws
Copy link
Collaborator

Thank you, @sandeshkr419! I addressed your comments.

Copy link
Collaborator

@natebower natebower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws kolchfa-aws added 6 - Done but waiting to merge PR: The work is done and ready to merge and removed 3 - Tech review PR: Tech review in progress labels May 20, 2025
@kolchfa-aws kolchfa-aws merged commit 0364466 into opensearch-project:main May 20, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap May 20, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 20, 2025
* Refactor pipeline aggregations to match other aggregation types. Remove aggregations/pipeline-agg.md; Add aggregations/pipeline/index.md. Individual aggregation files added in other PRs.

Signed-off-by: Dave Welsch <[email protected]>

* Add rare_terms bucket aggregation.

Signed-off-by: Dave Welsch <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Remove unnecessary redirects

Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: Dave Welsch <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 0364466)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants