calendar_interval in datehistogram #2459

@PSeitz

The calendar_interval parameter is currently not supported in the date histogram aggregation. This issue outlines the challenges and drafts possible solutions.
Unlike fixed_interval, calendar_interval may produce intervals of different sizes, depending on which timestamp ranges the months/years/etc. map to.
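
To make the size difference concrete, here is a small sketch that computes the length of a calendar month in seconds, in UTC only. The helper names `days_from_civil` and `month_len_seconds` are made up for this illustration and are not part of the codebase; timezones and DST shifts would add further variation on top of this.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
/// Standard civil-date arithmetic; `month` is 1..=12, `day` is 1..=31.
pub fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat March as the first month so the leap day falls at the end of the year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, in [0, 399]
    let mp = (month + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day - 1; // day of the March-based year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Length of the given calendar month in seconds (UTC, no leap seconds).
pub fn month_len_seconds(year: i64, month: i64) -> i64 {
    let (next_year, next_month) = if month == 12 { (year + 1, 1) } else { (year, month + 1) };
    let days = days_from_civil(next_year, next_month, 1) - days_from_civil(year, month, 1);
    days * 86_400
}
```

With this, January 2024 spans 2,678,400 seconds, February 2024 spans 2,505,600, and February 2023 spans 2,419,200, so a single fixed interval cannot describe a `month` bucket.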

Fixed Interval DateHistogram

Currently the date histogram collection reuses the histogram implementation, which collects sparsely by default. That means we don't allocate e.g. a Vec upfront; instead we have a HashMap:

buckets: FxHashMap<i64, SegmentHistogramBucketEntry>,

For every timestamp, we truncate to the start of its bucket and collect into that bucket.
This behavior allows for "drill-down", where we apply a filter and get a high-resolution histogram. Preallocating buckets over the min-max range of the column may OOM in these cases.
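
A minimal sketch of this sparse collection, under some assumptions: it uses `std::collections::HashMap` instead of `FxHashMap` to stay self-contained, `BucketEntry` is a simplified stand-in for `SegmentHistogramBucketEntry`, and `bucket_key` / `collect_sparse` are hypothetical names, not the actual implementation.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the per-bucket state; only a doc count is kept here.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct BucketEntry {
    pub doc_count: u64,
}

/// Truncates `timestamp` down to the start of its fixed-size bucket.
/// Assumes `interval > 0`. `div_euclid` rounds toward negative infinity,
/// so timestamps before the epoch land in the correct bucket as well.
pub fn bucket_key(timestamp: i64, interval: i64) -> i64 {
    timestamp.div_euclid(interval) * interval
}

/// Collects timestamps sparsely: a bucket is only allocated once a value hits it.
pub fn collect_sparse(timestamps: &[i64], interval: i64) -> HashMap<i64, BucketEntry> {
    let mut buckets: HashMap<i64, BucketEntry> = HashMap::new();
    for &ts in timestamps {
        buckets.entry(bucket_key(ts, interval)).or_default().doc_count += 1;
    }
    buckets
}
```

Memory use scales with the number of buckets that actually receive values, not with the min-max range of the column, which is what keeps drill-down queries cheap.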

Calendar Aware DateHistogram

With the calendar-aware date histogram we have two value spaces: the data stored as UTC and the data converted into a timezone. We want to avoid converting every fetched timestamp into its timezone-specific counterpart; ideally the bucket boundaries should already reflect the timezone.

The simplest solution for calendar_interval would be to reuse the range aggregation by preallocating the ranges. This has two problems:

  • Filter + high resolution may OOM due to too many buckets
  • A binary_search to find the bucket may be slow
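
For reference, the range-based approach could look roughly like the sketch below. It assumes the bucket boundaries were precomputed as sorted UTC timestamps (n + 1 boundaries for n buckets); `find_bucket` and `collect_ranges` are hypothetical names. Both problems are visible here: `counts` is allocated for every bucket upfront, and each value pays for a binary search.

```rust
/// Finds the bucket index for `ts`, given sorted bucket boundaries.
/// Bucket `i` covers the half-open range `boundaries[i]..boundaries[i + 1]`.
/// Returns `None` when `ts` lies before the first or at/after the last boundary.
pub fn find_bucket(boundaries: &[i64], ts: i64) -> Option<usize> {
    // Binary search for the number of boundaries that are <= ts: O(log n) per value.
    let idx = boundaries.partition_point(|&start| start <= ts);
    if idx == 0 || idx == boundaries.len() {
        None
    } else {
        Some(idx - 1)
    }
}

/// Counts values per preallocated bucket; values outside all buckets are skipped.
pub fn collect_ranges(boundaries: &[i64], timestamps: &[i64]) -> Vec<u64> {
    // One counter per bucket is allocated upfront, whether or not it gets hit.
    let mut counts = vec![0u64; boundaries.len().saturating_sub(1)];
    for &ts in timestamps {
        if let Some(i) = find_bucket(boundaries, ts) {
            counts[i] += 1;
        }
    }
    counts
}
```

As a usage example with day offsets for the first three months of 2024, `collect_ranges(&[0, 31, 60, 91], &[0, 30, 31, 90, 91])` yields `[2, 1, 1]`; the value 91 falls outside the last bucket and is dropped.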

Potential Solutions

  • A multi-level data structure that preallocates the top-level and is lazy on lower levels
  • Group buckets into fixed-interval ranges and use an algorithm similar to the current one inside each group, where we truncate to the closest bucket with the help of some metadata
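
One possible reading of the second idea, as a rough sketch only: lay a fixed-width grid over the time range, truncate each timestamp to its grid cell (as the fixed-interval histogram does today), and keep per-cell metadata that names the calendar bucket the cell starts in. `GridIndex`, `first_bucket`, and `bucket_of` are hypothetical names, and the design itself is an assumption rather than a settled proposal. Note that the lookup table is still preallocated over the range (one `usize` per cell); only the per-bucket state could stay in a sparse map keyed by the returned index.

```rust
/// Maps timestamps to variable-sized buckets via a fixed-width grid.
pub struct GridIndex {
    origin: i64,
    cell_width: i64,
    /// Strictly increasing bucket start timestamps plus one final end boundary.
    boundaries: Vec<i64>,
    /// Metadata per grid cell: index of the bucket that contains the cell's start.
    first_bucket: Vec<usize>,
}

impl GridIndex {
    pub fn new(boundaries: Vec<i64>, cell_width: i64) -> Self {
        assert!(boundaries.len() >= 2 && cell_width > 0);
        let origin = boundaries[0];
        let end = *boundaries.last().unwrap();
        // Number of cells needed to cover origin..end, rounded up.
        let num_cells = ((end - origin + cell_width - 1) / cell_width) as usize;
        let mut first_bucket = Vec::with_capacity(num_cells);
        let mut bucket = 0;
        for cell in 0..num_cells as i64 {
            let cell_start = origin + cell * cell_width;
            // Advance to the bucket that contains this cell's start.
            while boundaries[bucket + 1] <= cell_start {
                bucket += 1;
            }
            first_bucket.push(bucket);
        }
        Self { origin, cell_width, boundaries, first_bucket }
    }

    /// Returns the bucket index for `ts`, or `None` if it is out of range.
    pub fn bucket_of(&self, ts: i64) -> Option<usize> {
        if ts < self.origin || ts >= *self.boundaries.last().unwrap() {
            return None;
        }
        // Truncate to the grid cell, then correct forward using the boundaries.
        let cell = ((ts - self.origin) / self.cell_width) as usize;
        let mut bucket = self.first_bucket[cell];
        while self.boundaries[bucket + 1] <= ts {
            bucket += 1;
        }
        Some(bucket)
    }
}
```

If `cell_width` is no larger than the smallest bucket (e.g. 28 days for months), a cell overlaps at most two buckets, so the correction loop in `bucket_of` runs at most once and the lookup avoids a binary search.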
