Description
The calendar_interval parameter is currently not supported in the date histogram aggregation. This issue outlines the challenges and drafts possible solutions.
Unlike fixed_interval, calendar_interval can produce buckets of different sizes, depending on the timestamp ranges that the months, years, etc. map to (for example, February is shorter than March).
Fixed Interval DateHistogram
Currently, the date histogram collection reuses the histogram implementation, which collects sparsely by default. That means we don't allocate, for example, a Vec upfront; instead we have a HashMap:
buckets: FxHashMap<i64, SegmentHistogramBucketEntry>,
For every timestamp, we truncate down to its bucket timestamp and collect into that bucket.
This behavior allows for "drill-down", where we apply a filter and get a high resolution histogram. Preallocating over min-max of the column may OOM in these cases.
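The sparse collection described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `BucketEntry`, `bucket_key`, and `collect_sparse` are hypothetical names, the standard `HashMap` stands in for `FxHashMap`, and the bucket entry only tracks a document count.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SegmentHistogramBucketEntry`; only counts docs.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct BucketEntry {
    pub doc_count: u64,
}

/// Truncates `ts` down to the start of its fixed-size bucket.
/// `rem_euclid` makes timestamps before the epoch round downward as well.
pub fn bucket_key(ts: i64, interval: i64) -> i64 {
    ts - ts.rem_euclid(interval)
}

/// Collects timestamps sparsely: a bucket is only allocated once a
/// timestamp actually falls into it.
pub fn collect_sparse(timestamps: &[i64], interval: i64) -> HashMap<i64, BucketEntry> {
    let mut buckets: HashMap<i64, BucketEntry> = HashMap::new();
    for &ts in timestamps {
        buckets.entry(bucket_key(ts, interval)).or_default().doc_count += 1;
    }
    buckets
}
```

Calling `collect_sparse(&[0, 5, 10, 1_000_000], 10)` allocates three buckets, not the 100,001 that a dense allocation over the min-max range would need.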
Calendar Aware DateHistogram
With the calendar-aware date histogram we have two value spaces: the data as stored in UTC, and the data converted into a timezone. We want to avoid converting every fetched timestamp into its timezone-specific counterpart. Ideally, the buckets themselves should reflect the timezone.
The simplest solution for calendar_interval would be to reuse the range aggregation by preallocating the ranges. This has two problems:
- Filter + high resolution may OOM due to too many buckets
- A binary_search to find the bucket may be slow
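To make the range-based approach concrete, here is a rough sketch that preallocates month boundaries and locates a bucket by binary search. It is an illustration under simplifying assumptions: timestamps are seconds since the epoch, boundaries are computed in UTC only (no timezone handling), and `days_from_civil`, `month_bounds`, and `find_bucket` are hypothetical helpers, not existing APIs.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400; // year of era, in [0, 399]
    let mp = (m + 9) % 12; // month index with March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of the (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Preallocates `count + 1` month boundaries (UTC, in seconds), starting at
/// the first day of `year`-`month`. Bucket `i` covers `[bounds[i], bounds[i + 1])`.
pub fn month_bounds(year: i64, month: i64, count: usize) -> Vec<i64> {
    let first = year * 12 + (month - 1);
    (0..=count as i64)
        .map(|i| {
            let idx = first + i;
            days_from_civil(idx.div_euclid(12), idx.rem_euclid(12) + 1, 1) * 86_400
        })
        .collect()
}

/// Binary search for the bucket containing `ts`; `None` if out of range.
pub fn find_bucket(bounds: &[i64], ts: i64) -> Option<usize> {
    let idx = bounds.partition_point(|&b| b <= ts);
    if idx == 0 || idx == bounds.len() {
        None
    } else {
        Some(idx - 1)
    }
}
```

Both problems listed above are visible here: `month_bounds` grows with the number of buckets regardless of how many are hit, and every collected timestamp pays for an `O(log n)` search in `find_bucket`.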
Potential Solutions
- A multi-level data structure that preallocates the top-level and is lazy on lower levels
- Group buckets into fixed-interval ranges and use an algorithm similar to the current one inside each group, where we truncate to the closest bucket using some per-group metadata
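One possible reading of the grouping idea is sketched below. It is an assumption-laden illustration rather than a proposed design: `GroupedCalendarHistogram` is a hypothetical type, and `bucket_range` is a hypothetical closure that stands in for the expensive calendar and timezone logic by returning the `[start, end)` range of the calendar bucket containing a timestamp (with `end` strictly greater than the timestamp). Timestamps are truncated to a fixed-size group as in the current histogram. The calendar bucket starts overlapping a group are computed lazily, once per group, and act as the group's metadata.

```rust
use std::collections::HashMap;

pub struct GroupedCalendarHistogram<F: Fn(i64) -> (i64, i64)> {
    group_size: i64,
    /// Hypothetical calendar function: `[start, end)` of the bucket containing a timestamp.
    bucket_range: F,
    /// Group start -> sorted calendar bucket starts overlapping that group.
    group_starts: HashMap<i64, Vec<i64>>,
    /// Calendar bucket start -> document count (sparse).
    counts: HashMap<i64, u64>,
}

impl<F: Fn(i64) -> (i64, i64)> GroupedCalendarHistogram<F> {
    pub fn new(group_size: i64, bucket_range: F) -> Self {
        Self {
            group_size,
            bucket_range,
            group_starts: HashMap::new(),
            counts: HashMap::new(),
        }
    }

    pub fn collect(&mut self, ts: i64) {
        // Fixed-interval truncation, as in the current histogram.
        let group_start = ts - ts.rem_euclid(self.group_size);
        let group_end = group_start + self.group_size;
        let bucket_range = &self.bucket_range;
        // The calendar function is only called when a group is first seen.
        let starts = self.group_starts.entry(group_start).or_insert_with(|| {
            let mut starts = Vec::new();
            let mut t = group_start;
            while t < group_end {
                let (start, end) = bucket_range(t);
                starts.push(start);
                t = end;
            }
            starts
        });
        // A group holds only a few boundaries, so a linear scan is enough.
        let bucket = *starts
            .iter()
            .rev()
            .find(|&&s| s <= ts)
            .expect("the first start is never after the group start");
        *self.counts.entry(bucket).or_insert(0) += 1;
    }

    pub fn count(&self, bucket_start: i64) -> u64 {
        self.counts.get(&bucket_start).copied().unwrap_or(0)
    }

    pub fn groups_allocated(&self) -> usize {
        self.group_starts.len()
    }
}
```

With a group size close to the smallest calendar bucket, each group contains at most a couple of boundaries, so the per-timestamp cost stays near that of the fixed-interval path while memory stays proportional to the groups actually hit.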