This repository was archived by the owner on Apr 2, 2024. It is now read-only.

WIP: Adds support for Prometheus metric-rollups #594

Draft
wants to merge 17 commits into master

Conversation

Harkishen-Singh
Member

Description

TODO:

  • irate() and counter_reset_sum() SQL aggregates

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • CHANGELOG entry for user-facing changes
  • Updated the relevant documentation

Signed-off-by: Harkishen-Singh <[email protected]>

At a high level, this commit does 3 things.
1. create_metric_rollup is a straightforward function that creates a schema based on the rollup name, prefixed with 'ps_' (a sketch follows this commit message).

2. Functions to create a metric rollup depending on the metric type. This is done by
a function scan_for_new_rollups that runs every 30 minutes. This function looks for
pending metrics (by peeking inside the metric_with_rollup table) that need rollups
and creates them. The creation path is
scan_for_new_rollups -> create_metric_rollup_view (decides the metric type) -> create_rollup_for_{gauge|counter|summary}.
Counter and histogram have similar columns, hence they are handled by the same function.

3. Adds a set of utility functions like counter_reset_sum and irate, which are required for rollup creation. These
are temporary functions and they MUST be removed before the MVP.

Note: These functions are just dummies; their behaviour is not 100% correct.
Example: These functions require an ordered values array, but we cannot do ORDER BY in a Caggs query.
Hence, these functions will eventually be replaced by SQL aggregates written in Rust.
Signed-off-by: Harkishen-Singh <[email protected]>
Signed-off-by: Harkishen-Singh <[email protected]>
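
For illustration, a minimal sketch of the schema-creation step from point 1 above; the function name matches the commit message, but the signature and body are assumptions rather than the PR's actual code:

```sql
-- Hypothetical sketch: create a rollup schema prefixed with 'ps_'.
CREATE OR REPLACE FUNCTION _prom_catalog.create_metric_rollup(_rollup_name TEXT)
RETURNS VOID AS
$$
BEGIN
    -- e.g. rollup name 'short' produces schema 'ps_short'.
    EXECUTE FORMAT('CREATE SCHEMA IF NOT EXISTS %I', 'ps_' || _rollup_name);
END;
$$ LANGUAGE PLPGSQL;
```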

This commit does 2 things:
1. Updates prom_api.register_metric_view() to accept refresh_interval information.
2. Creates a Cagg refresher job per resolution. This job is created by `create_cagg_refresh_job`
after confirming that no refresh job for the given resolution already exists (a sketch follows this commit message).

Note:
1. We need to register the newly created metric-rollups using register_metric_view(). This is done in
`scan_for_new_rollups` after storing the new Caggs' metadata in a temp table, so that the Caggs are registered
only after all views are created; otherwise we get an error while creating the materialized view, saying the Cagg
already exists.

2. In register_metric_view(), we skip the checks offered by `get_first_level_view_on_metric()` since, based
on a quick investigation, those checks do not work when the Caggs are created with `timescaledb.materialized=true`. Moreover,
since we are creating the Caggs internally for metric-rollups, we need not worry about validating the Cagg view (which is what those checks do).
Signed-off-by: Harkishen-Singh <[email protected]>
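
A rough sketch of the per-resolution guard described in point 2, using stock TimescaleDB job APIs; the job procedure name and config shape are assumptions, not the PR's actual code:

```sql
DO $$
DECLARE
    _refresh_interval INTERVAL := INTERVAL '5 minutes';  -- hypothetical resolution
BEGIN
    -- Create a refresh job only if none exists for this resolution yet.
    IF NOT EXISTS (
        SELECT 1
        FROM timescaledb_information.jobs
        WHERE proc_name = 'execute_caggs_refresh_policy'  -- assumed job procedure
          AND schedule_interval = _refresh_interval
    ) THEN
        PERFORM public.add_job(
            '_prom_catalog.execute_caggs_refresh_policy'::regproc,
            _refresh_interval,
            config => jsonb_build_object('refresh_interval', _refresh_interval)
        );
    END IF;
END;
$$;
```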

This commit does 2 things: it handles compression and retention for metric-rollups and
for custom Caggs used in downsampling. The explanations of `How?` are present as comments
on the respective functions.
…ompressing Caggs.

Signed-off-by: Harkishen-Singh <[email protected]>
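
For context, compressing a Cagg and attaching retention with stock TimescaleDB APIs looks roughly like the following; the view name and intervals are illustrative, and the PR's own handling lives in the functions mentioned above:

```sql
-- Enable compression on a rollup Cagg (hypothetical view name).
ALTER MATERIALIZED VIEW ps_short.cpu_usage SET (timescaledb.compress = true);

-- Compress chunks older than an hour (value illustrative).
SELECT public.compress_chunk(c.chunk, if_not_compressed => true)
FROM public.show_chunks('ps_short.cpu_usage', older_than => INTERVAL '1 hour') AS c(chunk);

-- Drop downsampled data after a retention window (value illustrative).
SELECT public.add_retention_policy('ps_short.cpu_usage', drop_after => INTERVAL '90 days');
```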

This PR does 2 things:

1. Fixes the 'Caggs already exists' exception that was caused when a Cagg for one
resolution was created again when `scan_for_new_rollups()` was re-executed. Added a
test under `metric_rollups_scan_for_new_rollups` to ensure correct behaviour (an illustrative guard follows this commit message).

2. `execute_caggs_compression_policy()` used to panic when looping over an already compressed
chunk of a Cagg that still has some uncompressed chunks needing compression. Added rerunning of the
compression policy to ensure correct behaviour.
Signed-off-by: Harkishen-Singh <[email protected]>
Signed-off-by: Harkishen-Singh <[email protected]>
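
An illustrative guard of the kind fix 1 implies: check the catalog before creating a Cagg, so that re-running `scan_for_new_rollups()` is idempotent (schema and view names are hypothetical):

```sql
DO $$
BEGIN
    IF EXISTS (
        SELECT 1
        FROM timescaledb_information.continuous_aggregates
        WHERE view_schema = 'ps_short'   -- hypothetical rollup schema
          AND view_name   = 'cpu_usage'  -- hypothetical metric rollup
    ) THEN
        RAISE NOTICE 'Cagg already exists, skipping creation';
        RETURN;
    END IF;
    -- ... CREATE MATERIALIZED VIEW ... WITH (timescaledb.continuous) ...
END;
$$;
```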

This commit increases the refresh range for a Cagg. This is important since
earlier we accounted for only 1 bucket per refresh. This led to data loss
in systems with many metrics/series, since the refresh took so long
that it overran the refresh range.

This commit addresses the issue by adding a _safety_refresh_start_buffer
that allows the refresh to start from an earlier point, giving a bigger refresh range
in short downsampling scenarios like 5 minutes, thereby preventing sample loss.

This is usually not a problem for long downsampling scenarios like 1 hour.
Signed-off-by: Harkishen-Singh <[email protected]>
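
A minimal illustration of the widened window (the Cagg name and the buffer value are assumptions): instead of refreshing only the most recent bucket, the start of the window is pulled back by the safety buffer.

```sql
-- Refresh [now() - resolution - buffer, now()) rather than a single bucket.
CALL public.refresh_continuous_aggregate(
    'ps_short.cpu_usage',                              -- hypothetical 5-minute rollup
    now() - INTERVAL '5 minutes' - INTERVAL '1 hour',  -- start, pulled back by the buffer
    now()                                              -- end
);
```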
```diff
@@ -2562,7 +2564,7 @@ LANGUAGE PLPGSQL;
 REVOKE ALL ON FUNCTION _prom_catalog.create_metric_view(text) FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION _prom_catalog.create_metric_view(text) TO prom_writer;

-CREATE OR REPLACE FUNCTION prom_api.register_metric_view(schema_name text, view_name text, if_not_exists BOOLEAN = false)
+CREATE OR REPLACE FUNCTION prom_api.register_metric_view(schema_name text, view_name text, refresh_interval INTERVAL, if_not_exists BOOLEAN = false, downsample_id BIGINT = NULL)
```
Member Author


Note to self: Documentation changes referring to refresh_interval.
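
For reference, a call with the new signature might look like this (argument values are illustrative, not from the PR):

```sql
SELECT prom_api.register_metric_view('ps_short', 'cpu_usage', INTERVAL '5 minutes', true, NULL);
```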

```sql
_active_chunk_interval INTERVAL;
_ignore_downsampling_clause TEXT := '';
_refresh_downsampled_data BOOLEAN := (SELECT prom_api.get_global_downsampling_state()::BOOLEAN);
_should_refresh_clause TEXT := 'AND (SELECT should_refresh FROM _prom_catalog.downsample WHERE id = m.downsample_id) = TRUE';
```
Member Author


Should there be a coalesce() here? Otherwise it's a bug, since this is not going to update for custom Caggs when _refresh_downsampled_data is true.

Member Author


Maybe not, since for custom Caggs, downsample_id will be NULL, so this clause has no effect?
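
For context on the NULL semantics under discussion (an illustration, not from the PR): a scalar subquery that matches no rows yields NULL, and `NULL = TRUE` evaluates to NULL, so an AND clause built this way excludes such rows unless wrapped in COALESCE.

```sql
SELECT (SELECT TRUE WHERE FALSE) = TRUE          AS bare_clause;    -- NULL: row filtered out
SELECT COALESCE((SELECT TRUE WHERE FALSE), TRUE) AS with_coalesce;  -- TRUE: row kept
```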
