Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stage_agg duration_in should work as expected #780

Open
leppaott opened this issue Oct 24, 2023 · 2 comments
Open

stage_agg duration_in should work as expected #780

leppaott opened this issue Oct 24, 2023 · 2 comments
Labels
feature-request And area of analysis that could be made easier

Comments

@leppaott
Copy link

leppaott commented Oct 24, 2023

I may misunderstand the state_agg as it seems to calculate only deltas i.e. state transitions but it never considers the final state.

To demonstrate a quick test from the examples.

CREATE TABLE email_status (
  id BIGINT,
  ts TIMESTAMPTZ,
  status TEXT
);

SELECT create_hypertable('email_status','ts');

INSERT INTO email_status("ts", "id", "status")
VALUES
('2023-10-24 12:01:00',1,'draft');

SELECT time_bucket('5 minutes'::interval, ts) AS bucket,
  id,
  duration_in(state_agg(ts, status) , 'draft')
FROM email_status
GROUP BY bucket, id;

There's no state transition here so the result is also:

bucket                       |id|duration_in|
-----------------------------+--+-----------+
2023-10-24 12:00:00.000 +0300| 1|   00:00:00|

If we add the first "state transition" to another state 2 minutes later.

INSERT INTO email_status("ts", "id", "status")
VALUES
('2023-10-24 12:03:00',1,'saved');

We get:

bucket                       |id|duration_in|
-----------------------------+--+-----------+
2023-10-24 12:00:00.000 +0300| 1|   00:02:00|

Now the issue is when

a) state transitions don't happen within one bucket (see #652)
b) state transitions end sometime how do we calculate the duration in relation to the bucket time/origin?

Now if we query again the saved state:

SELECT time_bucket('5 minutes'::interval, ts) AS bucket,
  id,
  duration_in(state_agg(ts, status) , 'saved')
FROM email_status
GROUP BY bucket, id;

Surely we can reasonably expect the duration to be the remaining time on the bucket i.e. 3 minutes when it was inserted at 2 minutes as that's the duration of that state within that single bucket (assuming we don't do the query in the middle of the bucket but 5 minutes later).

But this is what we get back to zero:

bucket                       |id|duration_in|
-----------------------------+--+-----------+
2023-10-24 12:00:00.000 +0300| 1|   00:00:00|

Basically I think there should be a flag or even the default behavior for durations where the end_time is NULL and where we can return the bucket_width - (start_time - bucket_time)
or now() - start_time if it's a ongoing real-time bucket.

How have I tried to gapfill this or solve this?

I tried time_bucket_gapfill with locf carrying a status to following buckets, this is done in 1s or 1ms buckets which can be then combined with a 5 minute time_bucket. This works but for starters is very slow and unpractical for any larger time-windows.

SELECT "bucket",
      "uuid",
     duration_in("statusAgg", 'ACTIVE') AS "activeTime"
  FROM
  (SELECT time_bucket(INTERVAL '5 minutes', "time") AS "bucket",
	  "uuid",
	  state_agg("time", cast_to_text("status")) AS "statusAgg"
	  FROM
	  (select time_bucket_gapfill(INTERVAL '1 s', "time") AS "time",
	    "uuid",
	    locf(last("status", "time")) as "status"
	    FROM "StatusHistory"
	    WHERE "time" >= NOW() - INTERVAL '12 hours' AND "time" <= NOW()
	    AND "uuid"::text = '83e5858c-202c-43d0-a35e-00c28ee87152'
		GROUP by 1, 2) sub
	  GROUP BY 1, 2) ss
ORDER BY "bucket" DESC

One issue here is that continuous aggregates don't support time_bucket_gapfill so this can't be aggregated as far as I can tell. If there's any way to improve this, please let me know. Secondly this one refreshes only the 12 hours so if the final state change was even before that duration will be wrong.

Our first plan was to simply do hierarchical caggs like:

CREATE MATERIALIZED VIEW "HistoryAgg5Minute"
WITH (timescaledb.continuous) AS
SELECT time_bucket(INTERVAL '5 minutes', "time") AS "bucket",
  uuid,
  state_agg("time", cast_to_text("status")) AS "statusAgg"
  FROM "StatusHistory"
  GROUP BY 1, 2
WITH NO DATA;

and stack it with 1hour, 1 week aggregates etc. And we expected this would as simple as that and then use duration_in to get how long each status is within in each bucket - but of course the issue is partly also that there's no bucket for all time spans either. That could've been tolerated and programmatically fill in the blanks perhaps (although hard too) if at least the bucket in which state final state transition happens would reflect the state correctly.

I'm not the only one perplexed with this behavior:

#652
#479
https://www.timescale.com/forum/t/trouble-getting-interpolated-duration-in-to-show-expected-results/2059
https://www.timescale.com/forum/t/odd-results-with-state-aggregation-duration-in-across-time-bucket-boundaries/2156

Sum up:
To quote the docs from compact_state_agg which I believe should function similarly.

For example, you can use the compact_state_agg functions to track how much time a system spends in error, running, or starting states.

If the system reports once that it's running the aggregation reports that the duration it's been running is 0 if there's no transition after this state. I don't think it should work like this and this is more of a bug that the ending isn't handled.

@leppaott leppaott added the feature-request And area of analysis that could be made easier label Oct 24, 2023
@leppaott
Copy link
Author

leppaott commented Oct 25, 2023

interpolated_duration_in should work https://github.com/timescale/timescaledb-toolkit/blob/main/extension/src/state_aggregate.rs#L2383 almost works. Well I guess it does work except we can't gapfill more buckets where the interpolation would work.

EDIT: Ok figured it out. We can gapfill manually with a add_job that creates every bucket on the history table for every item every beginning of a new bucket - wasteful/inefficient as can't just gapfill on the aggregate itself. Then duration_in (or perhaps interpolated but we can just sum) can be used to calculate the ending duration up to now() or the end of the bucket. Wastes a lot of space!

What would be ideal query:

CREATE MATERIALIZED VIEW "HistoryAgg5Minute"
WITH (timescaledb.continuous) AS
SELECT time_bucket_gapfill(INTERVAL '5 minutes', "time") AS "bucket",
  uuid,
  state_agg("time", locf("status")) AS "statusAgg"
  FROM "StatusHistory"
  GROUP BY 1, 2
WITH NO DATA;

If gapfill supported caggs + locf could be nested like this inside state_agg. I think currently locf doesn't support it and not sure if there could be another aggregate that would aggregate locf(status).

So perhaps this PR isn't valid as interpolated_duration_in does what I described mostly and this was mostly about experience of using these. Feel free to close.

@leppaott
Copy link
Author

An example from https://github.com/timescale/timescaledb-toolkit/blob/main/extension/src/state_aggregate.rs#L2021

r#"INSERT INTO test VALUES
                ('2020-01-01 00:00:00+00', 'one'),
                ('2020-01-01 00:02:00+00', 'one'),
                ('2020-01-01 00:01:00+00', 'two'),
                ('2020-12-31 00:02:00+00', 'end')

where in the real world there is state end? Compact_state_agg is a good idea and also would be nice to have rollup(): StateAgg -> CompactStateAgg as a way to compact hierarchical aggs.

@leppaott leppaott reopened this Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request And area of analysis that could be made easier
Projects
None yet
Development

No branches or pull requests

1 participant