Introduce materialization for operations #247
base: develop
Conversation
This commit introduces a materialization to record the parent-child relation between operations. Namely, for a span of operation P that is the parent of a span with operation C, we record the 10-minute bucketed relation:
- bucket time
- id of operation P
- id of operation C
- count()

This gives us a way to quickly aggregate the service map.

The materialization is done like a continuous aggregate, but we couldn't use continuous aggregates since they don't support self-joins.

This materialization supports real-time aggregation, where the materialization is combined with a freshly calculated aggregation of the not-yet-materialized head of the data. This is available in the _ps_trace.operation_stats view.

A limitation of this approach is that we don't support invalidations, so if a new span comes in for a time period that has already been materialized, it is never corrected. Currently we materialize data that is no newer than 10 minutes prior to the most recent data.

A few assumptions are made in the materialization:
- A child span's start_time is >= the parent span's start_time. This seems obvious, but I couldn't find it documented in the spec.
- A child span's start_time is < the parent span's start_time + 1 hour. This is more questionable, but I think it's ok if we document the limitation.

The background job to materialize this data is set up to be separate from the rest of the jobs, for isolation and also because we always want to run this every 10 minutes and don't want parallelism.

Fixed parameters we may want to make configurable later:
- Size of bucket (currently 10 minutes)
- Amount aggregated in one loop/txn (1 bucket)
- Invalidation window (10 minutes)
- Chunk size of the materialized hypertable (7 days)
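The bucketed aggregation described above can be sketched roughly as follows. This is a simplified sketch assembled from the fragments in the diff below; the actual query in the PR additionally bounds the scan by the materialization watermark and runs one bucket per loop iteration:

```sql
-- Simplified sketch of the bucketed parent-child aggregation.
-- Table/column names follow the PR; watermark and window
-- predicates from the real procedure are omitted here.
INSERT INTO _ps_trace.operation_materialization
SELECT
    public.time_bucket('10 minutes', parent.start_time) AS parent_start_time,
    parent.operation_id AS parent_operation_id,
    child.operation_id  AS child_operation_id,
    count(*)            AS cnt
FROM _ps_trace.span AS parent
INNER JOIN _ps_trace.span AS child
    ON parent.span_id = child.parent_span_id
   AND parent.trace_id = child.trace_id
GROUP BY 1, 2, 3;
```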
Perhaps adding the max parent age to the configurable parameters is worth it as well. We never know what the end system is going to be like. I reckon that on a system where a delta of >1h is common, it won't be as significant of a performance penalty either.
--correctness check
SELECT * FROM real_view
EXCEPT
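The check is truncated here. A one-directional EXCEPT only catches rows missing from one side, so a full symmetric-difference check along these lines (the materialized view's name is assumed for illustration) would be:

```sql
-- Correctness check: rows present in either view but not the other.
-- An empty result means the materialization matches the freshly
-- computed aggregation. "materialized_view" is a placeholder name.
(SELECT * FROM real_view
 EXCEPT
 SELECT * FROM materialized_view)
UNION ALL
(SELECT * FROM materialized_view
 EXCEPT
 SELECT * FROM real_view);
```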
Neat :-)
_query_start := clock_timestamp();
INSERT INTO _ps_trace.operation_materialization
SELECT
    public.time_bucket(_bucket, parent.start_time) bucket,
Suggested change:
- public.time_bucket(_bucket, parent.start_time) bucket,
+ public.time_bucket(_bucket, parent.start_time) as bucket,
    child.operation_id as child_operation_id,
    count(*) as cnt
FROM
    _ps_trace.span parent
Suggested change:
- _ps_trace.span parent
+ _ps_trace.span as parent
FROM
    _ps_trace.span parent
INNER JOIN
    _ps_trace.span child ON (parent.span_id = child.parent_span_id AND parent.trace_id = child.trace_id)
Suggested change:
- _ps_trace.span child ON (parent.span_id = child.parent_span_id AND parent.trace_id = child.trace_id)
+ _ps_trace.span as child ON (parent.span_id = child.parent_span_id AND parent.trace_id = child.trace_id)
--we materialize one bucket at a time until we are done. We do this so that
--we can make steady progress even if we are pretty far behind.
WHILE (_start + _bucket) <= _global_end LOOP
I'm not certain this should be <= and not just <. Can you please elaborate on this?
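For context, one way to reason about the boundary, assuming each bucket covers the half-open interval [_start, _start + _bucket), which matches time_bucket semantics:

```sql
-- Illustration (assumed semantics): with _start = 12:00, _bucket = 10 min,
-- and _global_end = 12:10, the condition (_start + _bucket) <= _global_end
-- holds, so the complete bucket [12:00, 12:10) gets materialized; a strict
-- < would leave that fully closed bucket unmaterialized until _global_end
-- advanced past 12:10.
SELECT ('2023-01-01 12:00'::timestamptz + interval '10 minutes')
       <= '2023-01-01 12:10'::timestamptz AS materialize_last_bucket;  -- true
```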
INNER JOIN
    _ps_trace.span child ON (parent.span_id = child.parent_span_id AND parent.trace_id = child.trace_id)
WHERE
    parent.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
Suggested change:
- parent.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
+ parent.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::pg_catalog.timestamptz)
WHERE
    parent.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
    --assumes the child span starts on or after the parent starts
    AND child.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
Suggested change:
- AND child.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
+ AND child.start_time >= COALESCE(_ps_trace.materialize_watermark(), '-infinity'::pg_catalog.timestamptz)
WHERE parent_start_time < COALESCE(_ps_trace.materialize_watermark(), '-infinity'::timestamp with time zone)
UNION ALL
SELECT
    public.time_bucket('10 minute', parent.start_time) parent_start_time,
Perhaps worth putting the interval into an inlinable config function.
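A sketch of what such a config function might look like (the function name is hypothetical; a single-statement SQL function marked IMMUTABLE and PARALLEL SAFE can be inlined and constant-folded by the planner):

```sql
-- Hypothetical config function for the bucket width.
CREATE OR REPLACE FUNCTION _ps_trace.operation_materialization_bucket_width()
RETURNS interval
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$
    SELECT interval '10 minutes'
$$;

-- Usage in the materialization query:
--   public.time_bucket(
--       _ps_trace.operation_materialization_bucket_width(),
--       parent.start_time)
```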
    child_operation_id bigint,
    cnt bigint,
    --todo what else,
    UNIQUE(parent_start_time, parent_operation_id, child_operation_id)
Perhaps worth upgrading to a primary key to enforce NOT NULLs.
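A sketch of the suggested change (column names follow the diff above; the type of parent_start_time is assumed to be timestamptz). A primary key implicitly makes all of its columns NOT NULL:

```sql
-- Hypothetical sketch: PRIMARY KEY replaces the UNIQUE constraint
-- and forces the key columns to be NOT NULL.
CREATE TABLE _ps_trace.operation_materialization (
    parent_start_time   timestamptz NOT NULL,
    parent_operation_id bigint      NOT NULL,
    child_operation_id  bigint      NOT NULL,
    cnt                 bigint      NOT NULL,
    PRIMARY KEY (parent_start_time, parent_operation_id, child_operation_id)
);
```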
--keep this separate from the other jobs for independence and also since the right
--schedule here is always "every 10 min". We never want this to run in parallel so
--it's always just one job making progress. This job is usually pretty quiet so enable
--logging by default (should chirp once every 10 min).
Not sure I agree with this. It is quiet indeed, but what benefit does it provide? Perhaps it's better to expose the duration (or the end_time) in the job table (_ps_trace.operation_materialization) so that the user can assess the performance change of this over time and consider changing the bucket size.
If this is a file log entry then I'd see it as yet another line I wouldn't care about, since a single entry doesn't give you any information at all.
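One way the suggestion could look. This is only a sketch; the job-history table and column names here are assumptions, not part of the PR:

```sql
-- Hypothetical sketch: record per-run timing so materialization
-- performance can be observed over time instead of logged per run.
ALTER TABLE _ps_trace.operation_materialization_job_history
    ADD COLUMN last_run_started_at timestamptz,
    ADD COLUMN last_run_duration   interval;

-- The job would then record, per run, something like:
--   UPDATE ... SET last_run_started_at = _query_start,
--                  last_run_duration   = clock_timestamp() - _query_start;
```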