I tried the latest recording rules, and the scale of the apiserver availability rules seems off. Concretely, the rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%s has an additional multiplication factor * 24 * %s. I believe the change was introduced by #976, which correctly changes the underlying data to use the bucket at {le="+Inf"}. However, because that change removed the avg_over_time function from the query, the rule now reads the total increase over the period, which should not require further scaling.
For the sake of copy and paste, assume the SLO days value %s is 30d (the default). The recording rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d is computed from the metric cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d, selected with an explicit bucket label le, and that metric already includes the * 24 * 30 factor. Without any adjustment, the final rule apiserver_request:availability30d, which is composed of
1 - (
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
-
(
# too slow
(
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
or
vector(0)
)
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
)
+
# errors
sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})
will have the hour-to-total-days multiplication factor * 24 * 30 applied twice to the total count, while the scoped counts are still taken directly from the bucket metric itself without any further adjustment.
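To make the scale mismatch concrete, here is a rough worked sketch. The symbols are assumptions introduced only for illustration and do not appear in the rules: $\bar{c}$ is the 30d average of the hourly increase for the le="+Inf" bucket, and $N$ is the resulting 30d total.

```latex
\begin{align*}
\text{bucket:increase30d}\{le=\text{"+Inf"}\} &= \bar{c} \cdot 24 \cdot 30 = N \\
\text{count:increase30d (current rule)} &= N \cdot 24 \cdot 30 = 720\,N \\
\text{count:increase30d (expected)} &= N
\end{align*}
```

Under these assumptions, the "too slow" term subtracts bucket values on the scale of $N$ from a total on the scale of $720\,N$, so the numerator of the availability expression is dominated by the inflated total.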
From what I can tell, everything else is correct, and the fix may simply be to remove the multiplication factor from the count rule.
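As a sketch of what that could look like in the generated rules for the 30d default (this is only the suggestion above written out, not a tested or merged change; the indentation is assumed):

```yaml
# Sketch only: the count rule with the extra "* 24 * 30" removed.
# The bucket:increase30d input is assumed to already be scaled to the full 30d window.
- "expr": |
    sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"})
  "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
```

With this form, the total count and the scoped bucket counts in apiserver_request:availability30d would be on the same scale.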
Please provide any helpful snippets.
# previous rule
git checkout f4f0d150fb85b0eb4d57d8a74b387748f068e92f
make prometheus_rules.yaml
mv prometheus_rules.yaml old_rules.yaml
# new rule
git checkout a3affb372fc22fc7ddbf186743b2151fdad63aaf
make prometheus_rules.yaml
diff prometheus_rules.yaml old_rules.yaml
# 17a18,23
# > - "expr": |
# >     sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
# >   "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# > - "expr": |
# >     sum by (cluster, verb, scope) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d]) * 24 * 30)
# >   "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
# 24,29d29
# < - "expr": |
# <     sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})
# <   "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# < - "expr": |
# <     sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"} * 24 * 30)
# <   "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
What parts of the codebase are affected?
Rules