[Bug]: apiserver availability 30d recording rule time scale #990

edwintye · 2024-11-26T13:15:05Z

What happened?

I tried the latest recording rules, and the scale of the apiserver availability rules seems off. Specifically, the rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase%s carries an extra multiplication factor, * 24 * %s. I believe the change was introduced by #976, which correctly switches the underlying data to the bucket at {le="+Inf"}. However, because it also removed the avg_over_time function from the query, the rule now reads the total increase over the period, which should not require further scaling.
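To make the scaling argument concrete, here is a minimal Python sketch. It is not taken from the repository, and the hourly traffic numbers are made up. It mimics the two rule shapes: averaging hourly increases and scaling by 24 * 30, versus scaling a value that is already a 30-day total.

```python
# Hypothetical hourly request increases over a 30-day window (720 samples).
HOURS_PER_DAY = 24
SLO_DAYS = 30
hourly_increase = [100 + (h % 24) for h in range(HOURS_PER_DAY * SLO_DAYS)]

# The true total increase over the 30-day window.
true_total_30d = sum(hourly_increase)

# Old rule shape: avg_over_time(increase1h[30d]) * 24 * 30.
# Averaging the hourly values and scaling by the number of hours
# recovers the 30-day total.
avg_hourly = true_total_30d / len(hourly_increase)
old_rule_30d = avg_hourly * HOURS_PER_DAY * SLO_DAYS

# New rule shape: bucket increase30d{le="+Inf"} * 24 * 30.
# The bucket 30d value is already a 30-day total, so scaling it again
# inflates the result by a factor of 24 * 30 = 720.
bucket_30d = old_rule_30d
new_rule_30d = bucket_30d * HOURS_PER_DAY * SLO_DAYS

print("true total:     ", true_total_30d)
print("old rule shape: ", old_rule_30d)
print("new rule shape: ", new_rule_30d)
print("inflation factor:", new_rule_30d / true_total_30d)
```

Under these assumptions the old shape matches the true total, and the new shape is larger by exactly 24 * 30.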

For the sake of the copied rules below, assume the SLO window %s is 30d (the default). The recording rule cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d is built from the metric cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d with an explicit bucket label le, and that bucket rule already includes the * 24 * 30 factor. Without any adjustment, the final rule apiserver_request:availability30d that is composed of

      1 - (
        sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
        -
        (
          # too slow
          (
            sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
            or
            vector(0)
          )
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
          +
          sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
        )
        +
        # errors
        sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
      )
      /
      sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})

will have the hours-to-days multiplication factor * 24 * 30 applied twice to the total count, while the scoped counts are still taken directly from the bucket metric itself, without the extra factor.

From what I can tell, everything else is correct, and the fix may simply be to remove the multiplication factor.
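As a rough illustration of the effect on apiserver_request:availability30d, the sketch below evaluates the same formula shape in Python with invented request counts. The function name and all numbers are hypothetical. Only the structure (total count minus requests within threshold, plus errors, divided by all requests) follows the rule quoted above.

```python
def availability(total_count, within_threshold, errors, request_total):
    """Mirror the shape of the availability30d expression:
    1 - ((total_count - within_threshold) + errors) / request_total
    """
    too_slow = total_count - within_threshold
    return 1 - (too_slow + errors) / request_total

# Hypothetical 30-day read traffic.
request_total = 1_000_000     # code:apiserver_request_total:increase30d
within_threshold = 990_000    # sum of the scoped le="1"/"5"/"30" buckets
errors = 1_000                # 5xx responses

# Total count without the extra factor.
expected = availability(1_000_000, within_threshold, errors, request_total)

# Total count with the extra * 24 * 30 factor, scoped buckets unchanged.
inflated = availability(1_000_000 * 24 * 30, within_threshold, errors, request_total)

print("without extra factor:", expected)
print("with extra factor:   ", inflated)
```

With these made-up inputs, the inflated total count makes almost every request look "too slow", so the computed availability drops far below zero instead of sitting near 0.989.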

Please provide any helpful snippets.

# previous rule
git checkout f4f0d150fb85b0eb4d57d8a74b387748f068e92f
make prometheus_rules.yaml
mv prometheus_rules.yaml old_rules.yaml

# new rule
git checkout a3affb372fc22fc7ddbf186743b2151fdad63aaf
make prometheus_rules.yaml
# compare: '<' lines come from the new rules, '>' lines from the old rules
diff prometheus_rules.yaml old_rules.yaml

# 17a18,23
# >   - "expr": |
# >       sum by (cluster, verb, scope) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# >   - "expr": |
# >       sum by (cluster, verb, scope) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d]) * 24 * 30)
# >     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"
# 24,29d29
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h{le="+Inf"})
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h"
# <   - "expr": |
# <       sum by (cluster, verb, scope) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{le="+Inf"} * 24 * 30)
# <     "record": "cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d"

What parts of the codebase are affected?

Rules

I agree to the following terms:

I agree to follow this project's Code of Conduct.
I have filled out all the required information above to the best of my ability.
I have searched the issues of this repository and believe that this is not a duplicate.
I have confirmed this bug exists in the default branch of the repository, as of the latest commit at the time of submission.


skl added the bug and keepalive labels on Nov 26, 2024

StevenVdBGit mentioned this issue Dec 11, 2024

[kubernetesControlPlane-prometheusRule] cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d calculation is wrong prometheus-operator/kube-prometheus#2564

Open

Daniel-Vaz mentioned this issue Dec 12, 2024

[kube-prometheus-stack] broken Availability (30d) metrics prometheus-community/helm-charts#5043

Open
