
Investigate whether exact bucket is necessary #46

Open
brancz opened this issue Jan 6, 2021 · 2 comments

brancz commented Jan 6, 2021

Follow-up of #44 (comment)

Today, a latency target must exist as an exact histogram bucket. This is both difficult to discover as a user and brittle, as it depends on the application code: the buckets are defined there and can easily change, silently breaking the corresponding slo-libsonnet usage.
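For illustration, an SLO ratio over a Prometheus histogram currently looks something like this (the metric name and the 300ms target are just examples):

```
# Fraction of requests served within 300ms over 30 days. This only
# works if the histogram was instrumented with a le="0.3" bucket.
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
  sum(rate(http_request_duration_seconds_count[30d]))
```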

I don't know if it's possible at all, but I think we should try to figure out whether there is a better way to calculate the same result, or something close to it, without requiring knowledge of the exact bucket in advance.

@metalmatze

cc @beorn7 (maybe you have an idea and it's entirely obvious and we just missed it here :) )


beorn7 commented Jan 6, 2021

Currently, the cleanest way is to encode SLO targets into the instrumented service binary itself and let it expose request counters for "SLO met" or "SLO missed".
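A sketch of what the query side could then look like, with entirely hypothetical metric and label names:

```
# The binary itself decides whether each request met the latency SLO
# and increments a counter accordingly, e.g.
#   http_requests_slo_total{slo="latency-300ms", met="true"}
# The querying side then needs no knowledge of histogram buckets at all:
  sum(rate(http_requests_slo_total{slo="latency-300ms", met="true"}[30d]))
/
  sum(rate(http_requests_slo_total{slo="latency-300ms"}[30d]))
```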

Of course, that has the ugliness of encoding SLOs somewhere you might not want to see them. (In some setups, I can actually imagine SLOs being seen as "owned" by the binary. But in others, you might want to apply them from the outside, perhaps even retroactively, e.g. "If your SLO had been this and that, would we have met it over the last quarter?".)

The current solution with Prometheus histograms sits somewhere in between: you have to know at least the interesting thresholds when instrumenting the binary, while you retain some limited freedom when applying SLO rules from the outside.

The expensive way would be a high-resolution histogram, which gives you many exact thresholds and would also allow interpolation with a reasonably small error margin.
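For example, `histogram_quantile` already interpolates linearly within a bucket, so with sufficiently fine buckets the inverse check against an arbitrary target becomes reasonably accurate (metric name again just an example):

```
# Interpolated check: does the 99th percentile stay below a 30s target
# that is not itself a bucket boundary? The error margin is bounded by
# the width of the bucket the quantile falls into.
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[30d]))
) < 30
```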

The histogram plans I'm working on documenting right now (I've set myself a deadline of the end of this month) will make high-resolution histograms much more feasible, so that you could go for that interpolation approach.

metalmatze commented

Yes! I'm quite excited about your experiments!
Indeed, for binaries one doesn't own, it's quite tricky.

In the case of the Kubernetes APIServer, I actually went with the le=40 bucket even though SIG Scalability has a 99% success target for requests within 30s. 🤷‍♂️ I guess in that case it was still close enough, but I have definitely seen cases where it isn't as close...
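Roughly this, assuming the current apiserver metric name (the exact name may differ between Kubernetes versions):

```
# "99% of requests within 30s", approximated with the nearest existing
# bucket boundary le="40", which somewhat overstates compliance.
(
    sum(rate(apiserver_request_duration_seconds_bucket{le="40"}[30d]))
  /
    sum(rate(apiserver_request_duration_seconds_count[30d]))
) > bool 0.99
```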
