
Investigate whether exact bucket is necessary #46

Open
brancz opened this issue Jan 6, 2021 · 2 comments

brancz commented Jan 6, 2021

Follow-up of #44 (comment)

Today, a latency target must exist as an exact histogram bucket. This is both difficult to discover as a user and brittle, as it depends on the application code: the buckets are defined there and can easily change, silently breaking the corresponding slo-libsonnet usage.
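For illustration, an SLO ratio over a Prometheus histogram currently looks something like this (the metric name and the 300ms target are just examples):

```
# Fraction of requests served within 300ms over 30 days. This only
# works if the histogram was instrumented with a le="0.3" bucket.
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[30d]))
/
  sum(rate(http_request_duration_seconds_count[30d]))
```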

I don't know if it's possible at all, but I think we should try to figure out whether there is a better way to calculate the same result, or something close to it, without requiring knowledge of the exact bucket in advance.

@metalmatze

cc @beorn7 (maybe you have an idea and it's entirely obvious and we just missed it here :) )


beorn7 commented Jan 6, 2021

Currently, the cleanest way is to encode SLO targets into the instrumented service binary itself and let it expose request counters for "SLO met" or "SLO missed".
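A sketch of what the query side could then look like, with entirely hypothetical metric and label names:

```
# The binary itself decides whether each request met the latency SLO
# and increments a counter accordingly, e.g.
#   http_requests_slo_total{slo="latency-300ms", met="true"}
# The querying side then needs no knowledge of histogram buckets at all:
  sum(rate(http_requests_slo_total{slo="latency-300ms", met="true"}[30d]))
/
  sum(rate(http_requests_slo_total{slo="latency-300ms"}[30d]))
```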

Of course, that has the ugliness of encoding SLOs somewhere you might not want to see them. (In some setups, I can actually imagine SLOs being seen as "owned" by the binary. But in others, you might want to apply them from the outside, perhaps even retroactively, e.g. "If your SLO had been this and that, would we have met it over the last quarter?".)

The current solution with Prometheus histograms sits somewhere in between: you have to know at least the interesting thresholds when instrumenting the binary, while you retain some limited freedom when applying SLO rules from the outside.

The expensive way would be a high-resolution histogram, which gives you many exact thresholds and would also allow interpolation with a reasonably small error margin.
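For example, `histogram_quantile` already interpolates linearly within a bucket, so with sufficiently fine buckets the inverse check against an arbitrary target becomes reasonably accurate (metric name again just an example):

```
# Interpolated check: does the 99th percentile stay below a 30s target
# that is not itself a bucket boundary? The error margin is bounded by
# the width of the bucket the quantile falls into.
histogram_quantile(
  0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[30d]))
) < 30
```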

The histogram plans I'm working on documenting right now (I've set myself a deadline of the end of this month) will make high-resolution histograms much more feasible, so that you could go for that interpolation approach.

metalmatze commented

Yes! I'm quite excited about your experiments!
Indeed, for binaries one doesn't own, it's quite tricky.

In the case of the Kubernetes APIServer, I actually went with the le=40 bucket even though SIG Scalability has a 99% success target for requests within 30s. 🤷‍♂️ I guess in that case it was still close enough, but I have definitely seen cases where it isn't as close...
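Roughly this, assuming the current apiserver metric name (the exact name may differ between Kubernetes versions):

```
# "99% of requests within 30s", approximated with the nearest existing
# bucket boundary le="40", which somewhat overstates compliance.
(
    sum(rate(apiserver_request_duration_seconds_bucket{le="40"}[30d]))
  /
    sum(rate(apiserver_request_duration_seconds_count[30d]))
) > bool 0.99
```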
