monitoring: tweak search request duration metric to record success / failure #507
base: main
Took a quick look as I'm trying to learn more about our observability features.
Help: "The duration a search request took in seconds",
Buckets: prometheus.DefBuckets, // DefBuckets good for service timings
})
metricSearchDuration = promauto.NewHistogramVec(
I'm curious how Prometheus handles changes to metric types. Maybe because they're both histograms, it's not really a change and now there will just be labels?
In Prometheus' data model, time series are uniquely identified by their name (zoekt_search_duration_seconds) and any attached labels (success). In my interpretation, this means that adding a new label creates a new time series. When we write Grafana dashboards that query on zoekt_search_duration_seconds{success="true"}, the old data points (from zoekt_search_duration_seconds observations that occurred before we added the success label) won't be returned from the Prometheus server.
Does this answer your question?
Yep, thanks for the reference.
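To make the series-identity point from the thread above concrete, here is a minimal toy model in Go, using only the standard library. It is not how Prometheus or client_golang is implemented; the `seriesKey` helper and the `store` type are hypothetical names invented for this sketch. It only illustrates that the same metric name with a different label set is stored under a different key, so a query that filters on the success label never sees the unlabeled samples.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity string from a metric name and its
// label set. Hypothetical helper: it only mimics the idea that name plus
// labels identify a series.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort so the key does not depend on map iteration order

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

// store is a toy in-memory sample store keyed by series identity.
type store map[string][]float64

// observe appends a sample to the series identified by name and labels.
func (s store) observe(name string, labels map[string]string, v float64) {
	k := seriesKey(name, labels)
	s[k] = append(s[k], v)
}

func main() {
	const name = "zoekt_search_duration_seconds"
	s := store{}

	// Observation recorded before the label existed.
	s.observe(name, nil, 0.12)

	// Observations recorded after the success label was added.
	s.observe(name, map[string]string{"success": "true"}, 0.08)
	s.observe(name, map[string]string{"success": "false"}, 1.50)

	fmt.Println(len(s)) // prints 3: three distinct series

	// Selecting success="true" returns only the labeled sample.
	fmt.Println(s[seriesKey(name, map[string]string{"success": "true"})]) // prints [0.08]
}
```

The unlabeled sample stays under its own key, which mirrors the behavior described in the reply: dashboards that filter on the new label will not pick up data recorded before the label was introduced.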
This tweaks the search request duration metric to record whether or not the search itself was successful. This follows the advice from Google SRE's Monitoring Distributed Systems: