-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] prometheus monitoring support #26837
Conversation
corehq/apps/receiverwrapper/views.py
Outdated
|
||
datadog_counter('commcare.xform_submissions.count', tags=tags) | ||
submission_counter.tag(**tags).inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of a metric where we have attached multiple histogram bucket tags to a single metric. If we want to convert these histograms to the native histogram types then they must be separated out as is done here.
return enabled | ||
|
||
|
||
metrics = DelegatedMetrics(SimpleLazyObject(_get_metrics)) # singleton/global |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I had this as a function get_metrics()
but after making it lazy there wasn't a need for the function. This will only call the _get_metrics
function when a metric class is actually used. This is needed make sure that Django is fully set up before determining if the metrics providers are enabled (since they rely on settings).
m = metrics.counter('name', 'description') # still lazy
m.inc() # now it's evaluated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that DelegatedMetrics
is always used even when there is only a single "metrics" object to delegate to? That seems wasteful. Would it be possible to only delegate if there is more than one (I would expect that to be the exceptional case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does mean that. The alternative is to split DelegatedMetrics
into two classes: DelegatedMetrics
and LazyMetrics
and then combine them as needed (the laziness is always needed).
_datadog_record(statsd.gauge, self.name, value, tags) | ||
|
||
|
||
class Histogram(HqHistogram): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As stated in the docstring this isn't a real histogram. It uses our workaround of adding bucketed tags.
"Did you mean to call your metric 'commcare.{}'? ".format(name)) | ||
|
||
|
||
def _validate_tag_names(tag_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required by Prometheus.
self._record(value) | ||
|
||
|
||
DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default in Prometheus
from corehq.util.timer import TimingContext | ||
|
||
PROFILE_PROBABILITY = float(os.getenv('COMMCARE_PROFILE_SUBMISSION_PROBABILITY', 0)) | ||
PROFILE_LIMIT = os.getenv('COMMCARE_PROFILE_SUBMISSION_LIMIT') | ||
PROFILE_LIMIT = int(PROFILE_LIMIT) if PROFILE_LIMIT is not None else 1 | ||
|
||
corrupt_multimedia_counter = metrics.counter( | ||
'commcare.corrupt_multimedia_submissions', 'Count of requests with corrupt multimedia', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is changed
- commcare.corrupt-multimedia-submission.error.count
+ commcare.corrupt_multimedia_submissions
I changed this because .error
seemed redundant since it's implied by the fact that it's 'corrupt'. I dropped the 'count' suffix because Prometheus adds a _total
prefix to the end of counter metric names. If we want to keep the suffix we should add it to all 'counter' metrics at the provider level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How was Prometheus chosen? Is it the clear leader in the open source monitoring space, or are there other options?
About the naming convention, it looks like all of our current metric names violate the Prometheus default since they use dot (.
) as a separator (they also use underscore (_
) as sub-separator), where Prometheus uses underscore only. Do I have that right?
_gauge_class = Gauge | ||
|
||
def enabled(self) -> bool: | ||
return api._api_key and api._application_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't look like it will always return bool
(it will probably return a string or None
). If you want a bool you could do
return bool(api._api_key and api._application_key)
But that doesn't seem very Pythonic. I'd probably drop the type hint.
However, I'm not very familiar with the conventions around type hints. Does -> bool
imply that returned object can be treated like a bool (most objects support that), or that the returned value is always True
or False
? This is one reason I'm not a fan of type hints in Python.
|
||
METRIC_NAME_RE = re.compile(r'^[a-zA-Z_:.][a-zA-Z0-9_:.]*$') | ||
METRIC_TAG_NAME_RE = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$') | ||
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__.*$') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be simplified?
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__.*$') | |
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, this was copied from the Prometheus client.
corehq/util/metrics/prometheus.py
Outdated
return self.__class__(self.name, self.documentation, self.tag_names, self.tag_values, delegate) | ||
|
||
|
||
class Counter(PrometheusMetricBase, HqCounter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, multiple inheritance with diamonds... I don't love this.
I'd rather have simple functional call patterns like our existing datadog_counter(..., tags=...)
(made generic so as not to have "datadog" in the name) rather than changing all of our metrics calls to use a new metrics system with extensive use of inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the inheritance isn't great here. The alternative is a bit more duplicate code but not a lot. I'll see what I can do to make it better if we stay on this course.
If we want to keep it functional then we would need to have a centralized metrics singleton that would always return the same metric class when called (in the case of Pometheus). It would also mean that every call to the metric would need to pass in all required args (which is the case now anyway):
metrics_histogram(value, tags=..., buckets=..., unit=..., dd_bucket_tag=...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've resolved this by removing the need to create new copies of the class when applying tags:
- metric.tag(group='a').inc()
+ metric.inc(group='a')
return enabled | ||
|
||
|
||
metrics = DelegatedMetrics(SimpleLazyObject(_get_metrics)) # singleton/global |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that DelegatedMetrics
is always used even when there is only a single "metrics" object to delegate to? That seems wasteful. Would it be possible to only delegate if there is more than one (I would expect that to be the exceptional case).
We did a basic survey. There a few options but Prometheus appears to be the more common one and support all the use cases we need.
That's correct. I've already taken that into account by replacing |
from django.utils.functional import SimpleLazyObject | ||
|
||
from corehq.util.metrics import DatadogMetrics, PrometheusMetrics | ||
from corehq.util.metrics.metrics import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F401 'corehq.util.metrics.metrics.HqHistogram' imported but unused
Note to reviewers: I'm offline till Monday. I'd like to schedule a call for when I'm back to decide on a course of action for this. If you'd like to weigh in please review by then (9 am EST on 16 March). Points of discussion:
|
This change means we don't have to instantiate a new class when applying the tags which simplifies the class structure a lot.
class HqHistogram(MetricBase): | ||
|
||
def __init__(self, name: str, documentation: str, | ||
bucket_tag: str, buckets: List[int] = DEFAULT_BUCKETS, bucket_unit: str = '', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E127 continuation line over-indented for visual indent
|
||
def __init__(self, name: str, documentation: str, | ||
bucket_tag: str, buckets: List[int] = DEFAULT_BUCKETS, bucket_unit: str = '', | ||
tag_names: Iterable = ()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E127 continuation line over-indented for visual indent
@@ -0,0 +1,59 @@ | |||
import settings | |||
from corehq.util.metrics.metrics import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F401 'corehq.util.metrics.metrics.MetricBase' imported but unused
closing in favour of #26876 |
This isn't done yet but putting up for initial review.
SUMMARY
Implementation of #26816
Initial code structure and classes to allow supporting multiple monitoring backends. Current support for Datadog and Prometheus.
I've converted a few metrics to see how it would work. Further notes in the code.
Missing pieces:
Migration plan: