Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] prometheus monitoring support #26837

Closed
wants to merge 23 commits into from
Closed

[wip] prometheus monitoring support #26837

wants to merge 23 commits into from

Conversation

snopoke
Copy link
Contributor

@snopoke snopoke commented Mar 11, 2020

This isn't done yet but putting up for initial review.

SUMMARY

Implementation of #26816

Initial code structure and classes to allow supporting multiple monitoring backends. Current support for Datadog and Prometheus.

I've converted a few metrics to see how it would work. Further notes in the code.

Missing pieces:

Migration plan:

  • this can be done piecemeal so once the basics are done I'll start with that
  • there will likely be some decisions to make about changes to the metrics

corehq/util/datadog/apps.py Show resolved Hide resolved
corehq/util/metrics/__init__.py Outdated Show resolved Hide resolved
corehq/util/metrics/metrics.py Outdated Show resolved Hide resolved
corehq/util/metrics/metrics.py Outdated Show resolved Hide resolved
corehq/util/metrics/metrics.py Outdated Show resolved Hide resolved

datadog_counter('commcare.xform_submissions.count', tags=tags)
submission_counter.tag(**tags).inc()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of a metric where we have attached multiple histogram bucket tags to a single metric. If we want to convert these histograms to the native histogram types then they must be separated out as is done here.

corehq/util/datadog/utils.py Outdated Show resolved Hide resolved
return enabled


metrics = DelegatedMetrics(SimpleLazyObject(_get_metrics)) # singleton/global
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I had this as a function get_metrics() but after making it lazy there wasn't a need for the function. This will only call the _get_metrics function when a metric class is actually used. This is needed make sure that Django is fully set up before determining if the metrics providers are enabled (since they rely on settings).

m = metrics.counter('name', 'description')  # still lazy
m.inc()  # now it's evaluated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that DelegatedMetrics is always used even when there is only a single "metrics" object to delegate to? That seems wasteful. Would it be possible to only delegate if there is more than one (I would expect that to be the exceptional case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does mean that. The alternative is to split DelegatedMetrics into two classes: DelegatedMetrics and LazyMetrics and then combine them as needed (the laziness is always needed).

_datadog_record(statsd.gauge, self.name, value, tags)


class Histogram(HqHistogram):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated in the docstring this isn't a real histogram. It uses our workaround of adding bucketed tags.

"Did you mean to call your metric 'commcare.{}'? ".format(name))


def _validate_tag_names(tag_names):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required by Prometheus.

self._record(value)


DEFAULT_BUCKETS = (.005, .01, .025, .05, .075, .1, .25, .5, .75, 1.0, 2.5, 5.0, 7.5, 10.0, INF)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default in Prometheus

@snopoke snopoke added the product/invisible Change has no end-user visible impact label Mar 11, 2020
from corehq.util.timer import TimingContext

PROFILE_PROBABILITY = float(os.getenv('COMMCARE_PROFILE_SUBMISSION_PROBABILITY', 0))
PROFILE_LIMIT = os.getenv('COMMCARE_PROFILE_SUBMISSION_LIMIT')
PROFILE_LIMIT = int(PROFILE_LIMIT) if PROFILE_LIMIT is not None else 1

corrupt_multimedia_counter = metrics.counter(
'commcare.corrupt_multimedia_submissions', 'Count of requests with corrupt multimedia',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is changed

- commcare.corrupt-multimedia-submission.error.count
+ commcare.corrupt_multimedia_submissions

I changed this because .error seemed redundant since it's implied by the fact that it's 'corrupt'. I dropped the 'count' suffix because Prometheus adds a _total prefix to the end of counter metric names. If we want to keep the suffix we should add it to all 'counter' metrics at the provider level.

Copy link
Contributor

@millerdev millerdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was Prometheus chosen? Is it the clear leader in the open source monitoring space, or are there other options?

About the naming convention, it looks like all of our current metric names violate the Prometheus default since they use dot (.) as a separator (they also use underscore (_) as sub-separator), where Prometheus uses underscore only. Do I have that right?

_gauge_class = Gauge

def enabled(self) -> bool:
return api._api_key and api._application_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like it will always return bool (it will probably return a string or None). If you want a bool you could do

return bool(api._api_key and api._application_key)

But that doesn't seem very Pythonic. I'd probably drop the type hint.

However, I'm not very familiar with the conventions around type hints. Does -> bool imply that returned object can be treated like a bool (most objects support that), or that the returned value is always True or False? This is one reason I'm not a fan of type hints in Python.


METRIC_NAME_RE = re.compile(r'^[a-zA-Z_:.][a-zA-Z0-9_:.]*$')
METRIC_TAG_NAME_RE = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$')
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__.*$')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be simplified?

Suggested change
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__.*$')
RESERVED_METRIC_TAG_NAME_RE = re.compile(r'^__')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, this was copied from the Prometheus client.

corehq/util/metrics/metrics.py Outdated Show resolved Hide resolved
return self.__class__(self.name, self.documentation, self.tag_names, self.tag_values, delegate)


class Counter(PrometheusMetricBase, HqCounter):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, multiple inheritance with diamonds... I don't love this.

I'd rather have simple functional call patterns like our existing datadog_counter(..., tags=...) (made generic so as not to have "datadog" in the name) rather than changing all of our metrics calls to use a new metrics system with extensive use of inheritance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the inheritance isn't great here. The alternative is a bit more duplicate code but not a lot. I'll see what I can do to make it better if we stay on this course.

If we want to keep it functional then we would need to have a centralized metrics singleton that would always return the same metric class when called (in the case of Pometheus). It would also mean that every call to the metric would need to pass in all required args (which is the case now anyway):

metrics_histogram(value, tags=..., buckets=..., unit=..., dd_bucket_tag=...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've resolved this by removing the need to create new copies of the class when applying tags:

- metric.tag(group='a').inc()
+ metric.inc(group='a')

corehq/util/metrics/datadog.py Outdated Show resolved Hide resolved
corehq/util/datadog/utils.py Outdated Show resolved Hide resolved
return enabled


metrics = DelegatedMetrics(SimpleLazyObject(_get_metrics)) # singleton/global
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that DelegatedMetrics is always used even when there is only a single "metrics" object to delegate to? That seems wasteful. Would it be possible to only delegate if there is more than one (I would expect that to be the exceptional case).

corehq/util/metrics/metrics.py Outdated Show resolved Hide resolved
corehq/util/metrics/tests/utils.py Outdated Show resolved Hide resolved
corehq/util/metrics/tests/utils.py Outdated Show resolved Hide resolved
@snopoke
Copy link
Contributor Author

snopoke commented Mar 12, 2020

How was Prometheus chosen? Is it the clear leader in the open source monitoring space, or are there other options?

We did a basic survey. There a few options but Prometheus appears to be the more common one and support all the use cases we need.

About the naming convention, it looks like all of our current metric names violate the Prometheus default since they use dot (.) as a separator (they also use underscore (_) as sub-separator), where Prometheus uses underscore only. Do I have that right?

That's correct. I've already taken that into account by replacing . with _ for Prometheus so I'm not suggesting we change that. I was meaning more of the semantic guidelines.

from django.utils.functional import SimpleLazyObject

from corehq.util.metrics import DatadogMetrics, PrometheusMetrics
from corehq.util.metrics.metrics import (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'corehq.util.metrics.metrics.HqHistogram' imported but unused

@snopoke
Copy link
Contributor Author

snopoke commented Mar 12, 2020

Note to reviewers: I'm offline till Monday. I'd like to schedule a call for when I'm back to decide on a course of action for this. If you'd like to weigh in please review by then (9 am EST on 16 March).

Points of discussion:

  • Class based vs functional API
    counter = metrics.counter(name, tag_names=['a'])
    counter.tag(a='tag').inc(2)
    
    vs
    metrics_counter(name, 2, tags={'a': 'tag'})
    
  • whether we should try and standardise the metric naming and usage at all

snopoke added 2 commits March 16, 2020 13:40
This change means we don't have to instantiate a new
class when applying the tags which simplifies the class
structure a lot.
class HqHistogram(MetricBase):

def __init__(self, name: str, documentation: str,
bucket_tag: str, buckets: List[int] = DEFAULT_BUCKETS, bucket_unit: str = '',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E127 continuation line over-indented for visual indent


def __init__(self, name: str, documentation: str,
bucket_tag: str, buckets: List[int] = DEFAULT_BUCKETS, bucket_unit: str = '',
tag_names: Iterable = ()):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E127 continuation line over-indented for visual indent

@@ -0,0 +1,59 @@
import settings
from corehq.util.metrics.metrics import (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'corehq.util.metrics.metrics.MetricBase' imported but unused

@snopoke snopoke mentioned this pull request Mar 17, 2020
@snopoke
Copy link
Contributor Author

snopoke commented Mar 17, 2020

closing in favour of #26876

@snopoke snopoke closed this Mar 17, 2020
@snopoke snopoke deleted the sk/monitoring branch March 17, 2020 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
product/invisible Change has no end-user visible impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants