monitoring functional #26876

Merged
merged 43 commits into from
Mar 19, 2020
Commits
90bdf7d
move datadog init to AppConfig.ready
snopoke Mar 9, 2020
7ad93fa
pluggable metrics
snopoke Mar 10, 2020
0694926
add histogram and tests
snopoke Mar 10, 2020
c1f8270
bucket values less than or equal
snopoke Mar 10, 2020
5034156
add prometheus client to requirements
snopoke Mar 10, 2020
172ef37
Merge branch 'master' into sk/monitoring
snopoke Mar 11, 2020
2be4e6a
make metrics lazy
snopoke Mar 11, 2020
eb8afe3
example histogram
snopoke Mar 11, 2020
54e91c0
Merge branch 'master' into sk/monitoring
snopoke Mar 11, 2020
54f2d1b
convert sumbission metrics
snopoke Mar 11, 2020
eba69c4
docstrings
snopoke Mar 11, 2020
01fbc07
stickler
snopoke Mar 11, 2020
450f211
keep tag_values as dict instead of splitting and re-combining
snopoke Mar 12, 2020
2c3edef
update links
snopoke Mar 12, 2020
8664161
remove unnecessary list
snopoke Mar 12, 2020
0bcedea
replace typle() with ()
snopoke Mar 12, 2020
9ba3d4a
Merge branch 'sk/monitoring' of github.com:dimagi/commcare-hq into sk…
snopoke Mar 12, 2020
904e358
fix tags
snopoke Mar 12, 2020
d68e104
pass other args
snopoke Mar 12, 2020
5c00052
revert change to datadog bucketing boundry
snopoke Mar 12, 2020
e00b7e1
remove unnecessary list
snopoke Mar 16, 2020
20e8a61
apply tags at the same time as recording the metric
snopoke Mar 16, 2020
0ab0006
dummy metric
snopoke Mar 16, 2020
02b40ee
functional interface
snopoke Mar 16, 2020
31cfba9
re-do configuration via settings
snopoke Mar 17, 2020
892e57b
move initialization into provider
snopoke Mar 17, 2020
dbdbd3f
replace datadog_gauge
snopoke Mar 17, 2020
5897eb9
instantiate provider
snopoke Mar 17, 2020
28a8a5b
hook up metrics view
snopoke Mar 17, 2020
abd0355
todo
snopoke Mar 17, 2020
75fa979
lint
snopoke Mar 17, 2020
017ca5a
PR feedback
snopoke Mar 18, 2020
770f4a0
log output from DebugMetrics
snopoke Mar 18, 2020
8bb9e03
move metrics view to hq/admin
snopoke Mar 18, 2020
9a49d26
move metrics_gauge_task to package init
snopoke Mar 18, 2020
74ce07f
fix import
snopoke Mar 18, 2020
f1874e4
add script for running metrics endpoint
snopoke Mar 18, 2020
d4220f0
update docs
snopoke Mar 18, 2020
0457c00
docs
snopoke Mar 18, 2020
3d79912
do setup in __init__
snopoke Mar 18, 2020
d4c3dc1
simplify prometheus server
snopoke Mar 19, 2020
1017c85
doc updates
snopoke Mar 19, 2020
d18075d
Apply suggestions from code review
snopoke Mar 19, 2020
108 changes: 105 additions & 3 deletions corehq/util/metrics/__init__.py
@@ -1,3 +1,102 @@
"""
Metrics collection
******************

.. contents::
   :local:

This package exposes functions and utilities to record metrics in CommCare. These metrics
are exported / exposed to the configured metrics providers. Supported providers are:

* Datadog
* Prometheus

Providers are enabled using the `METRICS_PROVIDERS` setting. Multiple providers can be
enabled concurrently:

::

    METRICS_PROVIDERS = [
        'corehq.util.metrics.prometheus.PrometheusMetrics',
        'corehq.util.metrics.datadog.DatadogMetrics',
    ]

If no metrics providers are configured, CommCare will log all metrics to the `commcare.metrics` logger
at the DEBUG level.

Metric tagging
==============
Metrics may be tagged by passing a dictionary of tag names and values. Tags should be used
to add dimensions to a metric, e.g. request type, response status.

Tags should not originate from unbounded sources or sources with high dimensionality such as
timestamps, user IDs, request IDs etc. Ideally a tag should not have more than 10 possible values.

Read more about tagging:

* https://prometheus.io/docs/practices/naming/#labels
* https://docs.datadoghq.com/tagging/

Metric Types
============

Counter metric
''''''''''''''

A counter is a cumulative metric that represents a single monotonically increasing counter
whose value can only increase or be reset to zero on restart. For example, you can use a
counter to represent the number of requests served, tasks completed, or errors.

Do not use a counter to expose a value that can decrease. For example, do not use a counter
for the number of currently running processes; instead use a gauge.

::

    metrics_counter('commcare.case_import.count', 1, tags={'domain': domain})


Gauge metric
''''''''''''

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Gauges are typically used for measured values like temperatures or current memory usage,
but also "counts" that can go up and down, like the number of concurrent requests.

::

    metrics_gauge('commcare.case_import.queue_length', queue_length)

For regular reporting of a gauge metric there is the `metrics_gauge_task` function:

.. autofunction:: corehq.util.metrics.metrics_gauge_task

Histogram metric
''''''''''''''''

A histogram samples observations (usually things like request durations or response sizes)
and counts them in configurable buckets.

::

    metrics_histogram(
        'commcare.case_import.duration', timer_duration,
        bucket_tag='size', buckets=[10, 50, 200, 1000], bucket_unit='s',
        tags={'domain': domain}
    )

Histograms are recorded differently in the different providers.

.. automethod:: corehq.util.metrics.datadog.DatadogMetrics._histogram

.. automethod:: corehq.util.metrics.prometheus.PrometheusMetrics._histogram


Other Notes
===========

* All metrics must use the prefix 'commcare.'
"""
Contributor

This is great, thanks for adding!
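
The module docstring above states that, when no providers are configured, metrics are logged to the `commcare.metrics` logger at the DEBUG level. As a minimal sketch using only the standard library, the following shows one way those records might be made visible during local development. The handler and format string are assumptions, and the exact message text emitted by the debug provider is not shown in this PR.

```python
import logging

# Logger name taken from the module docstring; everything else is an assumption.
metrics_logger = logging.getLogger("commcare.metrics")
metrics_logger.setLevel(logging.DEBUG)

# Send DEBUG records to stderr with a simple, illustrative format.
handler = logging.StreamHandler()
handler.setLevel(logging.DEBUG)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
metrics_logger.addHandler(handler)
```

In a Django project the same effect would more commonly be achieved through the `LOGGING` setting; the imperative form is used here only to keep the sketch self-contained.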

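The tagging guidance in the module docstring above asks for bounded tag values, ideally no more than about 10 per tag. As a hedged illustration, the sketch below collapses an unbounded value (an HTTP status code) into a small fixed set of tag values before it is used as a tag. The helper name `status_class_tag` is hypothetical and is not part of this PR.

```python
def status_class_tag(status_code: int) -> str:
    """Collapse an HTTP status code into a bounded tag value.

    Hypothetical helper for illustration only; it is not part of
    corehq.util.metrics.
    """
    if 100 <= status_code <= 599:
        # Yields one of '1xx', '2xx', '3xx', '4xx', '5xx'.
        return f"{status_code // 100}xx"
    # Anything outside the standard range shares a single catch-all value.
    return "other"


# At most six distinct values can ever appear for this tag.
tags = {"status": status_class_tag(404)}
```

A dictionary like `tags` could then be passed as the `tags` argument of `metrics_counter`, in the same way as the docstring examples.
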
from functools import wraps
from typing import Iterable

@@ -55,13 +154,16 @@ def metrics_histogram(

def metrics_gauge_task(name, fn, run_every):
    """
    helper for easily registering gauges to run periodically
    Helper for easily registering gauges to run periodically

    To update a gauge on a schedule based on the result of a function
    just add to your app's tasks.py:

        my_calculation = metrics_gauge_task('commcare.my.metric', my_calculation_function,
                                            run_every=crontab(minute=0))
    ::

        my_calculation = metrics_gauge_task(
            'commcare.my.metric', my_calculation_function, run_every=crontab(minute=0)
        )

    """
    _enforce_prefix(name, 'commcare')
19 changes: 16 additions & 3 deletions corehq/util/metrics/datadog.py
@@ -15,6 +15,13 @@


class DatadogMetrics(HqMetrics):
    """Datadog Metrics Provider

    Settings:
    * DATADOG_API_KEY
    * DATADOG_APP_KEY
    """

    def initialize(self):
        if not settings.DATADOG_API_KEY or not settings.DATADOG_APP_KEY:
            raise Exception(
@@ -38,10 +45,13 @@ def initialize(self):
pass

    def _counter(self, name: str, value: float, tags: dict = None, documentation: str = ''):
        """Although this is submitted as a COUNT the Datadog app represents these as a RATE.
        See https://docs.datadoghq.com/developers/metrics/types/?tab=rate#definition"""
        dd_tags = _format_tags(tags)
        _datadog_record(statsd.increment, name, value, dd_tags)

    def _gauge(self, name: str, value: float, tags: dict = None, documentation: str = ''):
        """See https://docs.datadoghq.com/developers/metrics/types/?tab=gauge#definition"""
        dd_tags = _format_tags(tags)
        _datadog_record(statsd.gauge, name, value, dd_tags)

@@ -57,16 +67,19 @@ def _histogram(self, name: str, value: float,

        For example:

        ::

            h = metrics_histogram(
                'commcare.request.duration', 1.4,
                bucket_tag='duration', buckets=[1,2,3], bucket_unit='ms',
                tags=tags
            )

            # resulting Datadog metric
            # commcare.request.duration:1|c|#duration:lt_2ms
            # resulting metrics
            # commcare.request.duration:1|c|#duration:lt_2ms

        For more explanation about why this implementation was chosen see:

        For more details see:
        * https://github.com/dimagi/commcare-hq/pull/17080
        * https://github.com/dimagi/commcare-hq/pull/17030#issuecomment-315794700
        """
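
The Datadog docstring above shows a value of 1.4 with buckets `[1,2,3]` and unit `ms` being recorded under the tag `duration:lt_2ms`. As a minimal sketch of how such a tag value might be derived, the function below picks the first bucket boundary that exceeds the value. The function name, the strict `<` comparison, and the `over_` label for values beyond the last boundary are all assumptions made for illustration; the PR's actual helper is not shown in this diff.

```python
from typing import List


def bucket_tag_value(value: float, buckets: List[int], unit: str = "") -> str:
    """Return a tag value naming the first bucket boundary above `value`.

    Illustrative sketch only: the strict comparison and the `over_` label
    are assumptions, not behavior confirmed by the diff.
    """
    ordered = sorted(buckets)
    for boundary in ordered:
        if value < boundary:
            return f"lt_{boundary}{unit}"
    # The value is at or above every boundary.
    return f"over_{ordered[-1]}{unit}"


tag = bucket_tag_value(1.4, [1, 2, 3], "ms")
```

Under these assumptions, each call increments one counter and the bucket is carried entirely by the tag, which matches the `commcare.request.duration:1|c|#duration:lt_2ms` line in the docstring.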
34 changes: 33 additions & 1 deletion corehq/util/metrics/prometheus.py
@@ -8,17 +8,49 @@


class PrometheusMetrics(HqMetrics):
    def __init__(self):
    """Prometheus Metrics Provider"""

    def initialize(self):
        self._metrics = {}

    def _counter(self, name: str, value: float = 1, tags: dict = None, documentation: str = ''):
        """See https://prometheus.io/docs/concepts/metric_types/#counter"""
        self._get_metric(PCounter, name, tags, documentation).inc(value)

    def _gauge(self, name: str, value: float, tags: dict = None, documentation: str = ''):
        """See https://prometheus.io/docs/concepts/metric_types/#gauge"""
        self._get_metric(PGauge, name, tags, documentation).set(value)

    def _histogram(self, name: str, value: float, bucket_tag: str, buckets: List[int], bucket_unit: str = '',
                   tags: dict = None, documentation: str = ''):
        """
        A cumulative histogram with a base metric name of <basename> exposes multiple time series
Contributor

You missed the one on this line

Suggested change
A cumulative histogram with a base metric name of <basename> exposes multiple time series
A cumulative histogram with a base metric name of <name> exposes multiple time series

Contributor Author

Thanks: #26916

        during a scrape:

        * cumulative counters for the observation buckets, exposed as
          `<basename>_bucket{le="<upper inclusive bound>"}`
        * the total sum of all observed values, exposed as `<basename>_sum`
        * the count of events that have been observed, exposed as `<basename>_count`
          (identical to `<basename>_bucket{le="+Inf"}` above)

        For example
        ::

            h = metrics_histogram(
                'commcare.request_duration', 1.4,
                bucket_tag='duration', buckets=[1,2,3], bucket_unit='ms',
                tags=tags
            )

            # resulting metrics
            # commcare_request_duration_bucket{...tags..., le="1.0"} 0.0
            # commcare_request_duration_bucket{...tags..., le="2.0"} 1.0
            # commcare_request_duration_bucket{...tags..., le="3.0"} 1.0
            # commcare_request_duration_bucket{...tags..., le="+Inf"} 1.0
            # commcare_request_duration_sum{...tags...} 1.4
            # commcare_request_duration_count{...tags...} 1.0

        See https://prometheus.io/docs/concepts/metric_types/#histogram"""
        self._get_metric(PHistogram, name, tags, documentation, buckets=buckets).observe(value)

    def _get_metric(self, metric_type, name, tags, documentation, **kwargs):
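
The Prometheus histogram docstring above describes cumulative buckets with inclusive upper bounds, plus a sum and a count. As a hedged, standard-library-only sketch of that bookkeeping (it does not use `prometheus_client`, and the function name `cumulative_histogram` is hypothetical), the following reproduces the numbers from the docstring example for a single observation of 1.4 with buckets `[1,2,3]`.

```python
import math
from typing import Dict, List, Tuple


def cumulative_histogram(
    observations: List[float], buckets: List[float]
) -> Tuple[Dict[float, int], float, int]:
    """Compute cumulative bucket counts, the sum, and the count of observations.

    Illustrative only; a real Prometheus client maintains these values
    incrementally rather than recomputing them from stored observations.
    """
    # An implicit +Inf bucket always follows the configured boundaries.
    bounds = sorted(buckets) + [math.inf]
    # Each bucket counts every observation less than or equal to its bound.
    bucket_counts = {
        bound: sum(1 for value in observations if value <= bound)
        for bound in bounds
    }
    return bucket_counts, sum(observations), len(observations)


counts, total, count = cumulative_histogram([1.4], [1, 2, 3])
```

Because the buckets are cumulative, the `+Inf` bucket always equals the overall count, which is the identity the docstring notes between `<basename>_count` and `<basename>_bucket{le="+Inf"}`.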
1 change: 1 addition & 0 deletions docs/index.rst
@@ -48,6 +48,7 @@ Welcome to CommCareHQ's documentation!
openmrs
js-guide/README
databases
metrics

Tips for documenting
--------------------
1 change: 1 addition & 0 deletions docs/metrics.rst
@@ -0,0 +1 @@
.. automodule:: corehq.util.metrics