UCR metrics using Datadog for Rebuild and Rebuild-in-place functionality #35496

ajeety4 · 2024-12-10T15:48:26Z

Product Description

Adds metrics related to Rebuild and Rebuild-in-place functionality to be captured on DataDog.
No user facing change.

DataDog Dashboard Screenshot

Technical Summary

Ticket.
This is a part of feature analytics wherein intention is to understand the usage of Rebuild and Rebuild-in-place functionality for UCRs.

Some of the metrics are nuanced and required either database access or some minor calculations. Any exceptions raised by them are ignored intentionally to not block the original functionality.

Feature Flag

User Configurable Reports

Safety Assurance

Safety story

Tested on local. Some metrics are simple and ones that involved code for metrics calculation will ignores any exception raised.
To be tested on staging

Automated test coverage

QA Plan

None

Rollback instructions

This PR can be reverted after deploy with no further considerations

Labels & Review

Risk label is set correctly
The set of people pinged as reviewers is appropriate for the level of risk of the change

… rebuild in place

if a column is added or removed in datasource config, it is not reflected on table until rebuild. Hence any query that runs on the table with updated config before rebuild fails. The new method added loads table directly from database and allows for queries on the existing table

Charl1996 · 2024-12-12T12:44:39Z

corehq/apps/userreports/tasks.py

+        if error:
+            _report_metric_rebuild_error(config, action)
+        if not error:
+            _report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)


Can be refactored:

if error:
...
else:
...

It just reads better.

Addressed in f4bc32e

kaapstorm · 2024-12-12T12:28:53Z

corehq/apps/userreports/sql/adapter.py

@@ -386,6 +392,10 @@ def get_indicator_table(indicator_config, metadata, override_table_name=None):
    )


+def get_current_indicator_table_from_db(metadata, table_name):


Is this function used anywhere?

Good catch. Forgotten cleanup.
Addressed in a5c51e5

kaapstorm · 2024-12-12T12:32:42Z

corehq/apps/userreports/tasks.py

+def _get_rows_count_from_existing_table(adapter):
+    table = adapter.get_existing_table_from_db()
+    return adapter.session_helper.Session.query(table).count()


I'm ambivalent about this suggestion, because I like that your current implementation makes it clear that the value of rows_count_before_rebuild can be None, and we aren't allowed to use type hints.

But if you move the if statement inside the function, then you don't have to repeat it twice outside the function.

Suggested change

def _get_rows_count_from_existing_table(adapter):

table = adapter.get_existing_table_from_db()

return adapter.session_helper.Session.query(table).count()

def _get_rows_count_from_existing_table(adapter):

if not adapter.table_exists:

return None

table = adapter.get_existing_table_from_db()

return adapter.session_helper.Session.query(table).count()

I was ambivalent too on this and ended up keeping the check outside considering this is a private function (limited usage) and to keep function responsibility simple.
However after rethinking, I like your suggestion. It makes return explicit along with reducing duplication and the private function argument works both ways.

Addressed in f4bc32e

kaapstorm · 2024-12-12T12:35:56Z

corehq/apps/userreports/tasks.py

+    if source not in ('edit_data_source_rebuild', 'edit_data_source_build_in_place'):
+        return
+    try:
+        _report_metric_number_of_days_since_first_build(config.domain, config.get_id, action)


Supernit. Just passing config would make this function signature consistent with the other two.

Suggested change

_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action)

_report_metric_number_of_days_since_first_build(config, action)

Addressed in f4bc32e

kaapstorm · 2024-12-12T12:36:15Z

corehq/apps/userreports/tasks.py

+        if error:
+            _report_metric_rebuild_error(config, action)
+        if not error:
+            _report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)


Nit.

Suggested change

if error:

_report_metric_rebuild_error(config, action)

if not error:

_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)

if error:

_report_metric_rebuild_error(config, action)

else:

_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)

Addressed in f4bc32e

kaapstorm · 2024-12-12T12:37:44Z

corehq/apps/userreports/ui/forms.py

+            metrics_counter('commcare.ucr.datasource.change_in_filters', tags={'domain': self.domain})
+        if 'configured_indicators' in self.changed_data:
+            if len(self.instance.configured_indicators) > len(self.initial.get('configured_indicators', [])):
+                metrics_counter('commcare.ucr.datasource.increase_in_columns', tags={'domain': self.domain})


Do we definitely not care about rebuilds where the user decreases the number of columns? (Maybe an AE question.)

Good question. I will check with AE on this.

Hey Norman, I re-read the intention of this metric (frrom the document made by AE) and you are right in that we do not care about the decrease. It looks the rebuild should majorly be driven when there is need for an extra column.

Just quoting,
A rebuild datasource action should only be triggered if there is a new column

kaapstorm · 2024-12-12T12:41:58Z

corehq/apps/userreports/util.py

+def get_configurable_and_static_reports_by_data_source(domain, data_source_id):
+    reports = get_configurable_and_static_reports(domain)
+    return [report for report in reports if report.config_id == data_source_id]


by_data_source makes it sound like this function will return a dictionary.

I would either change it to for_data_source or of_data_source ... or I would return a dictionary :)

Suggested change

def get_configurable_and_static_reports_by_data_source(domain, data_source_id):

reports = get_configurable_and_static_reports(domain)

return [report for report in reports if report.config_id == data_source_id]

def get_configurable_and_static_reports_by_data_source(domain):

reports_by_data_source = defaultdict(list)

for report in get_configurable_and_static_reports(domain):

reports_by_data_source[report.config_id].append(report)

return reports_by_data_source

Addressed in 011eb70

kaapstorm · 2024-12-12T12:56:54Z

corehq/apps/userreports/views.py

+    metrics_gauge(
+        f'commcare.ucr.{action}.columns.count',
+        len(config.get_columns()),
+        tags={'domain': domain}
+    )


I think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I've used the data source's display name in this suggestion, but there is nothing that enforces uniqueness, so there is still the risk that the value will flip for different data sources with the samme display name. But on the other hand, maybe it's more usable than a config ID. 🤷

Suggested change

metrics_gauge(

f'commcare.ucr.{action}.columns.count',

len(config.get_columns()),

tags={'domain': domain}

)

metrics_gauge(

f'commcare.ucr.{action}.columns.count',

len(config.get_columns()),

tags={

'domain': domain,

'data_source': config.display_name,

}

)

think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I

That is right and a valid concern.

The original requirement was for including datasource_id. It was my recommendation to not include it considering below factors:

The aim is to keep the cardinality(variance) to minimum in general to avoid increase in Datadog costs. Note that this is not a restriction though.

Based on my the requirements, the idea is see the general trend of how many a columns a datasource has while being rebuilt instead of a report with datasource and their columns count. Since the ticket is specificaly focused to track the usage of Rebuild and Rebuild in Place functionality, a general trend (e.g average) seems to be a reasonable insight.

For the DataDog visualisation, the intention is to show a average of above metric with an option to segment/filter by domain.

Let me know if this is something that sounds reasonable.

Thanks. That does sound reasonable. Just taking the average is an excellent approach.

kaapstorm · 2024-12-12T13:01:52Z

corehq/apps/userreports/views.py

+        metrics_gauge(
+            f'commcare.ucr.{action}.reports_per_datasource.count',
+            len(reports),
+            tags={'domain': domain}
+        )


Again, I think you'll need to include the data source in the tag, otherwise the value for the domain will just keep alternating. Datadog will allow you to sum for all data sources in the domaain, or average across data sources. (Going with config ID this time just to be different :) )

Suggested change

metrics_gauge(

f'commcare.ucr.{action}.reports_per_datasource.count',

len(reports),

tags={'domain': domain}

)

metrics_gauge(

f'commcare.ucr.{action}.reports_per_datasource.count',

len(reports),

tags={

'domain': domain,

'data_source': config.get_id,

}

)

zandre-eng

Looks good overall, just a few minor comments.

corehq/apps/userreports/tasks.py

zandre-eng · 2024-12-12T13:19:59Z

corehq/apps/userreports/tasks.py

+        try:
+            adapter.rebuild_table(initiated_by=initiated_by, source=source, skip_log=skip_log, diffs=diffs)
+            _iteratively_build_table(config, limit=limit)
+        except Exception:


Big nit: Is there a specific exception we're trying to catch here, or just generally making sure nothing goes wrong? If it's the former then catching the specific exception here instead of a generic Exception would make it clearer what could go wrong.

Good question.

Here the intention for the metric is to catch if a rebuild fails for any reason. The idea is establish a relationship between a rebuild failure and the number of rows a datasource have. It has been observed a rebuild fails for some datasource with huge no. of records.

On other note, I like the point of having clarity for what went wrong. Since we are not sure of the reason, I am wondering if the reason is being captured today. If not, we should probably log them into sentry. Will look into this.

If we're not entirely sure what might go wrong, and if there are a few things that could go wrong, then totally fine to leave as is. Could be a good followup if we discover specific errors that we are expecting.

corehq/apps/userreports/tasks.py

mkangia

I see you already have a number of reviewers active on this already so I didn't look in depth.
Just overall, excited for having metrics for UCRs.

I see we are adding quite a few so just wanted to flag that we should be clear on the "relevance" of the metrics and there usability. We are adding quite a few queries to fetch the metrics so it would be good to make sure that these are going to be useful and will be used.

No pressure but Is it possible to see how they are appearing on datadog, in the PR description?

mkangia · 2024-12-13T07:55:20Z

corehq/apps/userreports/sql/adapter.py

+    @memoized
+    def get_existing_table_from_db(self):
+        """Loads existing table directly from database if one exists"""
+        if self.table_exists:


I don't know how this works so just checking on the implementation here.
We are checking for the table to exist before we instantiate it.

Is it making the code do more work then needed?

I would think we could directly instantiate and in case of table missing, there would be an exception that we could catch and return nothing. No?

This is a good point. It does seem like the check looks for the table form the database so it is an additional hit to a database.
Can confirm that exception is raised here so will update this.

Addressed in 4864944

mkangia · 2024-12-13T07:57:28Z

corehq/apps/userreports/tasks.py

+
+
+def _get_rows_count_from_existing_table(adapter):
+    if not adapter.table_exists:


same as earlier comment about catching exception instead of checking for existing table.

addressed in 4864944

mkangia · 2024-12-13T08:00:51Z

corehq/apps/userreports/tasks.py

+        return
+    # Row count can only be obtained for synchronous rebuilds.
+    if not config.asynchronous:
+        rows_count_after_rebuild = adapter.get_query_object().count()


nit: minor but this would "expected" count after rebuild, though that count eventually could be higher or lower than this number here.

My understanding is that it would be the actual count as we are fetching the rows count from the database. Also testing confirms this.

Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.

Is this the count from the table or the count from the query the table will perform?

Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.

Do you mean the get_query_object().count() method ? If yes, it is used at multiple places and may be called before the rebuild.

For this code, this particular method is called by _report_ucr_rebuild_metrics which is used here at the end after rebuild code is finished.

Is this the count from the table or the count from the query the table will perform?

It is the count from the table.

ajeety4 · 2024-12-13T09:13:08Z

I see we are adding quite a few so just wanted to flag that we should be clear on the "relevance" of the metrics and there usability. We are adding quite a few queries to fetch the metrics so it would be good to make sure that these are going to be useful and will be used.

Agreed.
The requirement document (attached in ticket) is quite clear on the intention behind them and how they would use them. (Good work from AE here)

No pressure but Is it possible to see how they are appearing on datadog, in the PR description?

Sure, i have deployed this on staging today. Will add them soon.

kaapstorm

I'm happy with the changes, but I think the test failures are relevant.

zandre-eng

Looks good from my end.

ajeety4 · 2024-12-13T12:40:30Z

Hey @kaapstorm @zandre-eng ,
While building out some charts on Datadog using test data from staging, I realised that for the count of rebuild (and rebuild in place) action, there is not much value in looking at the count just domain wise especially considering the goal for it. As a result, I have included datasource_id just for these two metrics.
Related commit - 50accc3

ajeety4 · 2024-12-13T15:24:29Z

Sure, i have deployed this on staging today. Will add them soon.

Hey @mkangia , included the screenshot in the description. The link is here in case you would be interested to have a better look.

ajeety4 added 11 commits December 10, 2024 21:05

count metric for datasource rebuild and rebuild in place

e04c02b

columns count metric for datasource rebuild and rebuild in place

3f435ae

filters change and colulmns increase metrics for datasource

582775c

util method to get reports by datasource ID

484777b

reports per datasource on rebuild and rebuild in place

6b6331a

nit: spelling

928e915

nit: isort

66b3dc0

add missing source for rebuild datasource

a2ea739

metric: days difference betwen first and last rebuild for rebuild and…

4cc390c

… rebuild in place

metric: error for rebuild and rebuild in place

52b27d8

metric: increase in rows count after rebuild

42cd8b8

ajeety4 added product/feature-flag Change will only affect users who have a specific feature flag enabled Open for review: do not merge A work in progress product/invisible Change has no end-user visible impact labels Dec 10, 2024

ajeety4 marked this pull request as ready for review December 11, 2024 08:16

ajeety4 requested a review from calellowitz as a code owner December 11, 2024 08:16

ajeety4 requested review from kaapstorm, zandre-eng, Charl1996, mkangia and calellowitz and removed request for calellowitz December 11, 2024 08:16

ajeety4 force-pushed the ay/ucr-metrics-datadog branch from 443f34f to e37b241 Compare December 12, 2024 11:34

Charl1996 reviewed Dec 12, 2024

View reviewed changes

kaapstorm reviewed Dec 12, 2024

View reviewed changes

zandre-eng reviewed Dec 12, 2024

View reviewed changes

ajeety4 added 3 commits December 12, 2024 19:53

nit: cleanup

a5c51e5

nit: function rename

011eb70

nit: minor refactors

f4bc32e

mkangia reviewed Dec 13, 2024

View reviewed changes

use exception instead of check to avoid db hit

4864944

kaapstorm approved these changes Dec 13, 2024

View reviewed changes

zandre-eng approved these changes Dec 13, 2024

View reviewed changes

ajeety4 added 2 commits December 13, 2024 16:56

fiexes the boolean check for table

1db397a

use datasource id as tag for rebuild counts

50accc3

kaapstorm approved these changes Dec 13, 2024

View reviewed changes

		@@ -386,6 +392,10 @@ def get_indicator_table(indicator_config, metadata, override_table_name=None):
		)


		def get_current_indicator_table_from_db(metadata, table_name):

	_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action)
	_report_metric_number_of_days_since_first_build(config, action)

-def get_configurable_and_static_reports_by_data_source(domain, data_source_id):
-    reports = get_configurable_and_static_reports(domain)
-    return [report for report in reports if report.config_id == data_source_id]
+def get_configurable_and_static_reports_by_data_source(domain):
+    reports_by_data_source = defaultdict(list)
+    for report in get_configurable_and_static_reports(domain):
+        reports_by_data_source[report.config_id].append(report)
+    return reports_by_data_source



		def _get_rows_count_from_existing_table(adapter):
		if not adapter.table_exists:

UCR metrics using Datadog for Rebuild and Rebuild-in-place functionality #35496

Are you sure you want to change the base?

UCR metrics using Datadog for Rebuild and Rebuild-in-place functionality #35496

Conversation

ajeety4 commented Dec 10, 2024 • edited Loading

Product Description

Technical Summary

Feature Flag

Safety Assurance

Safety story

Automated test coverage

QA Plan

Rollback instructions

Labels & Review

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zandre-eng left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkangia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajeety4 commented Dec 13, 2024

kaapstorm left a comment

Choose a reason for hiding this comment

zandre-eng left a comment

Choose a reason for hiding this comment

ajeety4 commented Dec 13, 2024

ajeety4 commented Dec 13, 2024

ajeety4 commented Dec 10, 2024 •

edited

Loading