Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCR metrics using Datadog for Rebuild and Rebuild-in-place functionality #35496

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

ajeety4
Copy link
Contributor

@ajeety4 ajeety4 commented Dec 10, 2024

Product Description

Adds metrics related to Rebuild and Rebuild-in-place functionality to be captured on DataDog.
No user facing change.

DataDog Dashboard Screenshot
image

Technical Summary

Ticket.
This is a part of feature analytics wherein intention is to understand the usage of Rebuild and Rebuild-in-place functionality for UCRs.

Some of the metrics are nuanced and required either database access or some minor calculations. Any exceptions raised by them are ignored intentionally to not block the original functionality.

Feature Flag

User Configurable Reports

Safety Assurance

Safety story

Tested on local. Some metrics are simple and ones that involved code for metrics calculation will ignores any exception raised.
To be tested on staging

Automated test coverage

QA Plan

None

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@ajeety4 ajeety4 added product/feature-flag Change will only affect users who have a specific feature flag enabled Open for review: do not merge A work in progress product/invisible Change has no end-user visible impact labels Dec 10, 2024
@ajeety4 ajeety4 marked this pull request as ready for review December 11, 2024 08:16
@ajeety4 ajeety4 requested a review from calellowitz as a code owner December 11, 2024 08:16
@ajeety4 ajeety4 requested review from kaapstorm, zandre-eng, Charl1996, mkangia and calellowitz and removed request for calellowitz December 11, 2024 08:16
if a column is added or removed in datasource config, it is not reflected on table until rebuild. Hence any query that runs on the table with updated config before rebuild fails. The new method added loads table directly from database and allows for queries on the existing table
@ajeety4 ajeety4 force-pushed the ay/ucr-metrics-datadog branch from 443f34f to e37b241 Compare December 12, 2024 11:34
Comment on lines 171 to 174
if error:
_report_metric_rebuild_error(config, action)
if not error:
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be refactored:

if error:
...
else:
...

It just reads better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f4bc32e

@@ -386,6 +392,10 @@ def get_indicator_table(indicator_config, metadata, override_table_name=None):
)


def get_current_indicator_table_from_db(metadata, table_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Forgotten cleanup.
Addressed in a5c51e5

Comment on lines 161 to 163
def _get_rows_count_from_existing_table(adapter):
table = adapter.get_existing_table_from_db()
return adapter.session_helper.Session.query(table).count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent about this suggestion, because I like that your current implementation makes it clear that the value of rows_count_before_rebuild can be None, and we aren't allowed to use type hints.

But if you move the if statement inside the function, then you don't have to repeat it twice outside the function.

Suggested change
def _get_rows_count_from_existing_table(adapter):
table = adapter.get_existing_table_from_db()
return adapter.session_helper.Session.query(table).count()
def _get_rows_count_from_existing_table(adapter):
if not adapter.table_exists:
return None
table = adapter.get_existing_table_from_db()
return adapter.session_helper.Session.query(table).count()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was ambivalent too on this and ended up keeping the check outside considering this is a private function (limited usage) and to keep function responsibility simple.
However after rethinking, I like your suggestion. It makes return explicit along with reducing duplication and the private function argument works both ways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f4bc32e

if source not in ('edit_data_source_rebuild', 'edit_data_source_build_in_place'):
return
try:
_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supernit. Just passing config would make this function signature consistent with the other two.

Suggested change
_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action)
_report_metric_number_of_days_since_first_build(config, action)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f4bc32e

Comment on lines 171 to 174
if error:
_report_metric_rebuild_error(config, action)
if not error:
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit.

Suggested change
if error:
_report_metric_rebuild_error(config, action)
if not error:
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)
if error:
_report_metric_rebuild_error(config, action)
else:
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f4bc32e

metrics_counter('commcare.ucr.datasource.change_in_filters', tags={'domain': self.domain})
if 'configured_indicators' in self.changed_data:
if len(self.instance.configured_indicators) > len(self.initial.get('configured_indicators', [])):
metrics_counter('commcare.ucr.datasource.increase_in_columns', tags={'domain': self.domain})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we definitely not care about rebuilds where the user decreases the number of columns? (Maybe an AE question.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I will check with AE on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Norman, I re-read the intention of this metric (frrom the document made by AE) and you are right in that we do not care about the decrease. It looks the rebuild should majorly be driven when there is need for an extra column.

Just quoting,
A rebuild datasource action should only be triggered if there is a new column

Comment on lines 171 to 173
def get_configurable_and_static_reports_by_data_source(domain, data_source_id):
reports = get_configurable_and_static_reports(domain)
return [report for report in reports if report.config_id == data_source_id]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_data_source makes it sound like this function will return a dictionary.

I would either change it to for_data_source or of_data_source ... or I would return a dictionary :)

Suggested change
def get_configurable_and_static_reports_by_data_source(domain, data_source_id):
reports = get_configurable_and_static_reports(domain)
return [report for report in reports if report.config_id == data_source_id]
def get_configurable_and_static_reports_by_data_source(domain):
reports_by_data_source = defaultdict(list)
for report in get_configurable_and_static_reports(domain):
reports_by_data_source[report.config_id].append(report)
return reports_by_data_source

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 011eb70

Comment on lines +1358 to +1362
metrics_gauge(
f'commcare.ucr.{action}.columns.count',
len(config.get_columns()),
tags={'domain': domain}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I've used the data source's display name in this suggestion, but there is nothing that enforces uniqueness, so there is still the risk that the value will flip for different data sources with the samme display name. But on the other hand, maybe it's more usable than a config ID. 🤷

Suggested change
metrics_gauge(
f'commcare.ucr.{action}.columns.count',
len(config.get_columns()),
tags={'domain': domain}
)
metrics_gauge(
f'commcare.ucr.{action}.columns.count',
len(config.get_columns()),
tags={
'domain': domain,
'data_source': config.display_name,
}
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I

That is right and a valid concern.

The original requirement was for including datasource_id. It was my recommendation to not include it considering below factors:

  • The aim is to keep the cardinality(variance) to minimum in general to avoid increase in Datadog costs. Note that this is not a restriction though.
  • Based on my the requirements, the idea is see the general trend of how many a columns a datasource has while being rebuilt instead of a report with datasource and their columns count. Since the ticket is specificaly focused to track the usage of Rebuild and Rebuild in Place functionality, a general trend (e.g average) seems to be a reasonable insight.
  • For the DataDog visualisation, the intention is to show a average of above metric with an option to segment/filter by domain.

Let me know if this is something that sounds reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That does sound reasonable. Just taking the average is an excellent approach.

Comment on lines +1372 to +1376
metrics_gauge(
f'commcare.ucr.{action}.reports_per_datasource.count',
len(reports),
tags={'domain': domain}
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think you'll need to include the data source in the tag, otherwise the value for the domain will just keep alternating. Datadog will allow you to sum for all data sources in the domaain, or average across data sources. (Going with config ID this time just to be different :) )

Suggested change
metrics_gauge(
f'commcare.ucr.{action}.reports_per_datasource.count',
len(reports),
tags={'domain': domain}
)
metrics_gauge(
f'commcare.ucr.{action}.reports_per_datasource.count',
len(reports),
tags={
'domain': domain,
'data_source': config.get_id,
}
)

Copy link
Contributor

@zandre-eng zandre-eng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few minor comments.

corehq/apps/userreports/tasks.py Outdated Show resolved Hide resolved
corehq/apps/userreports/tasks.py Show resolved Hide resolved
try:
adapter.rebuild_table(initiated_by=initiated_by, source=source, skip_log=skip_log, diffs=diffs)
_iteratively_build_table(config, limit=limit)
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big nit: Is there a specific exception we're trying to catch here, or just generally making sure nothing goes wrong? If it's the former then catching the specific exception here instead of a generic Exception would make it clearer what could go wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question.

Here the intention for the metric is to catch if a rebuild fails for any reason. The idea is establish a relationship between a rebuild failure and the number of rows a datasource have. It has been observed a rebuild fails for some datasource with huge no. of records.

On other note, I like the point of having clarity for what went wrong. Since we are not sure of the reason, I am wondering if the reason is being captured today. If not, we should probably log them into sentry. Will look into this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not entirely sure what might go wrong, and if there are a few things that could go wrong, then totally fine to leave as is. Could be a good followup if we discover specific errors that we are expecting.

corehq/apps/userreports/tasks.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mkangia mkangia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you already have a number of reviewers active on this already so I didn't look in depth.
Just overall, excited for having metrics for UCRs.

I see we are adding quite a few so just wanted to flag that we should be clear on the "relevance" of the metrics and there usability. We are adding quite a few queries to fetch the metrics so it would be good to make sure that these are going to be useful and will be used.

No pressure but Is it possible to see how they are appearing on datadog, in the PR description?

@memoized
def get_existing_table_from_db(self):
"""Loads existing table directly from database if one exists"""
if self.table_exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how this works so just checking on the implementation here.
We are checking for the table to exist before we instantiate it.

Is it making the code do more work then needed?

I would think we could directly instantiate and in case of table missing, there would be an exception that we could catch and return nothing. No?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. It does seem like the check looks for the table form the database so it is an additional hit to a database.
Can confirm that exception is raised here so will update this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4864944



def _get_rows_count_from_existing_table(adapter):
if not adapter.table_exists:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as earlier comment about catching exception instead of checking for existing table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 4864944

return
# Row count can only be obtained for synchronous rebuilds.
if not config.asynchronous:
rows_count_after_rebuild = adapter.get_query_object().count()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: minor but this would "expected" count after rebuild, though that count eventually could be higher or lower than this number here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it would be the actual count as we are fetching the rows count from the database. Also testing confirms this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.

Is this the count from the table or the count from the query the table will perform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.

Do you mean the get_query_object().count() method ? If yes, it is used at multiple places and may be called before the rebuild.

For this code, this particular method is called by _report_ucr_rebuild_metrics which is used here at the end after rebuild code is finished.

Is this the count from the table or the count from the query the table will perform?

It is the count from the table.

@ajeety4
Copy link
Contributor Author

ajeety4 commented Dec 13, 2024

I see we are adding quite a few so just wanted to flag that we should be clear on the "relevance" of the metrics and there usability. We are adding quite a few queries to fetch the metrics so it would be good to make sure that these are going to be useful and will be used.

Agreed.
The requirement document (attached in ticket) is quite clear on the intention behind them and how they would use them. (Good work from AE here)

No pressure but Is it possible to see how they are appearing on datadog, in the PR description?

Sure, i have deployed this on staging today. Will add them soon.

Copy link
Contributor

@kaapstorm kaapstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the changes, but I think the test failures are relevant.

Copy link
Contributor

@zandre-eng zandre-eng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from my end.

@ajeety4
Copy link
Contributor Author

ajeety4 commented Dec 13, 2024

Hey @kaapstorm @zandre-eng ,
While building out some charts on Datadog using test data from staging, I realised that for the count of rebuild (and rebuild in place) action, there is not much value in looking at the count just domain wise especially considering the goal for it. As a result, I have included datasource_id just for these two metrics.
Related commit - 50accc3

@ajeety4
Copy link
Contributor Author

ajeety4 commented Dec 13, 2024

Sure, i have deployed this on staging today. Will add them soon.

Hey @mkangia , included the screenshot in the description. The link is here in case you would be interested to have a better look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Open for review: do not merge A work in progress product/feature-flag Change will only affect users who have a specific feature flag enabled product/invisible Change has no end-user visible impact
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants