-
-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCR metrics using Datadog for Rebuild and Rebuild-in-place functionality #35496
base: master
Are you sure you want to change the base?
Conversation
… rebuild in place
if a column is added or removed in datasource config, it is not reflected on table until rebuild. Hence any query that runs on the table with updated config before rebuild fails. The new method added loads table directly from database and allows for queries on the existing table
443f34f
to
e37b241
Compare
corehq/apps/userreports/tasks.py
Outdated
if error: | ||
_report_metric_rebuild_error(config, action) | ||
if not error: | ||
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be refactored:
if error:
...
else:
...
It just reads better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in f4bc32e
@@ -386,6 +392,10 @@ def get_indicator_table(indicator_config, metadata, override_table_name=None): | |||
) | |||
|
|||
|
|||
def get_current_indicator_table_from_db(metadata, table_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Forgotten cleanup.
Addressed in a5c51e5
corehq/apps/userreports/tasks.py
Outdated
def _get_rows_count_from_existing_table(adapter): | ||
table = adapter.get_existing_table_from_db() | ||
return adapter.session_helper.Session.query(table).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ambivalent about this suggestion, because I like that your current implementation makes it clear that the value of rows_count_before_rebuild
can be None, and we aren't allowed to use type hints.
But if you move the if
statement inside the function, then you don't have to repeat it twice outside the function.
def _get_rows_count_from_existing_table(adapter): | |
table = adapter.get_existing_table_from_db() | |
return adapter.session_helper.Session.query(table).count() | |
def _get_rows_count_from_existing_table(adapter): | |
if not adapter.table_exists: | |
return None | |
table = adapter.get_existing_table_from_db() | |
return adapter.session_helper.Session.query(table).count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was ambivalent too on this and ended up keeping the check outside considering this is a private function (limited usage) and to keep function responsibility simple.
However after rethinking, I like your suggestion. It makes return explicit along with reducing duplication and the private function argument works both ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in f4bc32e
corehq/apps/userreports/tasks.py
Outdated
if source not in ('edit_data_source_rebuild', 'edit_data_source_build_in_place'): | ||
return | ||
try: | ||
_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supernit. Just passing config
would make this function signature consistent with the other two.
_report_metric_number_of_days_since_first_build(config.domain, config.get_id, action) | |
_report_metric_number_of_days_since_first_build(config, action) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in f4bc32e
corehq/apps/userreports/tasks.py
Outdated
if error: | ||
_report_metric_rebuild_error(config, action) | ||
if not error: | ||
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit.
if error: | |
_report_metric_rebuild_error(config, action) | |
if not error: | |
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild) | |
if error: | |
_report_metric_rebuild_error(config, action) | |
else: | |
_report_metric_increase_in_rows_count(config, action, adapter, rows_count_before_rebuild) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in f4bc32e
metrics_counter('commcare.ucr.datasource.change_in_filters', tags={'domain': self.domain}) | ||
if 'configured_indicators' in self.changed_data: | ||
if len(self.instance.configured_indicators) > len(self.initial.get('configured_indicators', [])): | ||
metrics_counter('commcare.ucr.datasource.increase_in_columns', tags={'domain': self.domain}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we definitely not care about rebuilds where the user decreases the number of columns? (Maybe an AE question.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I will check with AE on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Norman, I re-read the intention of this metric (frrom the document made by AE) and you are right in that we do not care about the decrease. It looks the rebuild should majorly be driven when there is need for an extra column.
Just quoting,
A rebuild datasource action should only be triggered if there is a new column
corehq/apps/userreports/util.py
Outdated
def get_configurable_and_static_reports_by_data_source(domain, data_source_id): | ||
reports = get_configurable_and_static_reports(domain) | ||
return [report for report in reports if report.config_id == data_source_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by_data_source
makes it sound like this function will return a dictionary.
I would either change it to for_data_source
or of_data_source
... or I would return a dictionary :)
def get_configurable_and_static_reports_by_data_source(domain, data_source_id): | |
reports = get_configurable_and_static_reports(domain) | |
return [report for report in reports if report.config_id == data_source_id] | |
def get_configurable_and_static_reports_by_data_source(domain): | |
reports_by_data_source = defaultdict(list) | |
for report in get_configurable_and_static_reports(domain): | |
reports_by_data_source[report.config_id].append(report) | |
return reports_by_data_source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 011eb70
metrics_gauge( | ||
f'commcare.ucr.{action}.columns.count', | ||
len(config.get_columns()), | ||
tags={'domain': domain} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I've used the data source's display name in this suggestion, but there is nothing that enforces uniqueness, so there is still the risk that the value will flip for different data sources with the samme display name. But on the other hand, maybe it's more usable than a config ID. 🤷
metrics_gauge( | |
f'commcare.ucr.{action}.columns.count', | |
len(config.get_columns()), | |
tags={'domain': domain} | |
) | |
metrics_gauge( | |
f'commcare.ucr.{action}.columns.count', | |
len(config.get_columns()), | |
tags={ | |
'domain': domain, | |
'data_source': config.display_name, | |
} | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think unless you tag the data source in some way, the value for the domain will change with every data source you send a metric for. I
That is right and a valid concern.
The original requirement was for including datasource_id
. It was my recommendation to not include it considering below factors:
- The aim is to keep the cardinality(variance) to minimum in general to avoid increase in Datadog costs. Note that this is not a restriction though.
- Based on my the requirements, the idea is see the general trend of
how many a columns a datasource has while being rebuilt
instead of a report with datasource and their columns count. Since the ticket is specificaly focused to track the usage of Rebuild and Rebuild in Place functionality, a general trend (e.gaverage
) seems to be a reasonable insight. - For the DataDog visualisation, the intention is to show a average of above metric with an option to segment/filter by domain.
Let me know if this is something that sounds reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That does sound reasonable. Just taking the average is an excellent approach.
metrics_gauge( | ||
f'commcare.ucr.{action}.reports_per_datasource.count', | ||
len(reports), | ||
tags={'domain': domain} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I think you'll need to include the data source in the tag, otherwise the value for the domain will just keep alternating. Datadog will allow you to sum for all data sources in the domaain, or average across data sources. (Going with config ID this time just to be different :) )
metrics_gauge( | |
f'commcare.ucr.{action}.reports_per_datasource.count', | |
len(reports), | |
tags={'domain': domain} | |
) | |
metrics_gauge( | |
f'commcare.ucr.{action}.reports_per_datasource.count', | |
len(reports), | |
tags={ | |
'domain': domain, | |
'data_source': config.get_id, | |
} | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, just a few minor comments.
try: | ||
adapter.rebuild_table(initiated_by=initiated_by, source=source, skip_log=skip_log, diffs=diffs) | ||
_iteratively_build_table(config, limit=limit) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big nit: Is there a specific exception we're trying to catch here, or just generally making sure nothing goes wrong? If it's the former then catching the specific exception here instead of a generic Exception
would make it clearer what could go wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
Here the intention for the metric is to catch if a rebuild fails for any reason. The idea is establish a relationship between a rebuild failure and the number of rows a datasource have. It has been observed a rebuild fails for some datasource with huge no. of records.
On other note, I like the point of having clarity for what went wrong. Since we are not sure of the reason, I am wondering if the reason is being captured today. If not, we should probably log them into sentry. Will look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're not entirely sure what might go wrong, and if there are a few things that could go wrong, then totally fine to leave as is. Could be a good followup if we discover specific errors that we are expecting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you already have a number of reviewers active on this already so I didn't look in depth.
Just overall, excited for having metrics for UCRs.
I see we are adding quite a few so just wanted to flag that we should be clear on the "relevance" of the metrics and there usability. We are adding quite a few queries to fetch the metrics so it would be good to make sure that these are going to be useful and will be used.
No pressure but Is it possible to see how they are appearing on datadog, in the PR description?
@memoized | ||
def get_existing_table_from_db(self): | ||
"""Loads existing table directly from database if one exists""" | ||
if self.table_exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how this works so just checking on the implementation here.
We are checking for the table to exist before we instantiate it.
Is it making the code do more work then needed?
I would think we could directly instantiate and in case of table missing, there would be an exception that we could catch and return nothing. No?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point. It does seem like the check looks for the table form the database so it is an additional hit to a database.
Can confirm that exception is raised here so will update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 4864944
corehq/apps/userreports/tasks.py
Outdated
|
||
|
||
def _get_rows_count_from_existing_table(adapter): | ||
if not adapter.table_exists: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as earlier comment about catching exception instead of checking for existing table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in 4864944
return | ||
# Row count can only be obtained for synchronous rebuilds. | ||
if not config.asynchronous: | ||
rows_count_after_rebuild = adapter.get_query_object().count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: minor but this would "expected" count after rebuild, though that count eventually could be higher or lower than this number here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it would be the actual count as we are fetching the rows count from the database. Also testing confirms this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.
Is this the count from the table or the count from the query the table will perform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we calling this after the rebuild has finished?
It does seem that this is being called at times directly after queuing the rebuild task.
Do you mean the get_query_object().count()
method ? If yes, it is used at multiple places and may be called before the rebuild.
For this code, this particular method is called by _report_ucr_rebuild_metrics
which is used here at the end after rebuild code is finished.
Is this the count from the table or the count from the query the table will perform?
It is the count from the table.
Agreed.
Sure, i have deployed this on staging today. Will add them soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with the changes, but I think the test failures are relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good from my end.
Hey @kaapstorm @zandre-eng , |
Product Description
Adds metrics related to Rebuild and Rebuild-in-place functionality to be captured on DataDog.
No user facing change.
DataDog Dashboard Screenshot
Technical Summary
Ticket.
This is a part of feature analytics wherein intention is to understand the usage of Rebuild and Rebuild-in-place functionality for UCRs.
Some of the metrics are nuanced and required either database access or some minor calculations. Any exceptions raised by them are ignored intentionally to not block the original functionality.
Feature Flag
User Configurable Reports
Safety Assurance
Safety story
Tested on local. Some metrics are simple and ones that involved code for metrics calculation will ignores any exception raised.
To be tested on staging
Automated test coverage
QA Plan
None
Rollback instructions
Labels & Review