
Add Enterprise CommCare Client Version Compliance Report #35468

Open · wants to merge 40 commits into master from jc/commcare_client_version_compliance_report
Conversation

@jingcheng16 (Contributor) commented Dec 3, 2024

Product Description

New Tile
(screenshot)

Report
(screenshot)

API response
(screenshot)

Technical Summary

Ticket: https://dimagi.atlassian.net/browse/SAAS-16236

I'm a bit worried about the performance. When I tested on production, it can take more than 2 minutes to generate the total number for an account like CRS.

Feature Flag

We're doing an iterative release, so no feature flag. Once it is merged, we release it.

Safety Assurance

Safety story

Tested locally, on staging, and also on production.

Automated test coverage

Added a test for the util function I wrote.

QA Plan

QA Ticket: https://dimagi.atlassian.net/browse/QA-7315

Verify the tile works as expected:

  • An enterprise account without any mobile workers should see '--' shown in the tile.

Email the report:

  • A mobile worker who doesn't have any device yet and hasn't submitted any forms will not be included in the report.

  • A mobile worker whose latest submission used the latest CommCare version will not be included in the report.

  • A mobile worker whose latest submission used an older CommCare version (instructions to install an older version) will be included in the report. Then, with the same mobile worker, submit a form through the "Log in as xx mobile worker" feature and verify this mobile worker is no longer in the report, because I assume that feature always uses the latest CommCare version. (Using the same mobile worker verifies that a submission through "Log in as xx mobile worker" replaces the previous submission when calculating the version number.)

  • Can a mobile worker submit a form using a script? If yes, please verify that when the latest submission is by script, the mobile worker is not in the report, because I assume in that case we're always using the latest CommCare version.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@jingcheng16 jingcheng16 marked this pull request as ready for review December 3, 2024 18:46
@jingcheng16 jingcheng16 added the product/all-users-all-environments Change impacts all users on all environments label Dec 3, 2024
@jingcheng16 jingcheng16 force-pushed the jc/commcare_client_version_compliance_report branch from 920074e to 94f54c6 Compare December 3, 2024 20:55
corehq/apps/builds/utils.py (outdated; resolved)
corehq/apps/builds/utils.py (outdated; resolved)
domain_mobile_workers = get_mobile_user_count(domain_obj.name, include_inactive=False)
if domain_mobile_workers:
    total_mobile_workers += domain_mobile_workers
    total_up_to_date += domain_mobile_workers - len(self.rows_for_domain(domain_obj))
Contributor

There is definitely a performance concern here, as if this summary query takes 2 minutes, that is 2 minutes blocking a web worker. It's worth trying to see if Elasticsearch is any faster here. Otherwise, we may want to cache this statistic.

@jingcheng16 (Contributor, Author) Dec 6, 2024

Tried on production again; it takes between 2 and 3 minutes. Last time I was about to leave my apartment so I didn't wait for it to finish, so I guess when I queried couch directly it took longer than 3 minutes. There is an improvement from using Elasticsearch. Is the improvement acceptable? I think maybe yes, because not every enterprise account has that many mobile workers...

Contributor

If we're talking minutes, I think we need to think about caching or turning this into a celery task. I can imagine that multiple requests to this page could cause a disruption in our responsiveness that would be hard to diagnose.

@mjriley (Contributor) left a comment

Most of the comments are non-blocking, but the most serious was just the performance concern for the summary tile. If Elasticsearch turns out to not be an option, or if that option is still too slow, we'd like to change this so that the data is cached, with the theory that we can stomach blocking our webworkers for 2 minutes every once in a while, but don't want to do so repeatedly.

@mjriley (Contributor) commented Dec 4, 2024

Instead of caching (which could still have negative effects if multiple enterprises all hit the enterprise console at the same time), we could also look at changing the system entirely. Right now, the enterprise console sends out a request for each tile and essentially blocks until the webworker processes it. Instead, we could send an initial request to the webworker that offloads the task to celery, and the page would then poll the webworker to see if the celery task had completed. Not something we need to do here, but it's an option in general for this and other tiles.
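The poll-based flow described here could be sketched roughly as follows. All names are hypothetical, and an in-memory dict stands in for celery's result backend (a real implementation would use `task.delay()` and `AsyncResult`); this is only meant to show the enqueue/poll shape:

```python
# Sketch of the "enqueue then poll" pattern (hypothetical names): the first
# request starts the slow computation and returns a task id; later requests
# poll with that id until the result is ready.

import uuid

_task_results = {}  # stands in for the celery result backend


def _slow_total(domain):
    # Placeholder for the expensive ES/couch aggregation.
    return len(domain)


def start_tile_total(domain):
    """First request: kick off the computation and return a task id."""
    task_id = str(uuid.uuid4())
    # In production this would be compute_tile_total.delay(domain);
    # here we compute eagerly so the sketch is self-contained.
    _task_results[task_id] = {"state": "SUCCESS", "result": _slow_total(domain)}
    return task_id


def poll_tile_total(task_id):
    """Later requests: return the result if ready, else a 'pending' marker."""
    entry = _task_results.get(task_id)
    if entry is None or entry["state"] != "SUCCESS":
        return {"ready": False}
    return {"ready": True, "total": entry["result"]}
```

The page would call the polling endpoint on an interval until `ready` is true, keeping each individual web request fast.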

@mjriley mjriley force-pushed the jc/commcare_client_version_compliance_report branch from 1b67a86 to d72d426 Compare December 9, 2024 17:52
@dimagimon dimagimon removed the Risk: Medium Change affects files that have been flagged as medium risk. label Dec 9, 2024
@jingcheng16 (Contributor, Author)
@mjriley Thank you for helping me resolve the merge issue. I have updated all my replies with the new SHA and it is ready for review.

@mjriley (Contributor) left a comment

Thanks for the changes. I think the initial review still largely reflects my feeling -- the code itself is solid, but I'm concerned about a long-running task blocking our webworkers.

biyeun and others added 6 commits December 11, 2024 10:03
The underscore is used as an alias for gettext translation function, so we can't use it to ignore the second return value
local variable 'version' referenced before assignment
@jingcheng16 jingcheng16 requested a review from mjriley December 12, 2024 03:59
jingcheng16 and others added 7 commits December 12, 2024 12:08
This is the main breakthrough on the performance issue. When inspecting the profiling results, we noticed the bottleneck is _pickle.loads. CommCareBuildConfig was caching a whole couch document, which takes a long time for pickle to deserialize.
Best practice is not to reference style classes as logic elements in JavaScript; additionally, CSS classes referenced by JavaScript should be prefixed with js-
It saves memory and also saves 1 second for large enterprise accounts
The latest performance bottleneck is primarily in I/O operations (specifically in BufferedReader.readline).
@mjriley (Contributor) left a comment

Great findings and improvements. Before I approve this, I just want to make sure our performance optimizations are worth it and don't have negative side effects. I left comments regarding our use of memoized and running the operations in parallel.

Comment on lines 75 to 86

        except KeyError:
            continue

    return None


@memoized
def get_build_time(version):
    build = CommCareBuild.get_build(version, latest=True)
    if build and build.time:
        return build.time
    return None
Contributor

It's generally a good idea to keep exception handling as close as possible to the operation that can raise the exception -- that way an unexpected operation can't throw the exception that was only meant to be thrown by another operation. It looks like we're catching KeyError due to CommCareBuild -- could we catch KeyError within get_build_time instead? If a KeyError is raised there, we can return None, and then the continue in get_latest_version_at_time doesn't seem necessary.

Contributor Author

Thank you! Sorry, this was extracted into its own function and memoized in a rush. Fixed in bb87820

Contributor

Nit: Your version is an improvement, and it hides the exception complexity, but it still does more than is needed in the try/except block. CommCareBuild.get_build is the statement that can throw, so it's slightly better and more explicit to write:

try:
    build = CommCareBuild.get_build(version, latest=True)
except KeyError:
    return None
return build.time if build and build.time else None

because it scopes the exception block to just the code that can raise the exception. I also prefer to return an explicit value, rather than letting it implicitly return nothing.

return None


@memoized
Contributor

The use of memoized makes me nervous, as my understanding is that the cached value will stick around until it is explicitly cleared, or the machine restarts. Say we created a version, this ran for that version, and then we had to delete the version. This function would still return the old build time, despite the version no longer existing, right? Worse still, because the cache is local to the machine, one webworker might return the old value, while a different webworker, that hadn't previously processed the request, would return the current value. Does it seem safer to perform manual, per-request caching? I.e. create a dictionary and insert values as the request generates them, but throw that dictionary away at the end of the request.
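A minimal sketch of the per-request caching suggested here (names hypothetical): the cache is a plain dict created for one report run and discarded with it, so a stale value can never outlive the request or differ between webworkers:

```python
# Per-request caching sketch (hypothetical names): wrap the fetch function
# in a closure whose cache dict lives only as long as the wrapper does.

def make_build_time_getter(fetch_build_time):
    """Return a getter whose cache is scoped to this report run."""
    cache = {}

    def get_build_time(version):
        if version not in cache:
            cache[version] = fetch_build_time(version)  # may be None
        return cache[version]

    return get_build_time
```

The report view would create one getter at the start of the request and simply let it go out of scope at the end, so no invalidation logic is needed.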

Contributor Author

Thank you for noticing it! I verified in our doc:

Memoized on any other function/callable:
The cache lives on the callable, so if it’s globally scoped and never gets garbage collected, neither does the cache

So I'm changing it to a quickcache and clearing it every 10 minutes: 44d9db5

Contributor

Is the intention 10 minutes? 100 * 60 seems to be 100 minutes, right? I'd still prefer a completely local, request-based caching mechanism, because I think it's more in-line with what you're trying to do, but I'll defer to your judgment here. This code works. The downside is that A) caching is harder to debug, B) quickcache is making redis/network requests, and C) we can still run into that inconsistent state due to a stale cache, but now that stale cache will expire after the timeout.

corehq/apps/enterprise/enterprise.py (outdated; resolved)
    if version:
        version_in_use = format_commcare_version(version) if formatted else version

    return version_in_use, date_of_use
Contributor

Would it be possible to convert date_of_use from a string to a datetime here? It doesn't look like date_of_use is referenced by the existing report functions, and this would simplify the logic in enterprise, as we could remove the logic that checks whether this is a date or a string

Contributor Author

I prefer that we just return what is stored in last_submission and last_device. get_commcare_version_and_date_from_last_usage's only purpose should be getting the info; "formatting" should not be part of it. Whether a string or a datetime is needed depends on the usage, so I prefer to let the caller of get_commcare_version_and_date_from_last_usage decide which format it needs.

Contributor

I want to take a step back. For most of our models, date fields are represented as a datetime object, as is the case with last_submission and last_device. But, they can't be written as date objects to databases, because they are python objects -- so they get serialized as strings in some format. But those are internal details to the serialization process that should only be visible within the database writing/reading layer. Once the object is retrieved from the database, the object should return a datetime to the user. We want to limit complexity, and carrying around a serialized date, rather than a date object, means the user always needs to remember that the string is in ISO format (which...is itself an implementation detail from DateTimeProperty)

get_commcare_version_and_date_from_last_usage should be thought of as the database access layer. It should know those low-level details, but it should protect the user from them. date_of_use is either last_submission.get('submission_date') or last_device.get('last_used'), but both are dealing with how DateTimeProperty serializes dates. So -- the user shouldn't have to guess whether we're returning a date or a string, because hopefully we're consistent. Either ElasticSearch returns dates here, or it returns strings. In this case, it looks like it always returns an ISO string, but again, that's the database layer's implementation detail, and we're returning results back to a higher layer that doesn't need to know those details. IMO, this function should protect the user from these details and convert the ISO string back to a datetime object.

On a less abstract level: the function is called get_commcare_version_and_date_from_last_usage, but returns an ambiguous date or string. Say we had a function called get_total that could return either an integer or a string representing an integer. Doesn't that seem less intuitive than always returning an integer?
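The boundary conversion argued for here could look roughly like the sketch below. ISO_DATETIME_FORMAT is an assumed stand-in for the codebase's constant; the point is that the access-layer function parses the serialized string once, so callers always receive a datetime:

```python
# Sketch: normalize the ES-serialized date at the data-access boundary so
# higher layers never see the ISO-string implementation detail.

from datetime import datetime

# Assumed format mirroring how DateTimeProperty serializes dates; the real
# constant in the codebase may differ.
ISO_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_usage_date(date_of_use):
    """Return a datetime whether ES handed us a str or a datetime."""
    if isinstance(date_of_use, str):
        return datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
    return date_of_use
```

With this helper applied inside get_commcare_version_and_date_from_last_usage, the isinstance check in the report layer becomes unnecessary.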

Comment on lines +490 to +494

        if isinstance(date_of_use, str):
            date_of_use = datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
        date_of_use_minute_precision = date_of_use.replace(second=0, microsecond=0)

        latest_version_at_time_of_use = get_latest_version_at_time(config, date_of_use_minute_precision)
Contributor

Can you explain why it is safe to remove seconds and microseconds from date_of_use? Admittedly it's an edge case, but it seems possible for us to have multiple builds that differ only by seconds or milliseconds. Removing those would then give inaccurate information.

Contributor Author

Seconds and microseconds are removed from the form submission time, not from our builds. The result is, for example: if we release Version 3.3 at 13:10:30 and a mobile worker submits a form using Version 3.2 at 13:10:40, after removing seconds it becomes 13:10:00; the latest version at 13:10:00 is still Version 3.2, so this mobile worker won't be counted as using an out-of-date CommCare version, although technically they are. I'm sacrificing a bit of accuracy to cache the results of get_latest_version_at_time.

Contributor

If we choose to go forward with this approach, I'd want it documented that we're reducing the accuracy of results in order to improve performance. I'm not sure reducing the accuracy of report data for the sake of a quicker summary makes sense. However, can we avoid this entirely now? It looks like the main bottlenecks for this function were fetching the config and build time, but both of those things have now been moved into their own areas. The config is only fetched once, outside of get_latest_version_at_time, and get_build_time is now cached. So I don't think we need to do any caching on get_latest_version_at_time anymore. At least on staging, our config has 65 items. I don't think it should be expensive to iterate backward through those items until we find a build time less than our target.
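Under the reviewer's assumptions (the config is fetched once, get_build_time is already cached, and the data shapes below are hypothetical), the lookup reduces to a cheap backward scan over a few dozen items:

```python
# Sketch of scanning ~65 config items backward for the newest build at or
# before the submission time. Data shape is assumed: a list of
# (build_time, version) pairs sorted ascending by build_time.

from datetime import datetime


def get_latest_version_at_time(menu, at_time):
    """Return the newest version whose build time is <= at_time, else None."""
    for build_time, version in reversed(menu):
        if build_time is not None and build_time <= at_time:
            return version
    return None
```

Since each call is a short in-memory loop, neither memoization nor truncating the submission timestamp to minute precision would be needed for performance.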

Comment on lines +574 to +586

def _process_domains_in_parallel(process_func, domains):
    """
    Helper method to process domains in parallel using gevent pooling.
    """
    results = []

    for chunk in chunked(domains, ES_CLIENT_CONNECTION_POOL_LIMIT):
        pool = Pool(size=ES_CLIENT_CONNECTION_POOL_LIMIT)
        chunk_results = pool.map(process_func, chunk)
        pool.join()
        results.extend(chunk_results)

    return results
Contributor

Curious if this is worth doing. I'm assuming that ElasticSearch has a connection limit across machines, and so while we can attempt to use all of them to process a single request, and that request will now be faster, I'd imagine it comes at the expense of any other requests trying to use ElasticSearch. Specifically for computing Enterprise Tile totals, we are going to be running multiple queries in parallel already, as loading that console page triggers multiple tile total requests. If each of them tries to max out their elasticsearch connections, it seems like we're basically back to running operations in sequence, rather than in parallel?
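One way to address this concern would be a process-wide cap shared by all concurrent tile requests, rather than letting each request size its own pool up to the connection limit. This is a sketch only: the names and budget value are hypothetical, and threads stand in for gevent greenlets:

```python
# Sketch: a shared semaphore bounds total in-flight ES queries across all
# requests, so several tile computations running "in parallel" cannot
# together exceed the connection budget.

import threading
from concurrent.futures import ThreadPoolExecutor

ES_CONNECTION_BUDGET = 10  # hypothetical global budget
_es_slots = threading.BoundedSemaphore(ES_CONNECTION_BUDGET)


def query_with_budget(query_func, item):
    """Run one query, blocking if the shared budget is exhausted."""
    with _es_slots:
        return query_func(item)


def process_domains(query_func, domains, per_request_workers=4):
    """Fan out per-domain queries, bounded per request and globally."""
    with ThreadPoolExecutor(max_workers=per_request_workers) as pool:
        return list(pool.map(lambda d: query_with_budget(query_func, d), domains))
```

Each request keeps its own modest worker count, while the semaphore guarantees the combined load on Elasticsearch stays below the budget regardless of how many requests arrive at once.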

@@ -270,13 +270,13 @@ hqDefine("enterprise/js/project_dashboard", [
    );

    kissmetrics.track.event(`[${metricType}] Visited page`);
-   $(".report-panel").each(function () {
+   $(".js-report-panel").each(function () {
Contributor

@biyeun I did a brief search, but can you recommend any reading on this practice?

Labels: product/all-users-all-environments (Change impacts all users on all environments)

4 participants