
Add Enterprise CommCare Client Version Compliance Report #35468

Open · wants to merge 40 commits into master from jc/commcare_client_version_compliance_report
Conversation

@jingcheng16 (Contributor) commented Dec 3, 2024

Product Description

New Tile
(screenshot)

Report
(screenshot)

API response
(screenshot)

Technical Summary

Ticket: https://dimagi.atlassian.net/browse/SAAS-16236

I'm a bit worried about the performance. When I tested on production, it can take more than 2 minutes to generate the total number for an account like CRS.

Feature Flag

We're doing an iterative release, so no feature flag. Once it is merged, we release it.

Safety Assurance

Safety story

Tested locally, on staging, and also on production.

Automated test coverage

Added a test for the util function I wrote.

QA Plan

QA Ticket: https://dimagi.atlassian.net/browse/QA-7315

Verify the tile works as expected:

  • An enterprise account without any mobile workers should see '--' shown in the tile.

Email the report:

  • A mobile worker who doesn't have any device yet and hasn't submitted any forms will not be included in the report.

  • A mobile worker whose latest submission used the latest CommCare version will not be included in the report.

  • A mobile worker whose latest submission used an older CommCare version (instructions to install an older version) will be included in the report. Then, with the same mobile worker, submit a form through the "Log in as xx mobile worker" feature and verify this mobile worker is no longer in the report, because I assume that feature always uses the latest CommCare version. (Using the same mobile worker verifies that a submission through "Log in as xx mobile worker" replaces the previous submission when calculating the version number.)

  • Can a mobile worker submit a form using a script? If yes, please verify that when the latest submission is by script, the mobile worker is not in the report, because I assume in that case we're always using the latest CommCare version.

Rollback instructions

  • This PR can be reverted after deploy with no further considerations

Labels & Review

  • Risk label is set correctly
  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@jingcheng16 jingcheng16 marked this pull request as ready for review December 3, 2024 18:46
@jingcheng16 jingcheng16 added the product/all-users-all-environments Change impacts all users on all environments label Dec 3, 2024
@jingcheng16 jingcheng16 force-pushed the jc/commcare_client_version_compliance_report branch from 920074e to 94f54c6 Compare December 3, 2024 20:55
corehq/apps/builds/utils.py (outdated; resolved)
corehq/apps/builds/utils.py (outdated; resolved)
domain_mobile_workers = get_mobile_user_count(domain_obj.name, include_inactive=False)
if domain_mobile_workers:
    total_mobile_workers += domain_mobile_workers
    total_up_to_date += domain_mobile_workers - len(self.rows_for_domain(domain_obj))
Contributor

There is definitely a performance concern here, as if this summary query takes 2 minutes, that is 2 minutes blocking a web worker. It's worth trying to see if Elasticsearch is any faster here. Otherwise, we may want to cache this statistic.

@jingcheng16 (Contributor, Author) Dec 6, 2024

Tried on production again; it takes between 2 and 3 minutes. Last time I was about to leave my apartment so I didn't wait for it to finish, so I guess when I queried couch directly it took longer than 3 minutes. There is an improvement from using Elasticsearch. Is the improvement acceptable? I think maybe yes, because not every enterprise account has that many mobile workers...

Contributor

If we're talking minutes, I think we need to think about caching or turning this into a celery task. I can imagine that multiple requests to this page could cause a disruption in our responsiveness that would be hard to diagnose.

@mjriley (Contributor) left a comment

Most of the comments are non-blocking, but the most serious was just the performance concern for the summary tile. If Elasticsearch turns out to not be an option, or if that option is still too slow, we'd like to change this so that the data is cached, with the theory that we can stomach blocking our webworkers for 2 minutes every once in a while, but don't want to do so repeatedly.

@mjriley (Contributor) commented Dec 4, 2024

Instead of caching (which could still have negative effects if multiple enterprises all hit the enterprise console at the same time), we could also look at changing the system entirely. Right now, the enterprise console sends out a request for each tile and essentially blocks until the webworker processes it. Instead, we could send an initial request to the webworker that offloads the task to celery, and the page would then poll the webworker to see if the celery task had completed. Not something we need to do here, but it's an option in general for this and other tiles.
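The poll-based flow described here could be sketched roughly as follows. All names are hypothetical, and an in-memory dict stands in for celery's result backend (a real implementation would use `task.delay()` and `AsyncResult`); this is only meant to show the enqueue/poll shape:

```python
# Sketch of the "enqueue then poll" pattern (hypothetical names): the first
# request starts the slow computation and returns a task id; later requests
# poll with that id until the result is ready.

import uuid

_task_results = {}  # stands in for the celery result backend


def _slow_total(domain):
    # Placeholder for the expensive ES/couch aggregation.
    return len(domain)


def start_tile_total(domain):
    """First request: kick off the computation and return a task id."""
    task_id = str(uuid.uuid4())
    # In production this would be compute_tile_total.delay(domain);
    # here we compute eagerly so the sketch is self-contained.
    _task_results[task_id] = {"state": "SUCCESS", "result": _slow_total(domain)}
    return task_id


def poll_tile_total(task_id):
    """Later requests: return the result if ready, else a 'pending' marker."""
    entry = _task_results.get(task_id)
    if entry is None or entry["state"] != "SUCCESS":
        return {"ready": False}
    return {"ready": True, "total": entry["result"]}
```

The page would call the polling endpoint on an interval until `ready` is true, keeping each individual web request fast.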

@mjriley mjriley force-pushed the jc/commcare_client_version_compliance_report branch from 1b67a86 to d72d426 Compare December 9, 2024 17:52
@dimagimon dimagimon removed the Risk: Medium Change affects files that have been flagged as medium risk. label Dec 9, 2024
@jingcheng16 (Contributor, Author)
@mjriley Thank you for helping me resolve the merge issue. I have updated all my replies with the new SHA and it is ready for review.

@mjriley (Contributor) left a comment

Thanks for the changes. I think the initial review still largely reflects my feeling -- the code itself is solid, but I'm concerned about a long-running task blocking our webworkers.

biyeun and others added 6 commits December 11, 2024 10:03
The underscore is used as an alias for gettext translation function, so we can't use it to ignore the second return value
local variable 'version' referenced before assignment
@jingcheng16 jingcheng16 requested a review from mjriley December 12, 2024 03:59
jingcheng16 and others added 7 commits December 12, 2024 12:08
This is the main breakthrough on the performance issue. When inspecting the profiling results, we noticed the bottleneck is _pickle.loads. CommCareBuildConfig was caching a whole couch document, which takes a long time for pickle to deserialize.
Best practice is not to reference style classes as logic elements in JavaScript; additionally, CSS classes referenced by JavaScript should be prefixed with js-
It saves memory and also saves 1 second for large enterprise accounts
The latest performance bottleneck is primarily in I/O operations (specifically in BufferedReader.readline).
@mjriley (Contributor) left a comment

Great findings and improvements. Before I approve this, I just want to make sure our performance optimizations are worth it and don't have negative side effects. I left comments regarding our use of memoized and running the operations in parallel.

Comment on lines 75 to 86

        except KeyError:
            continue

    return None


@memoized
def get_build_time(version):
    build = CommCareBuild.get_build(version, latest=True)
    if build and build.time:
        return build.time
    return None
Contributor

It's generally a good idea to keep exception handling as close as possible to the operation that can raise the exception -- that way an unexpected operation can't throw the exception that was only meant to be thrown by another operation. It looks like we're catching KeyError due to CommCareBuild -- could we catch KeyError within get_build_time instead? If a KeyError is raised there, we can return None, and then the continue in get_latest_version_at_time doesn't seem necessary.

Contributor Author

Thank you! Sorry, this was extracted into its own function and memoized in a rush. Fixed in bb87820

Contributor

Nit: Your version is an improvement, and it hides the exception complexity, but it still does more than is needed in the try/except block. CommCareBuild.get_build is the statement that can throw, so it's slightly better and more explicit to write:

try:
    build = CommCareBuild.get_build(version, latest=True)
except KeyError:
    return None
return build.time if build and build.time else None

because it scopes the exception block to just the code that can raise the exception. I also prefer to return an explicit value, rather than letting it implicitly return nothing.

return None


@memoized
Contributor

The use of memoized makes me nervous, as my understanding is that the cached value will stick around until it is explicitly cleared, or the machine restarts. Say we created a version, this ran for that version, and then we had to delete the version. This function would still return the old build time, despite the version no longer existing, right? Worse still, because the cache is local to the machine, one webworker might return the old value, while a different webworker, that hadn't previously processed the request, would return the current value. Does it seem safer to perform manual, per-request caching? I.e. create a dictionary and insert values as the request generates them, but throw that dictionary away at the end of the request.
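A minimal sketch of the per-request caching suggested here (names hypothetical): the cache is a plain dict created for one report run and discarded with it, so a stale value can never outlive the request or differ between webworkers:

```python
# Per-request caching sketch (hypothetical names): wrap the fetch function
# in a closure whose cache dict lives only as long as the wrapper does.

def make_build_time_getter(fetch_build_time):
    """Return a getter whose cache is scoped to this report run."""
    cache = {}

    def get_build_time(version):
        if version not in cache:
            cache[version] = fetch_build_time(version)  # may be None
        return cache[version]

    return get_build_time
```

The report view would create one getter at the start of the request and simply let it go out of scope at the end, so no invalidation logic is needed.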

Contributor Author

Thank you for noticing it! I verified in our doc:

Memoized on any other function/callable:
The cache lives on the callable, so if it’s globally scoped and never gets garbage collected, neither does the cache

So I'm changing it to a quickcache and clearing it every 10 minutes: 44d9db5

Contributor

Is the intention 10 minutes? 100 * 60 seems to be 100 minutes, right? I'd still prefer a completely local, request-based caching mechanism, because I think it's more in-line with what you're trying to do, but I'll defer to your judgment here. This code works. The downside is that A) caching is harder to debug, B) quickcache is making redis/network requests, and C) we can still run into that inconsistent state due to a stale cache, but now that stale cache will expire after the timeout.

corehq/apps/enterprise/enterprise.py (outdated; resolved)
    if version:
        version_in_use = format_commcare_version(version) if formatted else version

    return version_in_use, date_of_use
Contributor

Would it be possible to convert date_of_use from a string to a datetime here? It doesn't look like date_of_use is referenced by the existing report functions, and this would simplify the logic in enterprise, as we could remove the logic that checks whether this is a date or a string

Contributor Author

I prefer that we just return what is stored in last_submission and last_device. get_commcare_version_and_date_from_last_usage's only purpose should be getting the info; "formatting" should not be part of it. Whether a string or a datetime is needed depends on the usage, so I prefer to let the caller of get_commcare_version_and_date_from_last_usage decide which format it needs.

Contributor

I want to take a step back. For most of our models, date fields are represented as a datetime object, as is the case with last_submission and last_device. But, they can't be written as date objects to databases, because they are python objects -- so they get serialized as strings in some format. But those are internal details to the serialization process that should only be visible within the database writing/reading layer. Once the object is retrieved from the database, the object should return a datetime to the user. We want to limit complexity, and carrying around a serialized date, rather than a date object, means the user always needs to remember that the string is in ISO format (which...is itself an implementation detail from DateTimeProperty)

get_commcare_version_and_date_from_last_usage should be thought of as the database access layer. It should know those low-level details, but it should protect the user from them. date_of_use is either last_submission.get('submission_date') or last_device.get('last_used'), but both are dealing with how DateTimeProperty serializes dates. So -- the user shouldn't have to guess whether we're returning a date or a string, because hopefully we're consistent. Either ElasticSearch returns dates here, or it returns strings. In this case, it looks like it always returns an ISO string, but again, that's the database layer's implementation detail, and we're returning results back to a higher layer that doesn't need to know those details. IMO, this function should protect the user from these details and convert the ISO string back to a datetime object.

On a less abstract level: the function is called get_commcare_version_and_date_from_last_usage, but returns an ambiguous date or string. Say we had a function called get_total that could return either an integer or a string representing an integer. Doesn't that seem less intuitive than always returning an integer?
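The boundary conversion argued for here could look roughly like the sketch below. ISO_DATETIME_FORMAT is an assumed stand-in for the codebase's constant; the point is that the access-layer function parses the serialized string once, so callers always receive a datetime:

```python
# Sketch: normalize the ES-serialized date at the data-access boundary so
# higher layers never see the ISO-string implementation detail.

from datetime import datetime

# Assumed format mirroring how DateTimeProperty serializes dates; the real
# constant in the codebase may differ.
ISO_DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_usage_date(date_of_use):
    """Return a datetime whether ES handed us a str or a datetime."""
    if isinstance(date_of_use, str):
        return datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
    return date_of_use
```

With this helper applied inside get_commcare_version_and_date_from_last_usage, the isinstance check in the report layer becomes unnecessary.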

Comment on lines +490 to +494

        if isinstance(date_of_use, str):
            date_of_use = datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
        date_of_use_minute_precision = date_of_use.replace(second=0, microsecond=0)

        latest_version_at_time_of_use = get_latest_version_at_time(config, date_of_use_minute_precision)
Contributor

Can you explain why it is safe to remove seconds and microseconds from date_of_use? Admittedly it's an edge case, but it seems possible for us to have multiple builds that differ only by seconds or milliseconds. Removing those would then give inaccurate information.

Contributor Author

Seconds and microseconds are removed from the form submission time, not from our builds. The result is, for example: if we release Version 3.3 at 13:10:30 and a mobile worker submits a form using Version 3.2 at 13:10:40, after removing seconds it becomes 13:10:00; the latest version at 13:10:00 is still Version 3.2, so this mobile worker won't be counted as using an out-of-date CommCare version, although technically they are. I'm sacrificing a bit of accuracy to cache the results of get_latest_version_at_time.

Contributor

If we choose to go forward with this approach, I'd want it documented that we're reducing the accuracy of results in order to improve performance. I'm not sure reducing the accuracy of report data for the sake of a quicker summary makes sense. However, can we avoid this entirely now? It looks like the main bottlenecks for this function were fetching the config and build time, but both of those things have now been moved into their own areas. The config is only fetched once, outside of get_latest_version_at_time, and get_build_time is now cached. So I don't think we need to do any caching on get_latest_version_at_time anymore. At least on staging, our config has 65 items. I don't think it should be expensive to iterate backward through those items until we find a build time less than our target.
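Under the reviewer's assumptions (the config is fetched once, get_build_time is already cached, and the data shapes below are hypothetical), the lookup reduces to a cheap backward scan over a few dozen items:

```python
# Sketch of scanning ~65 config items backward for the newest build at or
# before the submission time. Data shape is assumed: a list of
# (build_time, version) pairs sorted ascending by build_time.

from datetime import datetime


def get_latest_version_at_time(menu, at_time):
    """Return the newest version whose build time is <= at_time, else None."""
    for build_time, version in reversed(menu):
        if build_time is not None and build_time <= at_time:
            return version
    return None
```

Since each call is a short in-memory loop, neither memoization nor truncating the submission timestamp to minute precision would be needed for performance.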

Comment on lines +574 to +586

def _process_domains_in_parallel(process_func, domains):
    """
    Helper method to process domains in parallel using gevent pooling.
    """
    results = []

    for chunk in chunked(domains, ES_CLIENT_CONNECTION_POOL_LIMIT):
        pool = Pool(size=ES_CLIENT_CONNECTION_POOL_LIMIT)
        chunk_results = pool.map(process_func, chunk)
        pool.join()
        results.extend(chunk_results)

    return results
Contributor

Curious if this is worth doing. I'm assuming that ElasticSearch has a connection limit across machines, and so while we can attempt to use all of them to process a single request, and that request will now be faster, I'd imagine it comes at the expense of any other requests trying to use ElasticSearch. Specifically for computing Enterprise Tile totals, we are going to be running multiple queries in parallel already, as loading that console page triggers multiple tile total requests. If each of them tries to max out their elasticsearch connections, it seems like we're basically back to running operations in sequence, rather than in parallel?
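One way to address this concern would be a process-wide cap shared by all concurrent tile requests, rather than letting each request size its own pool up to the connection limit. This is a sketch only: the names and budget value are hypothetical, and threads stand in for gevent greenlets:

```python
# Sketch: a shared semaphore bounds total in-flight ES queries across all
# requests, so several tile computations running "in parallel" cannot
# together exceed the connection budget.

import threading
from concurrent.futures import ThreadPoolExecutor

ES_CONNECTION_BUDGET = 10  # hypothetical global budget
_es_slots = threading.BoundedSemaphore(ES_CONNECTION_BUDGET)


def query_with_budget(query_func, item):
    """Run one query, blocking if the shared budget is exhausted."""
    with _es_slots:
        return query_func(item)


def process_domains(query_func, domains, per_request_workers=4):
    """Fan out per-domain queries, bounded per request and globally."""
    with ThreadPoolExecutor(max_workers=per_request_workers) as pool:
        return list(pool.map(lambda d: query_with_budget(query_func, d), domains))
```

Each request keeps its own modest worker count, while the semaphore guarantees the combined load on Elasticsearch stays below the budget regardless of how many requests arrive at once.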

@@ -270,13 +270,13 @@ hqDefine("enterprise/js/project_dashboard", [
    );

    kissmetrics.track.event(`[${metricType}] Visited page`);
-   $(".report-panel").each(function () {
+   $(".js-report-panel").each(function () {
Contributor

@biyeun I did a brief search, but can you recommend any reading on this practice?

Labels: product/all-users-all-environments (Change impacts all users on all environments)

4 participants