Add Enterprise CommCare Client Version Compliance Report #35468
Conversation
(Force-pushed from 920074e to 94f54c6)
corehq/apps/enterprise/enterprise.py
domain_mobile_workers = get_mobile_user_count(domain_obj.name, include_inactive=False)
if domain_mobile_workers:
    total_mobile_workers += domain_mobile_workers
    total_up_to_date += domain_mobile_workers - len(self.rows_for_domain(domain_obj))
There is definitely a performance concern here: if this summary query takes 2 minutes, that is 2 minutes blocking a web worker. It's worth trying to see if Elasticsearch is any faster here. Otherwise, we may want to cache this statistic.
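(For reference, a minimal sketch of what the Elasticsearch approach might look like -- UserES and these filter names are assumptions based on corehq's ES helpers, not code from this PR:)

from corehq.apps.es.users import UserES

def _active_mobile_worker_count(domain):
    # A count query avoids fetching documents, so it should be much
    # cheaper than pulling users out of Couch one page at a time.
    return UserES().domain(domain).mobile_users().is_active().count()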
Tried on production again; it takes between 2 and 3 minutes. Last time I was about to leave my apartment, so I didn't wait for it to finish -- when I queried Couch directly it may have taken longer than 3 minutes. So there is an improvement from using Elasticsearch. Is the improvement acceptable? I think maybe yes, because not every enterprise account has that many mobile workers...
If we're talking minutes, I think we need to think about caching or turning this into a Celery task. I can imagine that multiple requests to this page could cause a disruption in our responsiveness that would be hard to diagnose.
corehq/apps/enterprise/static/enterprise/js/project_dashboard.js
Most of the comments are non-blocking; the most serious was the performance concern for the summary tile. If Elasticsearch turns out not to be an option, or if that option is still too slow, we'd like to change this so that the data is cached, with the theory that we can stomach blocking our webworkers for 2 minutes every once in a while, but don't want to do so repeatedly.
Instead of caching (which could still have negative effects if multiple enterprises all hit the enterprise console at the same time), we could also look at changing the system entirely. Right now, the enterprise console sends out a request for each tile and essentially blocks until the webworker processes it. Instead, we could send an initial request to the webworker, which would offload the task to Celery, and the page would then poll the webworker to see whether the Celery task had completed. Not something we need to do here, but it's an option in general for this and other tiles.
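A rough sketch of that offload-and-poll pattern, with hypothetical names (compute_tile_total, tile_total_status, _expensive_summary_query) standing in for whatever the actual implementation would use:

from celery import shared_task
from django.core.cache import cache
from django.http import JsonResponse

@shared_task
def compute_tile_total(account_id):
    # _expensive_summary_query is a hypothetical stand-in for the
    # multi-minute summary work described above
    total = _expensive_summary_query(account_id)
    cache.set(f'tile-total-{account_id}', total, timeout=10 * 60)

def tile_total_status(request, account_id):
    key = f'tile-total-{account_id}'
    total = cache.get(key)
    if total is None:
        # cache.add is atomic, so only the first poller enqueues the task;
        # later polls just report "pending" until the result lands in cache
        if cache.add(f'{key}-pending', True, timeout=5 * 60):
            compute_tile_total.delay(account_id)
        return JsonResponse({'status': 'pending'})
    return JsonResponse({'status': 'done', 'total': total})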
(Force-pushed from 1b67a86 to d72d426)
@mjriley Thank you for helping me resolve the merge issue. I have updated all my replies with the new SHAs, and it is ready for review.
Thanks for the changes. I think my initial review still largely reflects how I feel -- the code itself is solid, but I'm concerned about a long-running task blocking our webworkers.
The underscore is used as an alias for the gettext translation function, so we can't use it to ignore the second return value.
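(To illustrate the clash -- a sketch, not this PR's actual code; get_version_and_date is a hypothetical stand-in:)

from django.utils.translation import gettext as _  # `_` is already taken as the translator

# version, _ = get_version_and_date(user)          # would shadow the gettext alias
version, _unused_date = get_version_and_date(user)  # an explicit throwaway name avoids the clash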
local variable 'version' referenced before assignment
This is the main breakthrough on the performance issue. When inspecting the profiling results, we noticed the bottleneck is _pickle.loads: CommCareBuildConfig is caching a whole Couch document, which takes pickle a long time to deserialize.
Best practice is not to reference style classes as logic elements in JavaScript. Additionally, CSS classes referenced by JavaScript should be prefixed with js-.
It saves memory and also saves about 1 second for a large enterprise account.
The latest performance bottleneck is primarily in I/O operations (specifically in BufferedReader.readline).
Great findings and improvements. Before I approve this, I just want to make sure our performance optimizations are worth it and don't have negative side effects. I left comments regarding our use of memoized and running the operations in parallel.
corehq/apps/builds/utils.py
        except KeyError:
            continue

    return None


@memoized
def get_build_time(version):
    build = CommCareBuild.get_build(version, latest=True)
    if build and build.time:
        return build.time
    return None
It's generally a good idea to keep exception handling as close as possible to the operation that can raise the exception -- that way an unexpected operation can't throw an exception that was only meant to be thrown by another operation. It looks like we're catching KeyError due to CommCareBuild -- could we catch KeyError within get_build_time instead? If a KeyError is raised there, we can return None, and then the continue in get_latest_version_at_time doesn't seem necessary.
Thank you! Sorry, this was extracted into its own function and memoized in a rush. Fixed in bb87820
Nit: Your version is an improvement, and it hides the exception complexity, but it still does more than is needed in the try/except block. The CommCareBuild.get_build call is the statement that can throw, so it's slightly better and more explicit to write:
try:
    build = CommCareBuild.get_build(version, latest=True)
except KeyError:
    return None
return build.time if build and build.time else None
because it scopes the exception block to just the code that can raise the exception. I also prefer to return an explicit value, rather than letting it implicitly return nothing.
corehq/apps/builds/utils.py
    return None


@memoized
The use of memoized makes me nervous, as my understanding is that the cached value will stick around until it is explicitly cleared, or the machine restarts. Say we created a version, this ran for that version, and then we had to delete the version. This function would still return the old build time, despite the version no longer existing, right? Worse still, because the cache is local to the machine, one webworker might return the old value, while a different webworker that hadn't previously processed the request would return the current value. Does it seem safer to perform manual, per-request caching? I.e., create a dictionary and insert values as the request generates them, but throw that dictionary away at the end of the request.
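Something like the following, as a sketch of that request-scoped approach (the function name is illustrative):

def _build_times_for_report(versions):
    # The dict lives only for this report run, so a deleted or changed
    # build can never leak a stale time into a later request.
    build_time_cache = {}
    for version in versions:
        if version not in build_time_cache:
            build_time_cache[version] = get_build_time(version)
    return build_time_cache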
Thank you for noticing it! I verified in our docs:
Memoized on any other function/callable:
The cache lives on the callable, so if it's globally scoped and never gets garbage collected, neither does the cache
So I'm changing it to a quickcache and clearing it every 10 minutes: 44d9db5
Is the intention 10 minutes? 100 * 60 seems to be 100 minutes, right? I'd still prefer a completely local, request-based caching mechanism, because I think it's more in line with what you're trying to do, but I'll defer to your judgment here. This code works. The downsides are that A) caching is harder to debug, B) quickcache makes redis/network requests, and C) we can still run into that inconsistent state due to a stale cache, but now that stale cache will expire after the timeout.
if version:
    version_in_use = format_commcare_version(version) if formatted else version

return version_in_use, date_of_use
Would it be possible to convert date_of_use from a string to a datetime here? It doesn't look like date_of_use is referenced by the existing report functions, and this would simplify the logic in enterprise, as we could remove the logic that checks whether this is a date or a string.
I prefer we just return what is stored in last_submission and last_device. get_commcare_version_and_date_from_last_usage's only purpose should be to get the info; "formatting" should not be part of it. String or datetime depends on how this function is used, so I prefer to let the functions that call get_commcare_version_and_date_from_last_usage decide which format they need.
I want to take a step back. For most of our models, date fields are represented as a datetime object, as is the case with last_submission and last_device. But they can't be written as date objects to databases, because they are Python objects -- so they get serialized as strings in some format. Those are internal details of the serialization process that should only be visible within the database writing/reading layer. Once the object is retrieved from the database, it should return a datetime to the user. We want to limit complexity, and carrying around a serialized date, rather than a date object, means the user always needs to remember that the string is in ISO format (which is itself an implementation detail of DateTimeProperty).

get_commcare_version_and_date_from_last_usage should be thought of as the database access layer. It should know those low-level details, but it should protect the user from them. date_of_use is either last_submission.get('submission_date') or last_device.get('last_used'), but both are dealing with how DateTimeProperty serializes dates. So the user shouldn't have to guess whether we're returning a date or a string, because hopefully we're consistent: either Elasticsearch returns dates here, or it returns strings. In this case, it looks like it always returns an ISO string, but again, that's the database layer's implementation detail, and we're returning results back to a higher layer that doesn't need to know those details. IMO, this function should protect the user from these details and convert the ISO string back to a datetime object.

On a less abstract level: the function is called get_commcare_version_and_date_from_last_usage, but returns an ambiguous date or string. Say we had a function called get_total that could return either an integer or a string representing an integer. Doesn't that seem less intuitive than always returning an integer?
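Concretely, the conversion could live inside the access layer itself -- a sketch, where the format string is an assumption about how DateTimeProperty serializes:

from datetime import datetime

ISO_DATETIME_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'  # assumed serialization format

def _as_datetime(date_of_use):
    # Callers always receive a datetime; the ISO-string detail
    # stays hidden inside the data-access layer.
    if isinstance(date_of_use, str):
        return datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
    return date_of_use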
if isinstance(date_of_use, str):
    date_of_use = datetime.strptime(date_of_use, ISO_DATETIME_FORMAT)
date_of_use_minute_precision = date_of_use.replace(second=0, microsecond=0)

latest_version_at_time_of_use = get_latest_version_at_time(config, date_of_use_minute_precision)
Can you explain why it is safe to remove seconds and microseconds from date_of_use? Admittedly it's an edge case, but it seems possible for us to have multiple builds that differ only by seconds or milliseconds. Removing those would then give inaccurate information.
Seconds and microseconds are removed from the form submission time, not from our builds. So, for example, if we release Version 3.3 at 13:10:30 and a mobile worker submits a form using Version 3.2 at 13:10:40, removing seconds turns that into 13:10:00; the latest version at 13:10:00 is still Version 3.2, so this mobile worker won't be counted as using an out-of-date CommCare version, although technically they are. I'm sacrificing a bit of accuracy in order to cache the result of get_latest_version_at_time.
If we choose to go forward with this approach, I'd want it documented that we're reducing the accuracy of results in order to improve performance. I'm not sure reducing the accuracy of report data for the sake of a quicker summary makes sense. However, can we avoid this entirely now? It looks like the main bottlenecks for this function were fetching the config and the build time, but both of those have now been moved into their own areas: the config is only fetched once, outside of get_latest_version_at_time, and get_build_time is now cached. So I don't think we need to do any caching on get_latest_version_at_time anymore. At least on staging, our config has 65 items; I don't think it should be expensive to iterate backward through those items until we find a build time less than our target.
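I.e., roughly this shape -- a sketch only, where config.menu and item.build.version are assumptions about CommCareBuildConfig's structure:

def get_latest_version_at_time(config, target_time):
    # Walk the (roughly 65-item) menu newest-to-oldest and return the
    # first version released at or before the target time -- cheap
    # enough that no extra caching layer seems necessary.
    for item in reversed(config.menu):
        build_time = get_build_time(item.build.version)
        if build_time and build_time <= target_time:
            return item.build.version
    return None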
def _process_domains_in_parallel(process_func, domains):
    """
    Helper method to process domains in parallel using gevent pooling.
    """
    results = []

    for chunk in chunked(domains, ES_CLIENT_CONNECTION_POOL_LIMIT):
        pool = Pool(size=ES_CLIENT_CONNECTION_POOL_LIMIT)
        chunk_results = pool.map(process_func, chunk)
        pool.join()
        results.extend(chunk_results)

    return results
Curious if this is worth doing. I'm assuming that Elasticsearch has a connection limit across machines, so while we can use all of those connections to make a single request faster, I'd imagine it comes at the expense of any other requests trying to use Elasticsearch. Specifically for computing Enterprise tile totals, we are already going to be running multiple queries in parallel, as loading the console page triggers multiple tile total requests. If each of them tries to max out its Elasticsearch connections, it seems like we're basically back to running operations in sequence rather than in parallel?
@@ -270,13 +270,13 @@ hqDefine("enterprise/js/project_dashboard", [
     );

     kissmetrics.track.event(`[${metricType}] Visited page`);
-    $(".report-panel").each(function () {
+    $(".js-report-panel").each(function () {
@biyeun I did a brief search, but can you recommend any reading on this practice?
corehq/apps/enterprise/templates/enterprise/project_dashboard.html
Product Description
New Tile
Report
API response
Technical Summary
Ticket: https://dimagi.atlassian.net/browse/SAAS-16236
I'm a bit worried about the performance. When I tested on production, it could take more than 2 minutes to generate the total number for an account like CRS.
Feature Flag
We're doing an iterative release, so no feature flag. Once it is merged, we release it.
Safety Assurance
Safety story
Tested locally, on staging, and also on production.
Automated test coverage
Added tests for the util functions I wrote.
QA Plan
QA Ticket: https://dimagi.atlassian.net/browse/QA-7315
Verify the tile works as expected:
Email the report:
A mobile worker who doesn't have any device yet and hasn't submitted any forms yet will not be included in the report.
A mobile worker whose latest submission used the latest CommCare version will not be included in the report.
A mobile worker whose latest submission used an older CommCare version (instructions to install an older version) will be included in the report. Then, with the same mobile worker, submit a form using the "Log in as xx mobile worker" feature and verify this mobile worker is no longer in the report, because I assume in this case we're always using the latest CommCare version. (Using the same mobile worker verifies that a submission through the "Log in as xx mobile worker" feature replaces the previous submission when calculating the version number.)
Can a mobile worker submit a form using a script? If yes, please verify that when the latest submission is by script, this mobile worker is not in the report, because I assume in this case we're always using the latest CommCare version.
Rollback instructions
Labels & Review