-
-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Application Version Compliance Report #35567
Conversation
🔍 Existing Issues For ReviewYour pull request is modifying functions with the following pre-existing issues: 📄 File: corehq/apps/enterprise/enterprise.py
Did you find this useful? React with a 👍 or 👎 |
…ument This commit improved performance by 92%
since we're no longer wrapping build doc
This commit improved performance by 92%
Improve performance by 20%
Improved performance by 30%
It increases num of calls to `get_app_builds` by 10%, but still saves a lot of time because most `get_app_builds` now only need to return first 10 builds
51b6ee6
to
848cf09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good! Did leave some context/structural/reasoning questions.
corehq/apps/enterprise/enterprise.py
Outdated
all_builds = self.get_app_builds(domain, app_id, limit=self.INITIAL_QUERY_LIMIT) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) | ||
if not latest_version: | ||
all_builds = self.get_app_builds(domain, app_id) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like get_app_builds
is only used by get_latest_build_version_at_time
. Rather than just retrying the query here, could that if not
logic be moved (or removed) if get_app_builds
is called directly by get_latest_build_version_at_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on this? Here what I'm trying to do is first get the latest 10 builds, and see if we can identify the latest version at use time from this 10 builds, if not, then query again to get all builds, and try to identify the latest version from all builds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense what this is trying to do, I think what I'm getting at is that I think the retry logic should be handled in the method doing the fetching, and doesn't need to be seen by _get_user_builds
. Maybe I'm overcomplicating it/trying to overabstract things. But I'm imagining something like:
def get_latest_build_version_at_time(domain, app_id, time):
builds = self.get_app_builds(domain, app_id, limit=self.INITIAL_QUERY_LIMIT)
latest_build = self._get_latest_build_version_if_present(builds, time)
if latest_build is None:
builds = self.get_app_builds(domain, app_id, start=self.INITIAL_QUERY_LIMIT)
latest_build = self._get_latest_build_version_if_present(builds, time)
return latest_build
Where _get_latest_build_version_if_present
(or whatever naming) is handling the loop and returning None
if the latest isn't in the builds that were passed in, and get_app_builds
is modified to take a start
parameter or something similar, so it doesn't return the same initial 10 builds along with the rest of them. Then get_latest_build_version_at_time
will always return the actual latest build version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to separating out that logic. A helper method should be capable of doing everything to get the latest build version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored in 25e5e52, also updated the docstring!
corehq/apps/enterprise/enterprise.py
Outdated
if not latest_version: | ||
all_builds = self.get_app_builds(domain, app_id) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) | ||
yield user, build, latest_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love that this yields three unnamed items (as opposed to an object or namedtuple or something), but given it's a private method it's probably okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on how it is used, you don't need to return the user right? And it might make a bit more sense to return build['build_version'] instead of build right? That would help make this more readable to Evan's point too.
yield current_version, latest_version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_get_user_builds is also used in
commcare-hq/corehq/apps/enterprise/enterprise.py
Lines 686 to 698 in 25e5e52
for user, build, latest_version in self._get_user_builds(domain, apps): | |
if is_out_of_date(str(build['build_version']), str(latest_version)): | |
app_id = build['app_id'] | |
if app_id not in app_name_by_id: | |
app_name_by_id[app_id] = Application.get_db().get(app_id).get('name') | |
rows.append([ | |
user['username'], | |
domain, | |
app_name_by_id[app_id], | |
latest_version, | |
build['build_version'], | |
self.format_date(DateTimeProperty.deserialize(build['build_version_date'])), | |
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for me, I think its that the return isn't particularly intuitive. I don't expect that I'll receive user
, build
, and latest_version
from _get_user_builds
. Given that it isn't obvious what the return value is, there might be some benefit to instead return a dictionary with those values clearly named -- so now _get_user_builds
returns some collection of data, and the calling function explicitly referencing the named values makes what its doing a bit easier to follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright! Returned a dictionary, also just return username instead of the whole user object in 4d009d5
Because this tile based on `LastBuild`, build will be updated by either a form submission, or a sync, or mobile heartbeat.
corehq/apps/enterprise/enterprise.py
Outdated
@@ -639,6 +639,7 @@ def rows_for_domain(self, domain_obj): | |||
class EnterpriseAppVersionComplianceReport(EnterpriseReport): | |||
title = gettext_lazy('Application Version Compliance') | |||
total_description = gettext_lazy('% of Applications Up-to-Date Across All Mobile Workers') | |||
INITIAL_QUERY_LIMIT = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you settle on the number 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think normally the version should be not too out-of-date, so I thought 10 might be a good number...
corehq/apps/enterprise/enterprise.py
Outdated
if not latest_version: | ||
all_builds = self.get_app_builds(domain, app_id) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) | ||
yield user, build, latest_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on how it is used, you don't need to return the user right? And it might make a bit more sense to return build['build_version'] instead of build right? That would help make this more readable to Evan's point too.
yield current_version, latest_version
corehq/apps/enterprise/enterprise.py
Outdated
all_builds = self.get_app_builds(domain, app_id, limit=self.INITIAL_QUERY_LIMIT) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) | ||
if not latest_version: | ||
all_builds = self.get_app_builds(domain, app_id) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to separating out that logic. A helper method should be capable of doing everything to get the latest build version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Mostly some minor suggestions that we could do to clarify some of the code, but do have some questions over whether some optimizations are necesssary
if is_out_of_date(str(build['build_version']), str(latest_version)): | ||
app_id = build['app_id'] | ||
if app_id not in app_name_by_id: | ||
app_name_by_id[app_id] = Application.get_db().get(app_id).get('name') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this optimization necessary? Having this module know about the structure of an application document seems like a violation of the Law of Demeter. Using Application.get(app_id).name
does not make as many assumptions. I get that using get_db()
avoids the extra wrap
call, but since we're caching in order to only ever fetch an application once, this seems like pre-mature optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching improved performance by 92%, avoid wrap application object further improved performance by 20%. So I think it is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely seems like the performance vs readability discussion. Now that it appears we no longer have the strict performance concerns (due to not populating the summary data), do you think we should emphasize the readability/maintainability side rather than the 20% performance gain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain more on why Application.get_db()
have readability issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the Law of Demeter, that I linked above. When modules start knowing about more than 1 level of internal structure of another module, things tend to get brittle -- unexpected errors crop up and the code is more difficult to maintain. I'm sure the link I provided above has a more full description. Application.get_db().get(app_id).get('name')
means that this enterprise module essentially knows about the data access layer. Not just how to query the database (Application.get_db().get(app_id)
) but also about how the database internally represents an Application. The application layer only generally wants to deal with fully constructed Application
objects, which gives us flexibility -- name
could be stored in an arbitrary manner in the database, but anyone interacting with the Application
object wouldn't need to know those details. With this code, we do.
corehq/apps/enterprise/enterprise.py
Outdated
if not latest_version: | ||
all_builds = self.get_app_builds(domain, app_id) | ||
latest_version = self.get_latest_build_version_at_time(all_builds, build_version_date) | ||
yield user, build, latest_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least for me, I think its that the return isn't particularly intuitive. I don't expect that I'll receive user
, build
, and latest_version
from _get_user_builds
. Given that it isn't obvious what the return value is, there might be some benefit to instead return a dictionary with those values clearly named -- so now _get_user_builds
returns some collection of data, and the calling function explicitly referencing the named values makes what its doing a bit easier to follow?
corehq/apps/enterprise/enterprise.py
Outdated
""" | ||
Get the latest build version available at the given time. | ||
|
||
:param domain: The domain of the app | ||
:param app_id: The application id | ||
:param time: A datetime object representing the date of the build version to compare against | ||
:return: The latest build version available at the given date | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these kind of docstrings don't add much value. The function is named get_latest_build_version_at_time
, which means the description of 'Get the latest build version available at the given time' is redundant. domain
is supposed to be standardized across our codebase to always represent the domain name as a string. app_id
is always a string. get_latest_build_version_at_time
suggests what time represents.
The return value is somewhat useful, as it seems we sometimes represent builds as strings, and othertimes as tuples -- but this docstring doesn't clarify that (and is something we should probably standardize without a comment)
Basically, we haven't standardized docstrings across our codebase, and I think we want to be a bit selective about only using them when needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from review: rename to get_latest_build_version
, and rename last parameter to at_datetime
and remove docstring? unless folks have other ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corehq/apps/enterprise/enterprise.py
Outdated
def get_app_builds(self, domain, app_id, limit=None, start=0): | ||
if app_id not in self.builds_by_app_id: | ||
app_es = ( | ||
AppES() | ||
.domain(domain) | ||
.is_build() | ||
.app_id(app_id) | ||
.sort('version', desc=True) | ||
.is_released() | ||
.source(['_id', 'version', 'last_released', 'built_on']) | ||
.start(start) | ||
) | ||
if limit: | ||
app_es = app_es.size(limit) | ||
self.builds_by_app_id[app_id] = app_es.run().hits | ||
|
||
return self.builds_by_app_id[app_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I am curious if this would be better written in the reverse -- i.e. immediately return the cached build if it exists, otherwise go to elasticsearch. The downside to this approach is that it indents all the complicated logic. I think we also tend to think of cached data as performing this way -- "Is the value cached? Return it. Otherwise, look up the data and return it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corehq/apps/enterprise/enterprise.py
Outdated
if build_id in self.build_by_build_id: | ||
build_info = self.build_by_build_id[build_id] | ||
else: | ||
# last_released is added in 2019, build before 2019 don't have this field | ||
# TODO: have a migration to populate last_released from built_on | ||
# Then this code can be modified to use last_released only | ||
released_date = build_doc['last_released'] or build_doc['built_on'] | ||
build_info = { | ||
'version': build_doc['version'], | ||
'last_released': DateTimeProperty.deserialize(released_date) | ||
} | ||
self.build_by_build_id[build_id] = build_info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this logic be better in its own function? This function has to do the loop comparison as well as handling the details of generating and caching build info. Seems like it might be cleaner to move the latter portion into a get_build_info
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create new function get_build_info
from build doc for lines 776 to 788
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corehq/apps/enterprise/enterprise.py
Outdated
def _find_latest_build_version_from_builds(self, all_builds, time): | ||
""" | ||
Get the latest build version at the time | ||
|
||
:param all_builds: List of raw build documents sorted by version in descending order | ||
:param time: A datetime object representing the date of the build version to compare against | ||
:return: The latest build version available at the given date. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is quite a bit of logic here that doesn't really interact with our databases. I'd like to see tests for this kind of thing in the future. I know personally I've overlooked edge cases when I don't write tests covering the different types of inputs. The other benefit is that these tests serve as documentation, which often make the docstrings here unnecessary -- people can look at tests to understand how the function can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from review: rename time
to at_datetime
, remove doctstring. add tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corehq/apps/enterprise/enterprise.py
Outdated
|
||
def total_for_domain(self, domain): | ||
app_ids = get_app_ids_in_domain(domain) | ||
total_last_builds = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
total_last_builds
was confusing to me. Would total_builds
convey the same thing? It makes more sense to me to think of how many total builds we have, and how many total builds are out of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 3a271f1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this was updated in a different location
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removed in 12a4ff6 since we're not displaying metric
corehq/apps/enterprise/enterprise.py
Outdated
builds = self.get_app_builds(domain, app_id, limit=self.INITIAL_QUERY_LIMIT) | ||
latest_build = self._find_latest_build_version_from_builds(builds, time) | ||
|
||
if latest_build is None: | ||
builds = self.get_app_builds(domain, app_id, start=self.INITIAL_QUERY_LIMIT) | ||
latest_build = self._find_latest_build_version_from_builds(builds, time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this optimization? My guess is that applications don't get more than a few hundred releases in the worst case, and even in that worst case, we're only fetching _id
, version
, last_released
, and build_on
. Essentially, I would think the time difference between fetching this data for all builds vs just the 10 most recent is negligible, and would allow us to simplify the code (no limit
or start
).
That said, if you did performance testing and needed to do this optimization, that's probably a great thing to document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But since we are not displaying a metric in front end. I guess I can remove that logic too. 08443df
It will worsen the performance by 20%
…docstring can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have some remaining questions on our performance tradeoffs and some of the caching
build_info = self.build_info_cache.get(build_id) | ||
if not build_info: | ||
# last_released is added in 2019, build before 2019 don't have this field | ||
# TODO: have a migration to populate last_released from built_on | ||
# Then this code can be modified to use last_released only | ||
released_date = build_doc.get('last_released') or build_doc['built_on'] | ||
build_info = { | ||
'version': build_doc['version'], | ||
'last_released': DateTimeProperty.deserialize(released_date) | ||
} | ||
self.build_info_cache[build_id] = build_info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the role of build_info_cache
here? It looks like we're already caching the raw results from the elasticsearch query. Then, within _find_latest_build_version_from_builds
, we just want to turn that raw doc into something we can use to determine whether or not this build's release date was prior to the datetime specified. A conversion function doesn't sound like it would need a cache. Was this something that came up while profiling? If we are caching these results to avoid repeated calls to DateTimeProperty.deserialize
, perhaps that is something we can do when we first cache the elasticsearch results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
build_info_cache
is to cache all required information of a build, mainly version number and last released date. I did this because I realized deserialize also have an impact on performance, so I want to do less deserialization.
perhaps that is something we can do when we first cache the elasticsearch results?
This will result in deserialize unnecessary build, for example, if an app already have 1000 builds, then it is very likely the first 500 build's last released date will never be required in the comparison then no need to deserialize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you haven't seen it yet, there's an infamous statement about "Premature optimization is the root of all evil".
I wrote up a quick performance test to gauge how big of an impact deserialization had:
from datetime import datetime
from dimagi.utils.dates import safe_strftime
from dimagi.utils.parsing import ISO_DATETIME_FORMAT
from dimagi.ext.couchdbkit import DateTimeProperty
import random
TIMESTAMP_FLOOR = 1_000_000_000
TIMESTAMP_CEIL = 1_730_000_000
def create_datetime_string():
return safe_strftime(
datetime.fromtimestamp(
random.randrange(TIMESTAMP_FLOOR, TIMESTAMP_CEIL)
), ISO_DATETIME_FORMAT)
def create_n_dates(n):
return [create_datetime_string() for i in range(n)]
def time_deserialize():
num_elements = 1000
all_dates = create_n_dates(num_elements)
deserialized_dates = [None] * num_elements
start = datetime.now()
for i in range(num_elements):
deserialized_dates[i] = DateTimeProperty.deserialize(all_dates[i])
end = datetime.now()
print(f'Took: {end - start}')
TL;DR -- the time to deserialize 1K datestrings, on my machine, took 10K to 15K microseconds -- i.e. milliseconds.
From the current code, it appears build_info_cache
only cares about the build number and last released date -- I don't see any other data being used. Given that deserialization has virtually no performance impact, I think it makes the code cleaner for get_app_builds
to return only the needed information (version, last_released_date) and do the deserialization immediately. This allows us to get rid of build_info_cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally find the commit where I decide to deserialize separately without wrapping the raw document: f94b4b3
I think because I cached the wrapped result initially, so when I change to only deserialize the datetime without wrapping it, I never thought of getting rid of the cache.
I will put up a PR to remove the cache.
if is_out_of_date(str(build['build_version']), str(latest_version)): | ||
app_id = build['app_id'] | ||
if app_id not in app_name_by_id: | ||
app_name_by_id[app_id] = Application.get_db().get(app_id).get('name') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely seems like the performance vs readability discussion. Now that it appears we no longer have the strict performance concerns (due to not populating the summary data), do you think we should emphasize the readability/maintainability side rather than the 20% performance gain?
corehq/apps/enterprise/enterprise.py
Outdated
for row_data in self._get_user_builds(domain, app_ids): | ||
version_in_use = str(row_data['build']['build_version']) | ||
latest_version = str(row_data['latest_version']) | ||
if is_out_of_date(version_in_use, latest_version): | ||
app_id = row_data['build']['app_id'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if this is something you need to act on, but wanted to bring it up -- I think I struggled to figure out what was going on here, because _get_user_builds
doesn't sound like it's going to be returning latest version information, and because row_data
doesn't really suggest what is being dealt with.
In pseudo-code, I'd imagine this is what we're trying to do:
retrieve app-last-usage information for all users in a domain
for each app-last-usage:
retrieve the most current version at the time the app was last used
if that version is not the same version as is in our information, then
add a row to our report with the relevant details
I'm not sure the above is any better, but if you feel there's some confusion with what is going on with row_data
, perhaps its worth addressing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I rename it to row_data
because build['build']['build_version]
looks weird. What do you think of change line 675 to for build_and_latest_version in self.all_last_builds_with_latest_version(domain, app_ids):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a good improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! df144d4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving to allow this tile to be live on production. I'm convinced this code works, but I hope we can clean up this code in the future to make it more readable and maintainable.
Product Description
Frontend
Report
Technical Summary
Ticket: https://dimagi.atlassian.net/browse/SAAS-16311
Review by commit. ( The commit history is basically how I improved the performance step by step )
This PR is to add a new Application Version Compliance Tile to the Enterprise Console.
This tile will iterate through all domain in this enterprise account, in each domain, iterate through all mobile workers, for each mobile worker, iterate through all their
last_build
inlast_builds
, (last_builds
include the last build of the application they installed), and see if the build is out of date, if it is, then append it to the report's row.Feature Flag
Safety Assurance
Safety story
Automated test coverage
QA Plan
Rollback instructions
Labels & Review