Rate limit devices per user on short window #35613
Draft
Product Description
An HTTP 406 response is returned when this limit is reached, which allows us to send a custom, non-translatable message. However, given that the mobile worker will be connected to the internet, I'm optimistic they will be able to translate it or send it to someone who can make better use of the error message if needed. A 429 response would also have fit here and is translatable, but its message is too vague in my opinion. With a 406, we can make it clear to the user that it is usage for this specific user that is too high.
Based on the Android code, I'm pretty sure this will be displayed similarly on mobile, but I will confirm with a physical device.
Technical Summary
https://dimagi.atlassian.net/browse/SAAS-16355
Initial PR: #35515
Note for reviewers
By default, the device rate limiter is not enabled, but the code is in place to have the ability to enable this via update-config. Review this code as if it were being enabled immediately, but note that there will be time after this is deployed to collect metrics in Datadog to determine how the current limits fit into current usage.
Summary
The changes in this PR track usage for each user action that leads to updating user metadata which are:
If the combined usage on these endpoints reaches 10 unique device IDs within a fixed one-minute window, any further requests for that user within that minute from a new device ID are rejected. Requests from device IDs already seen in that window are still accepted. The exceptions are device IDs that are None or blank, and device IDs from Web Apps, which are never rate limited.
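The rule above can be sketched as a small fixed-window limiter. This is an illustrative, in-memory sketch only, not the DeviceRateLimiter class from this PR: the class name, method name, and the WEB_APPS_DEVICE_ID placeholder value are all hypothetical, and a real deployment would presumably keep the per-window sets in a shared cache with key expiry rather than in process memory.

```python
import time
from collections import defaultdict

DEVICE_LIMIT_PER_WINDOW = 10   # limit stated in this PR description
WINDOW_SECONDS = 60            # fixed one-minute window
WEB_APPS_DEVICE_ID = "web-apps-placeholder"  # hypothetical value, for illustration only


class InMemoryDeviceRateLimiter:
    """Hypothetical sketch: tracks unique device IDs per user per fixed window."""

    def __init__(self, limit=DEVICE_LIMIT_PER_WINDOW, window=WINDOW_SECONDS,
                 clock=time.time):
        self._limit = limit
        self._window = window
        self._clock = clock  # injectable so the window can be controlled in tests
        # (user_id, window_index) -> set of device IDs seen in that window
        self._seen = defaultdict(set)

    def _is_exempt(self, device_id):
        # None, blank, and Web Apps device IDs are never rate limited.
        if device_id is None or not device_id.strip():
            return True
        return device_id == WEB_APPS_DEVICE_ID

    def should_reject(self, user_id, device_id):
        if self._is_exempt(device_id):
            return False
        # Fixed windows: every timestamp in the same minute maps to one index.
        window_index = int(self._clock() // self._window)
        devices = self._seen[(user_id, window_index)]
        if device_id in devices:
            return False  # a device already counted in this window is allowed
        if len(devices) >= self._limit:
            return True   # limit reached: reject any new device ID
        devices.add(device_id)
        return False


if __name__ == "__main__":
    limiter = InMemoryDeviceRateLimiter()
    results = [limiter.should_reject("worker1", f"device-{i}") for i in range(11)]
    print(results)  # first ten are False, the eleventh is True within one window
```

A fixed window is simple to reason about, but because the counter resets on the minute boundary, a user could be seen on up to 20 distinct devices across two adjacent windows.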
Feature Flag
Safety Assurance
Safety story
Based on Clayton's comment, here are the considerations outlined for this change.
How will users become aware of this change?
Assuming that this limit is currently being reached or exceeded (we won't know until metrics are collected), here is how users will become aware.
Mobile users will attempt an action like a sync or form submission and receive an error message explaining, in English, that current usage for this user is too high and that they should try again in a minute. Based on numbers pulled for devices used per user in a day (hardly ever more than 100 devices), in most cases where this limit is hit, trying again in a minute should succeed. If, however, there are thousands of devices attempting to take an action at roughly the same time, it will be a painful process to get every device to restore or submit successfully. We suspect this scenario is limited to large events like trainings.
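To put a rough number on "painful", here is a back-of-the-envelope sketch. It assumes the best case: each one-minute window admits exactly 10 new device IDs, every device needs only one successful request, and devices that already succeeded do not come back and take a slot. The function name is hypothetical, and real retry behavior would likely make this slower.

```python
import math

DEVICE_LIMIT_PER_WINDOW = 10  # limit stated in this PR description


def min_minutes_to_admit(device_count, limit=DEVICE_LIMIT_PER_WINDOW):
    """Best-case number of one-minute windows for every device to succeed once."""
    return math.ceil(device_count / limit)


if __name__ == "__main__":
    for device_count in (10, 100, 1000, 5000):
        print(device_count, "devices ->", min_minutes_to_admit(device_count), "minutes")
```

Under these assumptions, 100 devices sharing one user need at least 10 minutes, and 1000 devices need at least 100 minutes.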
How are other legitimate use cases impacted?
Apart from one user being active on several physical devices at once, the following use cases can also produce multiple device IDs for a single user:
Uninstalling and reinstalling the CommCare app
It seems nearly impossible to uninstall and reinstall CommCare 9 times in one minute, let alone log in as your user and trigger a restore or submission each time, so this scenario should not be a concern.
Clearing user data
Clearing user data is easier to do frequently, and could lead to frequent device ID changes, but again, a user would have to do this 9 times in one minute. That seems well past the point of debugging an issue.
Multiple CommCare app installations
Running with multiple installations of CommCare seems to be the most plausible for actually sending off 10 requests from 10 different installations for one user, but just because it is plausible doesn't mean it makes sense. I'm not too familiar with the use cases for different installations, but I imagine it is testing behavior between different mobile app versions or something along those lines, which should be limited to a few different device IDs at most.
Does this impact our offering?
I would say this does not impact our offering, but it does make it more painful to use the product in a way we did not intend (a significant number of devices for one user). This is a temporary rate limit based on what we, SaaS, consider to be a reasonable number of devices per user in a one-minute window.
Automated test coverage
Created tests for the DeviceRateLimiter class (corehq.apps.users.tests.test_device_rate_limiter.TestDeviceRateLimiter)
QA Plan
Rollback instructions
Labels & Review
Still TODO