Updated key rotation to be more random #2504
Conversation
I'm not sure there is an issue with GitHub API keys being assigned randomly. I say this because the current implementation is as simple as it gets: we store a list of keys and assign a random key to every request we make. So unless `random.choice` isn't working, I don't think this is the issue.
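For reference, a minimal sketch of the current scheme as described above (the names are illustrative, not the actual Augur code):

```python
import random

api_keys = ["key1", "key2", "key3"]  # the stored list of keys

def key_for_request() -> str:
    # Pick an independent random key for every request.
    return random.choice(api_keys)
```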
```python
# This will get a random index from the list and remove it,
# decreasing the likelihood of everyone using the same key all the time
# redis.rpop(self.redis_list_key)
redis.spop(self.redis_list_key)
```
This won't do anything, because this isn't how we retrieve the API keys from Redis.
`spop` does a randomized retrieval, while `rpop` pops in list order. The problem with relying on order is that we never know, and don't want to try to predict, how big each job is before we run it. Randomizing balances out our key usage counts. I verified this by making the change, running it, and seeing a huge gain in collection rate.
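To illustrate the distinction, a minimal redis-py sketch (the key names are illustrative). One caveat worth noting: `RPOP` operates on a Redis list, while `SPOP` operates on a Redis set, so the data structure stored under the key has to match the command used:

```python
import redis

r = redis.Redis()

# Ordered retrieval: RPOP pops from the tail of a list, so with a fixed
# list every worker sees the same sequence.
r.rpush("api_keys_list", "key1", "key2", "key3")
print(r.rpop("api_keys_list"))  # always b"key3" first

# Randomized retrieval: SPOP removes and returns a random member of a set.
r.sadd("api_keys_set", "key1", "key2", "key3")
print(r.spop("api_keys_set"))   # a different key on different runs
```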
```diff
@@ -16,6 +17,7 @@ def __init__(self, session: DatabaseSession, logger):

     # gets the github api keys from the database via the GithubApiKeyHandler
     github_api_keys = GithubApiKeyHandler(session).keys
+    github_api_keys = random.sample(github_api_keys, len(github_api_keys))
```
I believe the goal here is to randomize the order. If so, `random.shuffle` should be used, since `random.sample` is meant to get a random subset of a list. `random.sample` will still randomly order the list, but it makes the intent less clear.
In short, we get more durable randomness from `random.sample` than from `random.shuffle`. `random.shuffle` reorders the original list, but only in memory; it doesn't reorder what is stored in Redis. `random.sample` does a better job of getting us novel keys from the list every time. I tested both, and I was reusing the same key about 3x as often with `shuffle` as with `sample`.
From https://blog.enterprisedna.co/python-shuffle-list/:

> It's essential to note that the shuffle() function returns None and modifies the original list or array. Therefore, it's unsuitable for cases where you must maintain the original list order.
>
> To return a new list containing elements from the original list without modifying it, you can use the sample() function from the random library:
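To make the difference concrete, here is a minimal sketch (mine, not from the blog; the key names are illustrative):

```python
import random

keys = ["key1", "key2", "key3"]

# random.shuffle reorders the list in place and returns None.
random.shuffle(keys)

# random.sample(keys, len(keys)) returns a NEW randomly ordered list,
# leaving the original untouched.
reordered = random.sample(keys, len(keys))
print(reordered)
```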
```python
        self.logger.debug(f'valid keys AFTER shuffle: {valid_keys}')
    except Exception as e:
        self.logger.debug(f'{e}')
        valid_keys = valid_now
```
This only gets executed the first time the GithubApiKeyHandler is created. After that, the keys will be stored in Redis, so this function will return early in the `if redis_keys:` block.
That's OK. What was happening is we'd always get the keys in table order, so by randomizing how Redis is first populated we don't start with the same list every time. Starting with the same list makes the first hour after any restart go slower because it runs out of keys (at least going from the previous, non-randomized configuration to the changes I made).
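A minimal sketch of the flow being described, assuming redis-py and illustrative names (the real logic lives in GithubApiKeyHandler):

```python
import random
import redis

r = redis.Redis()
REDIS_LIST_KEY = "github_api_keys"  # illustrative key name

def load_keys_from_database() -> list[str]:
    # Stand-in for the database query; returns keys in table order.
    return ["key1", "key2", "key3"]

def get_api_keys() -> list[str]:
    cached = r.lrange(REDIS_LIST_KEY, 0, -1)
    if cached:
        # Every call after the first returns here, so the randomization
        # below only affects how Redis is initially populated.
        return [k.decode() for k in cached]
    keys = load_keys_from_database()
    keys = random.sample(keys, len(keys))  # randomize before caching
    r.rpush(REDIS_LIST_KEY, *keys)
    return keys
```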
```diff
 where = [WorkerOauth.access_token != self.config_key, WorkerOauth.platform == 'github']

-return [key_tuple[0] for key_tuple in self.session.query(select).filter(*where).all()]
+return [key_tuple[0] for key_tuple in self.session.query(select).filter(*where).order_by(func.random()).all()]
```
I don't think this is needed, as it will just randomize the order of the keys that are cached in Redis the first time the GithubApiKeyHandler is created. So the keys in Redis will simply be stored in a different order.
That's OK. Same rationale as above: we'd always get the keys in table order, so by randomizing how Redis is first populated we don't start with the same list every time. Starting with the same list makes the first hour after any restart go slower because it runs out of keys (at least going from the previous, non-randomized configuration to the changes I made).
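For reference, a self-contained sketch of `ORDER BY RANDOM()` via SQLAlchemy's `func.random()`, using a hypothetical stand-in table (the real query targets WorkerOauth). Note that `func.random()` maps to `RANDOM()` on PostgreSQL and SQLite, while MySQL would need `func.rand()`:

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ApiKey(Base):
    # Hypothetical stand-in for the WorkerOauth table.
    __tablename__ = "api_keys"
    id = Column(Integer, primary_key=True)
    access_token = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([ApiKey(access_token=f"key{i}") for i in range(5)])
    session.commit()
    # Rows come back in a random order on every call.
    rows = session.query(ApiKey.access_token).order_by(func.random()).all()
    print([row[0] for row in rows])
```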
I was surprised by how much more effectively introducing randomization in all three places, rather than in any one of them, extended or eliminated periods where the API rate limit was exceeded. I tried each of them one at a time, and the triple shuffle hits API key stops less than half as often as any single or combined shuffle. NOTE: This PR now also includes a fix to the repo_info task. Previously we were pulling the `main` branch; the new code gets the default branch (see the description below).
I noted this in my specific comments, but I went through several iterations trying to get all the data fully collected. The three different shuffles occur at different points in execution and keep key reuse at such a low rate that, with a sufficient number of keys, we hardly ever run out on a pretty large database. Individually or in pairs, we reused the same key more often. So, given how unpredictable job size is, and that the triple approach runs with far fewer API key stalls, I think randomization is close to an optimal way to do this; it worked really well on a large collection.
@sgoggins I made an additional observation on the key rotation issue in Discord here: https://discord.com/channels/839539671671504907/915622840798158859/1146192017420996738
Adding unique keys to enable concurrent refreshes and end blocking during refresh; changed all but one to do a concurrent refresh. Still looking for what may be unique. Got rid of some Libyear-related views that aren't quite what we want.
Description

Previously the repo_info task was pulling the `main` branch; the new code gets the default branch. I have verified that commits were getting stored in the metadata as NULL in ~25% of cases, and that 5-10% of repos in any Augur collection have a default branch other than `main`. Using the default branch always gets commit_count data correctly: no more NULL values, and no more counting the "not default branch".
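For illustration, a minimal sketch of fetching a repository's default branch via the GitHub GraphQL API's `defaultBranchRef` field (the field name appears in this PR's commit log; the helper function and its error handling here are my own illustration, not the code in this PR):

```python
import requests

QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    defaultBranchRef { name }
  }
}
"""

def get_default_branch(owner: str, name: str, token: str) -> str:
    # Ask GitHub for the repo's actual default branch instead of assuming "main".
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"owner": owner, "name": name}},
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["repository"]["defaultBranchRef"]["name"]
```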