
Updated key rotation to be more random #2504

Merged: 48 commits merged into dev from dev-stress-test on Sep 1, 2023

Conversation

@sgoggins (Member) commented Aug 28, 2023

Description

  • API keys were not being rotated, so jobs were getting stuck.
  • This PR fixes the glitch.
  • This PR now also fixes an old issue with how commit_count was retrieved by the repo_info task. Previously we selected the main branch; the new code gets the default branch. I verified that commits were being stored as NULL in the metadata in ~25% of cases, and in any Augur collection 5-10% of repos do not have main as their default branch. Using the default branch always retrieves commit_count correctly: no more NULL values, and no more counting commits on a branch that is not the default.
  • This PR now also fixes the refreshability of materialized views by adding unique keys on all but one of them and enabling concurrent refreshes on all but one (a sketch of the pattern follows this list). It also removes 3 libyear-related materialized views that need to be refactored.
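As a rough illustration of that materialized-view pattern (the view name, column, and connection string below are placeholders, not Augur's actual views or migration code): PostgreSQL only allows REFRESH MATERIALIZED VIEW CONCURRENTLY once the view has a unique index, and the concurrent form does not block readers during the refresh.

# Minimal sketch only; "example_mv", "repo_id", and the DSN are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=augur")
conn.autocommit = True  # REFRESH ... CONCURRENTLY cannot run inside a transaction block
with conn.cursor() as cur:
    # One-time setup: a unique index is required before concurrent refresh is allowed.
    cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS example_mv_uidx ON example_mv (repo_id);")
    # After that, the view can be refreshed without locking out readers.
    cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY example_mv;")
conn.close()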

sgoggins added the bug (Documents unexpected/wrong/buggy behavior) and bug-fix (Fixes a bug) labels on Aug 28, 2023
@ABrain7710 (Contributor) left a comment


I'm unsure there is an issue with GitHub API keys being assigned randomly. I say this because the current implementation is as simple as it gets: we store a list of keys and assign a random key to every request we make. So unless the random.choice function isn't working, I don't think this is the issue.
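For reference, a minimal sketch of the behavior described above (the key values are hypothetical, and this is not Augur's actual handler code):

import random

api_keys = ["key_a", "key_b", "key_c"]  # hypothetical key list

def key_for_next_request():
    # Each outgoing request independently picks a random key from the list.
    return random.choice(api_keys)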

# This will get a random index from the list and remove it,
# decreasing the likelihood of everyone using the same key all the time
#redis.rpop(self.redis_list_key)
redis.spop(self.redis_list_key)

Contributor

This won't do anything, because this isn't how we are getting the API keys from Redis.

Member Author (sgoggins)

spop does a randomized retrieval, while rpop pops in list order. The issue with relying on order is that we will never know, and don't want to try to know, how big each job is before we run it. Randomizing balances out our key usage. I proved this by changing it, running it, and seeing a HUGE gain in collection rate.
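A minimal sketch of the Redis semantics in play (the key names and values are illustrative, not Augur's): RPOP pops from the tail of an ordered list, while SPOP removes a random member of a set.

import redis

r = redis.Redis()  # connection details are assumptions

# Lists are ordered: starting from the same list, every worker drains the
# keys in the same deterministic sequence.
r.rpush("demo:key_list", "key_a", "key_b", "key_c")
print(r.rpop("demo:key_list"))   # b'key_c' every time

# Sets are unordered: SPOP hands back a random member, so concurrent
# workers are far less likely to converge on the same key.
r.sadd("demo:key_set", "key_a", "key_b", "key_c")
print(r.spop("demo:key_set"))    # a random key each run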

@@ -16,6 +17,7 @@ def __init__(self, session: DatabaseSession, logger):

# gets the github api keys from the database via the GithubApiKeyHandler
github_api_keys = GithubApiKeyHandler(session).keys
github_api_keys = random.sample(github_api_keys, len(github_api_keys))

Contributor

I believe the goal here is to randomize the order. If so, random.shuffle should be used, since random.sample is meant for taking a random subset of a list. random.sample(keys, len(keys)) will randomly order the list, but it makes the goal less clear.

Member Author (sgoggins)

In short, we get more durable randomness from random.sample than from random.shuffle. random.shuffle reorders the original list, but it does this in memory; it doesn't reorder what is stored in Redis. random.sample does a "better" job of getting us novel keys from the list every time. I tested both, and I was reusing the same key about 3x as often with shuffle as with sample.

From Here: https://blog.enterprisedna.co/python-shuffle-list/

It’s essential to note that the shuffle() function returns None and modifies the original list or array. Therefore, it’s unsuitable for cases where you must maintain the original list order.

To return a new list containing elements from the original list without modifying it, you can use the sample() function from the random library:
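A minimal illustration of the difference (hypothetical key values, not taken from the blog post):

import random

keys = ["key_a", "key_b", "key_c", "key_d"]  # hypothetical API keys

# random.shuffle reorders the list in place and returns None.
random.shuffle(keys)

# random.sample(keys, len(keys)) leaves its input untouched and returns a
# new, randomly ordered copy -- the form used in the random.sample diff above.
shuffled_copy = random.sample(keys, len(keys))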

    self.logger.debug(f'valid keys AFTER shuffle: {valid_keys}')
except Exception as e:
    self.logger.debug(f'{e}')
    valid_keys = valid_now
Contributor

This only gets executed the first time the GithubApiKeyHandler is created; after that the keys will be stored in Redis, so this function will return early in the if redis_keys block.
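Roughly the shape of the control flow being described (the names below are simplified stand-ins, not Augur's exact code):

import random
import redis

r = redis.Redis()
REDIS_KEY_LIST = "oauth_key_cache"   # stand-in key name

def load_keys_from_database():
    # stand-in for the real database query discussed in this thread
    return random.sample(["key_a", "key_b", "key_c"], 3)

def get_api_keys():
    cached = [k.decode() for k in r.smembers(REDIS_KEY_LIST)]
    if cached:
        # Cached path: every call after the first returns here...
        return cached
    # ...so the database load (and any randomization of its order) only
    # happens the very first time the keys are pulled.
    keys = load_keys_from_database()
    r.sadd(REDIS_KEY_LIST, *keys)
    return keys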

Member Author (sgoggins)

That's ok. What was happening is we'd always get the keys in table order, so every worker started with the same list, which made the first hour after any restart go slower because it ran out of keys (at least under the previous, non-randomized configuration). By randomizing how Redis is first populated, we don't start with the same list every time.

 where = [WorkerOauth.access_token != self.config_key, WorkerOauth.platform == 'github']

-return [key_tuple[0] for key_tuple in self.session.query(select).filter(*where).all()]
+return [key_tuple[0] for key_tuple in self.session.query(select).filter(*where).order_by(func.random()).all()]
Contributor

I don't think this is needed, since it will just randomize the order of the keys the first time they are cached in Redis when the GithubApiKeyHandler is created. So the keys will simply be stored in Redis in a different order.

Member Author (sgoggins)

That's ok. What was happening is we'd always get the keys in table order, so every worker started with the same list, which made the first hour after any restart go slower because it ran out of keys (at least under the previous, non-randomized configuration). By randomizing how Redis is first populated, we don't start with the same list every time.

@sgoggins (Member Author)

@IsaacMilarky
@ABrain7710

I was surprised by how much more effectively introducing randomization in all three places, rather than in any one of them, reduced or eliminated the periods where the API rate limit was exceeded.

I tried each of them one at a time, and the triple shuffle hits API key stalls less than half as often as any single or combined shuffle.

NOTE: This PR now also includes a fix to the repo_info task. Previously we were pulling commits from the main branch (actually, it was worse than that: it was master), so we were missing about 25% of commit counts and sometimes counting commits on a branch that was not the default. Now the count is accurate and comes from the default branch, so the alignment with our facade commit counting will be exact. This makes it a lot easier to verify commit collection accurately.
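As a rough sketch of the idea (this is not the repo_info task's actual code; the repo, the token placeholder, and the commit-counting trick are illustrative assumptions layered on the REST API's documented default_branch field):

import requests
from urllib.parse import urlparse, parse_qs

headers = {"Authorization": "token <YOUR_TOKEN>"}   # placeholder token
base = "https://api.github.com/repos/chaoss/augur"  # example repository

# The repository object reports the default branch; never assume "main" or "master".
default_branch = requests.get(base, headers=headers).json()["default_branch"]

# With per_page=1, the page number in the Link header's "last" URL equals the
# total number of commits on that branch.
resp = requests.get(f"{base}/commits",
                    params={"sha": default_branch, "per_page": 1},
                    headers=headers)
last_url = resp.links.get("last", {}).get("url")
commit_count = int(parse_qs(urlparse(last_url).query)["page"][0]) if last_url else len(resp.json())
print(default_branch, commit_count)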

sgoggins changed the base branch from main to dev on August 30, 2023 at 15:28
@sgoggins (Member Author) commented Aug 30, 2023

I'm unsure there is an issue with GitHub API keys being assigned randomly. I say this because the current implementation is as simple as it gets: we store a list of keys and assign a random key to every request we make. So unless the random.choice function isn't working, I don't think this is the issue.

I noted this in my specific comments, but: I went through several iterations trying to get all the data fully collected. The three different shuffles occur at different points in execution, and together they cause key reuse at such a low rate that, with a sufficient number of keys, we hardly ever run out, even on a pretty large database.

Individually or in pairs, we reused the same key more often. Given how unpredictable job size is, and the fact that the triple approach runs with far fewer API key stalls, I think randomization is close to an optimal way to do this; it worked really well on a large collection.

@ABrain7710 (Contributor)

@sgoggins I made an additional observation on the key rotation issue in the discord here: https://discord.com/channels/839539671671504907/915622840798158859/1146192017420996738

Subsequent commits on the branch cover the materialized-view work: adding unique keys to enable concurrent refreshes and end blocking during refresh; changing all but one view to refresh concurrently (still looking for what may be unique on that one); and removing some libyear-related views that aren't quite what we want.
sgoggins merged commit 51bc92f into dev on Sep 1, 2023
1 check passed
sgoggins deleted the dev-stress-test branch on September 27, 2023 at 17:40