Sporadic 403 “You have exceeded a secondary rate limit” errors #276
Based on the error you shared, I'm guessing you have automatic branch updates enabled for this repository? That feature can definitely be expensive when a repository has many open pull requests that need to be updated. Each time a PR merges, Bulldozer needs to evaluate all open PRs to see if they need updates, and the required API calls can add up. While it may not fit with your workflow, you can reduce requests by making updates opt-in.

Otherwise, to further diagnose this, I think we'll want to look at the GitHub request logs. If you have debug logging enabled in Bulldozer, you can look at all of the GitHub request log messages. The request logs will also show the cache status of requests. Bulldozer maintains an in-memory cache for GitHub responses that is 50MB by default. One thing you could try is increasing the size of this cache and then checking the request logs to see if cache hits have increased:

```yaml
cache:
  max_size: "100MB"
```

Cache metrics can also be exported to Datadog, but unfortunately, we don't support any other metric exporters yet.
Thank you for the reply. The information you supplied is very helpful. Our current bulldozer configuration is:
Currently in our 30 repos, we have 57 pull requests with the “auto update” label and 11 with the “auto merge” label applied. Many Bulldozer operations run successfully; however, they start failing with the “403 You have exceeded a secondary rate limit. Please wait a few minutes before you try again” error, which I think indicates Bulldozer may be making too many API requests too quickly, so the secondary rate limit kicks in. Could 68 PRs being processed by Bulldozer cause this? Is there a way to slow down some of the Bulldozer operations, e.g., some delay variable setting? We do have debug configured, and I have access to the Bulldozer logs with a history of calls. I am attaching a compressed log, with our org and repo names replaced, but it has all other data. Please let me know if you need any more information. In the attached log, you'll see that "OUR_REPO_7" seems to have more issues. That is our biggest repo and has the most PRs and therefore the most Bulldozer operations.
Thanks for the logs. I found at least one request, That seems like what happened here: the first API request made after the 11th

When we process PRs for updates, we just run through them in a loop, as fast as possible, so it's not too surprising that updating many PRs could trigger the secondary limits. We should probably add a delay to this processing, either fixed or dynamic based on the number of PRs to process. I'll need to think about this a bit more to make sure there are no negative consequences to adding the delay, but if there aren't, it should be a relatively simple change.
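A rough sketch of that idea, using made-up types and function names rather than Bulldozer's actual code:

```go
package bulldozer

import (
	"context"
	"log"
	"time"
)

// PullRequest is a minimal stand-in for the pull request data Bulldozer tracks.
type PullRequest struct {
	Number int
}

// updateAllOpenPRs processes every open PR, pausing a fixed delay between
// updates so write requests are spread out instead of issued as fast as possible.
func updateAllOpenPRs(ctx context.Context, prs []PullRequest, delay time.Duration,
	update func(context.Context, PullRequest) error) {
	for i, pr := range prs {
		if i > 0 {
			// Throttle between updates to avoid tripping secondary rate limits.
			time.Sleep(delay)
		}
		if err := update(ctx, pr); err != nil {
			log.Printf("failed to update PR #%d: %v", pr.Number, err)
		}
	}
}
```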
Thank you. The ability to have some kind of delay could help. As you can see from our bulldozer.yml file, do you think us having "auto merge" in both the "merge" and "update" triggers could be causing this behavior? We added this label to the "update" trigger so it would force updates without requiring developers to set both the "auto update" and "auto merge" labels. The c7k1sfpsbsbq0kn51rq0 appears to be an old Draft PR. I just tried adding the "ignore_drafts: true" option under the "update" section, but the Bulldozer logs show the following errors:

We are using the latest "develop" branch of Bulldozer. On another note, a nice feature would be the ability to exclude certain PRs from being processed at all. UPDATE: I removed the "auto merge" trigger label from the update section, but we still saw the errors, so I put back our original configuration.
A few other options:
Just a thought.
Yet another option, perhaps even cleaner: use a custom label on PRs that need to be ignored by Bulldozer. Then here: bulldozer/pull/pull_requests.go (line 70 in 3ae63f8)
It'd require organizations to have a mechanism in place to apply that label to their PRs, but it'd avoid having to artificially sleep anywhere between sending requests.
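A minimal sketch of that suggestion, with a hypothetical label name and stand-in types rather than the actual code at that line:

```go
package bulldozer

import "strings"

// labeledPR is a minimal stand-in for a pull request and its labels.
type labeledPR struct {
	Number int
	Labels []string
}

// ignoreLabel is a hypothetical label that tells Bulldozer to skip a PR entirely.
const ignoreLabel = "no bulldozer"

// filterIgnored drops PRs carrying the ignore label before any update
// processing, so no API requests are ever made for them.
func filterIgnored(prs []labeledPR) []labeledPR {
	var kept []labeledPR
	for _, pr := range prs {
		ignored := false
		for _, l := range pr.Labels {
			if strings.EqualFold(l, ignoreLabel) {
				ignored = true
				break
			}
		}
		if !ignored {
			kept = append(kept, pr)
		}
	}
	return kept
}
```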
It's strange that you got an error trying to use the ignore_drafts option.

For the original issue, I've proposed two changes that might help:
You’re correct. We spin up Bulldozer docker containers using docker compose, and I froze us to version 1.10.1. We once pulled in the latest docker container and our config broke, so I locked us to that release. Thanks for the fixes.

To reproduce the original problem, I created 80 pull requests with the “auto update” label.

TEST1: curl -s http://localhost:8089/api/health

I updated our base branch in GitHub so the 80 test PRs (with the “auto update” label set) would trigger updates. Approx 1 minute in, I started seeing the same “secondary rate limit” errors.

TEST2: curl -s http://localhost:8089/api/health

This PR does fix the issue. I did not see any “secondary rate limit” errors; however, for my 80 test PRs, it took over an hour to update all of them. I do see your new messages in the log.

LAST TEST: I created 20 more PRs, for a total of 100 PRs for my test, and there were still no errors. It looks like adding a small 2 second (or possibly less) delay is all that may be required to get around this problem.

I am attaching the Bulldozer logs so you can see all the operations. One has your 60 second max delay default and the other is the 2 second max delay. Would it be easy to make the “PullUpdateMaxDelay” configurable just in case the hardcoded value does not work for everyone? Thanks again for the quick turnaround on these fixes. We depend on Bulldozer and it makes developers' lives a lot easier around here.
Thanks for the extensive testing! Based on your results, I'll definitely drop the default max delay to something much lower, like the 2 seconds you tested with. I'll also consider adding a configuration property, although I'd prefer to keep things simple if there's a fixed value that will work in almost all cases.
I've been running this change in our production environment since yesterday, and I just saw the errors again. I'm sorry, but the 2 second change may not be enough. I'm going to try the value "PullUpdateMaxDelay = 5 * time.Second" for the new container and continue to monitor. I am attaching the logfile with the 2s delay so maybe you can see where the issue could be. I'm hoping we don't have to use the full 60s because of how long it takes to process all our open PRs.
Sorry, I clicked the close link by accident and reopened. We saw more secondary rate limit errors with the 5 second max setting. Trying again with "PullUpdateMaxDelay = 15 * time.Second".
Thanks for the continued testing. I updated the PR yesterday to use a fixed 2s delay instead of exponential backoff (with a lower max value, the backoff seemed to happen too fast to make much of a difference), but I also added a configuration property to adjust the delay value. Based on your testing so far, do you think a fixed but configurable delay will work for you (if you can find the appropriate value)?
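If that configuration property lands, a minimal sketch of how such a knob might look; the field name, YAML tag, and defaulting below are assumptions, not the actual Bulldozer option:

```go
package bulldozer

import "time"

// DefaultPullUpdateDelay is a fallback pause between PR update requests.
const DefaultPullUpdateDelay = 2 * time.Second

// WorkerConfig is a hypothetical config section exposing the delay as a setting.
type WorkerConfig struct {
	// PullUpdateDelay overrides the default pause between PR update requests.
	PullUpdateDelay time.Duration `yaml:"pull_update_delay"`
}

// pullUpdateDelay returns the configured delay, falling back to the default.
func (c WorkerConfig) pullUpdateDelay() time.Duration {
	if c.PullUpdateDelay > 0 {
		return c.PullUpdateDelay
	}
	return DefaultPullUpdateDelay
}
```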
I’m still not clear why my test repository with 100 test PRs worked fine with your update-delay branch. I tried setting

Also, when we created our "bulldozer" GitHub app, what/whose API key does this app use?
I pulled your latest bkeyes/update-delay branch (commit 44c959d). I changed your DefaultPullUpdateDelay = 2 * time.Second back to DefaultPullUpdateDelay = 60 * time.Second because we were seeing the secondary rate limit errors with the 2 second delay. There was just an update to our repo base branch that caused Bulldozer to run through all PRs, and we're still seeing the rate limit errors. I don't understand why our production repo is hitting this issue when our test repo did not. I looked at the https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-abuse-rate-limits page and there is a line that says "If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request." Could we be hitting this? I am attaching the Bulldozer log for you. Please let me know if you need any more information.
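As an illustration of that guideline (this is a hypothetical wrapper, not Bulldozer's actual GitHub client), an HTTP transport could throttle write methods to roughly one request per second:

```go
package bulldozer

import (
	"net/http"

	"golang.org/x/time/rate"
)

// throttledTransport delays POST, PATCH, PUT, and DELETE requests so at most
// one write is sent per second, matching GitHub's "wait at least one second
// between each request" guidance for content-generating calls.
type throttledTransport struct {
	base    http.RoundTripper
	limiter *rate.Limiter
}

func (t *throttledTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	switch req.Method {
	case http.MethodPost, http.MethodPatch, http.MethodPut, http.MethodDelete:
		// Block until the limiter allows another write request.
		if err := t.limiter.Wait(req.Context()); err != nil {
			return nil, err
		}
	}
	return t.base.RoundTrip(req)
}

// newThrottledClient wraps an http.Client so all GitHub write calls share one limiter.
func newThrottledClient(base *http.Client) *http.Client {
	tr := base.Transport
	if tr == nil {
		tr = http.DefaultTransport
	}
	return &http.Client{
		Transport: &throttledTransport{
			base:    tr,
			limiter: rate.NewLimiter(rate.Limit(1), 1), // 1 write per second
		},
	}
}
```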
Do you have an update on this?
The GitHub API doc for pull request merging warns of secondary rate limiting: https://docs.github.com/en/rest/reference/pulls#merge-a-pull-request
Our org currently has Bulldozer configured for 30 repos and 300+ open pull requests. We are seeing sporadic “403 You have exceeded a secondary rate limit” errors in our Bulldozer logs. Is it possible that Bulldozer is making too many concurrent API requests and GitHub is throwing the 403 error? An example of the complete error is: