[aw-failures] P0 recurrence: GitHub App installation rate-limit exhaustion blocks safe_outputs (2026-05-08 ~16:46–17:04 UTC) #31079

@github-actions

Executive Summary

The GitHub App installation rate-limit exhaustion failure cluster, previously closed via #29318 (2026-04-30) and #29559 (2026-05-01 jitter fix), has recurred today. Within roughly 20 minutes (16:46–17:04 UTC on 2026-05-08), the installation token quota was observed at 0/15000 and safe_outputs failed in at least 2 main-branch agentic workflows. The agent jobs all completed successfully; the work was lost in the post-agent push step.

This cluster has the highest blast radius of any failure observed in the 6h window because the agent already paid full token cost and produced valid output before the safe_outputs layer dropped it.

Failure clusters (rate-limit cluster only)

| Workflow | Run | Started (UTC) | Outcome |
| --- | --- | --- | --- |
| Documentation Unbloat | §25568028328 | 16:51 | safe_outputs failed: create_pull_request rate-limited, fallback-as-issue disabled, work lost |
| Slide Deck Maintainer | §25567564480 | 16:41 | safe_outputs failed: PR creation and fallback issue both rate-limited, work lost |

Evidence

Documentation Unbloat — safe_outputs log (job §75058317395)
17:04:38  Processing create_pull_request: title=docs: trim bloat from multi-repo examples (-44%), bodyLength=2005
17:04:38  ##[warning]Failed to fetch repository default branch: API rate limit exceeded for installation. ... timestamp 2026-05-08 17:04:37 UTC
17:04:39  ##[warning]pushSignedCommits: GraphQL signed push failed, falling back to git push: API rate limit exceeded for installation. ...
17:04:41  ##[warning]Failed to create pull request: API rate limit exceeded for installation. ...
17:04:41  ##[error]fallback-as-issue is disabled - not creating fallback issue
17:04:41  ##[error]✗ Message 2 (create_pull_request) failed: API rate limit exceeded for installation
17:04:41  ##[warning]⚠️ Code push operation 'create_pull_request' failed — remaining safe outputs will be cancelled
17:04:41  Failed: 1
17:04:41    Types: upload_asset
17:04:41  ##[error]1 safe output(s) failed

Slide Deck Maintainer — safe_outputs log (job §75055609089)
16:46:46  ##[warning]⚠️ Rate-limit headroom low: 0/15000 requests remaining (0%) [safe_outputs_pre_check]
16:46:46  ##[warning]Failed to fetch repository default branch: API rate limit exceeded for installation
16:46:46  ##[warning]pushSignedCommits: GraphQL signed push failed, falling back to git push: API rate limit exceeded
16:46:48  ##[warning]Failed to create pull request: API rate limit exceeded for installation
16:46:48  ##[error]Failed to create both pull request and fallback issue. PR error: API rate limit exceeded ... Issue error: API rate limit exceeded
16:46:48  ##[error]✗ Message 1 (create_pull_request) failed

Existing issue correlation

The jitter fix in #29559 reduced but did not eliminate rate-limit exhaustion from concurrent bursts. Today's two failing runs started at 16:41 and 16:51 UTC, which suggests that scheduled workflows still land within about ten minutes of each other (or that other unjittered code-push workflows piled on top).

Proposed fix roadmap

P0 — stop dropping work

  • When safe_outputs push receives an API rate limit exceeded error, treat it as transient rather than terminal:
    • Persist the prepared payload (PR body, patch, asset) to a workflow artifact so a follow-up run can recover.
    • Re-enable fallback-as-issue for the Documentation Unbloat workflow (currently disabled per log line fallback-as-issue is disabled - not creating fallback issue), or implement a generic deferred-write path.
  • Add exponential back-off with retries that honor the Retry-After and X-RateLimit-Reset response headers on the PR creation and push paths (the current retries are exhausted within seconds).
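
The P0 items above can be sketched together as a retry wrapper that spills the payload to disk once retries run out. This is a minimal illustration, not the safe_outputs implementation: `RateLimitError`, `wait_seconds`, `run_with_retry`, and the spill path are hypothetical names, and the sketch assumes Retry-After arrives as a number of seconds and X-RateLimit-Reset as a UTC epoch timestamp.

```python
import json
import random
import time
from pathlib import Path
from typing import Any, Callable, Mapping, Optional


class RateLimitError(Exception):
    """Hypothetical error that carries the headers of a rate-limited response."""

    def __init__(self, message: str, headers: Mapping[str, str]):
        super().__init__(message)
        # Normalise header names so lookups are case-insensitive.
        self.headers = {k.lower(): v for k, v in headers.items()}


def wait_seconds(headers: Mapping[str, str], attempt: int, now: float,
                 base: float = 2.0, cap: float = 900.0) -> float:
    """Choose a delay: Retry-After first, then the reset time, else back-off."""
    if "retry-after" in headers:
        return min(cap, float(headers["retry-after"]))
    if headers.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in headers:
        # Wait until the quota window resets (epoch seconds), never a negative time.
        return min(cap, max(0.0, float(headers["x-ratelimit-reset"]) - now))
    return min(cap, base * (2 ** attempt))


def run_with_retry(op: Callable[[dict], Any], payload: dict, spill_path: Path,
                   max_attempts: int = 5,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.time) -> Optional[Any]:
    """Call op(payload); on rate limits wait and retry, then persist the payload."""
    for attempt in range(max_attempts):
        try:
            return op(payload)
        except RateLimitError as err:
            if attempt + 1 < max_attempts:
                # A little random jitter keeps parallel jobs from retrying in lockstep.
                sleep(wait_seconds(err.headers, attempt, clock()) + random.uniform(0, 1))
    # Retries exhausted: keep the prepared work so a follow-up run can recover it.
    spill_path.write_text(json.dumps(payload))
    return None
```

The `cap` keeps a single wait below a job's likely timeout; when the quota has not reset within the allowed attempts, the payload file is what a follow-up run (or an uploaded workflow artifact) would replay.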

P1 — reduce burst probability

  • Audit cron schedules of all daily code-push workflows; verify the jitter from [deep-report] Add schedule jitter to daily workflows — prevent installation token rate-limit bursts #29559 is actually applied and unique per workflow.
  • Identify the specific co-scheduled workflows around 16:40–17:00 UTC (Documentation Unbloat is hourly per its frontmatter; Slide Deck Maintainer is presumably daily — confirm and stagger).
  • Consider a global concurrency gate on safe_outputs code-push operations (max N concurrent across the org/installation) to keep token usage under quota even during legitimate bursts.
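
As a rough aid for the schedule audit above, the sketch below derives a stable per-workflow minute offset and flags workflows that start close together within the hour. `jitter_minute` and `find_collisions` are hypothetical helpers, the hashing scheme is an assumption rather than what #29559 implemented, and observed start minutes stand in for cron minutes because scheduled runs can start later than their cron time.

```python
import hashlib
from itertools import combinations


def jitter_minute(workflow_name: str, spread: int = 60) -> int:
    """Map a workflow name to a stable minute offset in [0, spread)."""
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % spread


def find_collisions(start_minutes: dict[str, int], window: int = 10):
    """Return (a, b, gap) for workflows starting within `window` minutes.

    Minutes are 0-59 within the hour; the gap wraps around the hour boundary,
    so minute 58 and minute 3 are 5 minutes apart.
    """
    pairs = []
    for (a, minute_a), (b, minute_b) in combinations(sorted(start_minutes.items()), 2):
        gap = abs(minute_a - minute_b)
        gap = min(gap, 60 - gap)
        if gap <= window:
            pairs.append((a, b, gap))
    return pairs


# The two runs from this report started at 16:51 and 16:41 UTC.
observed = {"Documentation Unbloat": 51, "Slide Deck Maintainer": 41}
print(find_collisions(observed, window=10))
# → [('Documentation Unbloat', 'Slide Deck Maintainer', 10)]
```

Hashing alone does not guarantee distinct minutes for every workflow, which is why the collision check is worth running over the full set of schedules after any jitter is applied.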

P2 — observability

  • Surface Rate-limit headroom low: 0/15000 in the agent's report-failure-as-issue body (the warning currently appears only in the worker logs, and the auto-failure issue is generic).
  • Emit a metric / OTLP span when safe_outputs_pre_check headroom drops below 10% so this is visible in the existing Sentry endpoint.
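
The 10% threshold above could be checked with something like the following sketch. `headroom_warning` is a hypothetical helper; the message format copies the existing log line, and the remaining/limit values are assumed to come from wherever the pre-check already reads them (for example the rate-limit response headers or GitHub's `GET /rate_limit` endpoint).

```python
from typing import Optional


def headroom_warning(remaining: int, limit: int, source: str,
                     threshold: float = 0.10) -> Optional[str]:
    """Return a warning string when headroom is below `threshold`, else None."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if remaining / limit >= threshold:
        return None
    percent = remaining * 100 // limit  # floor, so 1499/15000 reports 9%
    return (f"Rate-limit headroom low: {remaining}/{limit} "
            f"requests remaining ({percent}%) [{source}]")


print(headroom_warning(0, 15000, "safe_outputs_pre_check"))
# → Rate-limit headroom low: 0/15000 requests remaining (0%) [safe_outputs_pre_check]
```

The returned string can be appended to the failure-issue body and attached to a metric or span, so the same text shows up in both places.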

Verification / success criteria

  • Zero safe_outputs failures with API rate limit exceeded in any agentic workflow run for 7 consecutive days.
  • For any future rate-limit event, agent output is preserved (artifact or fallback issue) — not silently lost.
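
The first criterion can be tracked mechanically. The sketch below assumes failure messages have already been collected per day from safe_outputs logs; `clean_streak_days` and the input shape are hypothetical. Days with no data break the streak, so a gap in collection is not mistaken for a clean day.

```python
from datetime import date, timedelta

MARKER = "API rate limit exceeded"


def clean_streak_days(failures_by_day: dict[date, list[str]], today: date) -> int:
    """Count consecutive days ending at `today` with no rate-limit failure."""
    streak = 0
    day = today
    while day in failures_by_day and not any(MARKER in msg for msg in failures_by_day[day]):
        streak += 1
        day -= timedelta(days=1)
    return streak


# Illustrative data: seven clean days following the 2026-05-08 incident.
end = date(2026, 5, 15)
history = {end - timedelta(days=i): [] for i in range(7)}
history[date(2026, 5, 8)] = [
    "✗ Message 1 (create_pull_request) failed: API rate limit exceeded for installation"
]
print(clean_streak_days(history, end) >= 7)
# → True
```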

Confidence & unknowns

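  • Confidence: high that the two listed runs failed due to installation-token rate-limit exhaustion (the Slide Deck Maintainer pre-check logged an explicit 0/15000 headroom, and in both runs every API call in the push path, including the fallback issue where it was enabled, was rejected with the same error).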
  • Unknown: the agentic-workflows MCP logs tool returned only 15 runs for the 6h window, while gh api reports 30+ workflow run failures (most non-agentic CI). There may be additional agentic runs in the same burst that were filtered out — worth re-running this investigation with broader filters once rate limit recovers.
  • Unknown: whether the jitter fix from [deep-report] Add schedule jitter to daily workflows — prevent installation token rate-limit bursts #29559 was fully landed; spot-checking a daily cron file would confirm.

References

Parent: #30961

Generated by [aw] Failure Investigator (6h) · ● 14.6M ·

  • expires on May 15, 2026, 7:22 PM UTC
