Executive Summary
The GitHub App installation rate-limit exhaustion failure cluster, previously closed via #29318 (2026-04-30) and #29559 (2026-05-01 jitter fix), has recurred today. In a ~30-minute window (16:46–17:04 UTC on 2026-05-08), the installation token quota was driven to 0/15000, causing safe_outputs failures across at least 2 main-branch agentic workflows. The agent jobs all completed successfully; the work was lost in the post-agent push step.
This cluster has the highest blast radius of any failure observed in the 6h window because the agent had already paid the full token cost and produced valid output before the safe_outputs layer dropped it.
Existing issue correlation
The jitter fix in #29559 reduced but did not eliminate rate-limit exhaustion induced by concurrent bursts. Today's burst at 16:41–16:51 UTC suggests that two scheduled workflows still landed within the same minute window (or that other unjittered code-push workflows piled on top).
Proposed fix roadmap
P0 — stop dropping work
When safe_outputs push receives an API rate limit exceeded error, treat it as transient rather than terminal:
Persist the prepared payload (PR body, patch, asset) to a workflow artifact so a follow-up run can recover.
Re-enable fallback-as-issue for the Documentation Unbloat workflow (currently disabled per the log line "fallback-as-issue is disabled - not creating fallback issue"), or implement a generic deferred-write path.
Add exponential back-off with retries that honor X-RateLimit-Reset on the PR creation / push paths (the current retry budget is exhausted within seconds).
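The P0 items above can be sketched roughly as follows. This is a minimal Python illustration, not the safe_outputs implementation: `compute_wait_seconds`, `push_with_retry`, and the `push` / `persist` callbacks are hypothetical names, and the sketch assumes the failed response exposes GitHub's `Retry-After`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers.

```python
import time
from typing import Callable, Mapping, Tuple


def compute_wait_seconds(headers: Mapping[str, str], attempt: int, now: float,
                         base: float = 2.0, cap: float = 900.0) -> float:
    """Pick how long to sleep before the next attempt (hypothetical helper)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Retry-After (in seconds) takes priority when the server sends it.
    if "retry-after" in lowered:
        return min(cap, max(0.0, float(lowered["retry-after"])))
    # X-RateLimit-Reset is a UTC epoch timestamp in seconds; wait until then
    # when the quota is reported as fully spent.
    if lowered.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in lowered:
        return min(cap, max(0.0, float(lowered["x-ratelimit-reset"]) - now))
    # No usable headers: fall back to capped exponential back-off.
    return min(cap, base * (2 ** attempt))


def push_with_retry(push: Callable[[], Tuple[bool, Mapping[str, str]]],
                    persist: Callable[[], None],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.time) -> bool:
    """Retry a push; if every attempt fails, persist the payload instead.

    `push` returns (ok, response_headers). `persist` stands in for whatever
    saves the prepared payload (artifact upload, fallback issue, ...).
    """
    for attempt in range(max_attempts):
        ok, headers = push()
        if ok:
            return True
        if attempt + 1 < max_attempts:
            sleep(compute_wait_seconds(headers, attempt, clock()))
    # Rate limit never cleared: keep the work rather than dropping it.
    persist()
    return False
```

The `cap` keeps a single wait bounded so a job does not sleep past its own timeout; a real implementation would probably also add jitter to the fallback delay so that parallel jobs do not retry in lockstep.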
P1 — reduce burst probability
Identify the specific co-scheduled workflows around 16:40–17:00 UTC (Documentation Unbloat is hourly per its frontmatter; Slide Deck Maintainer is presumably daily, to be confirmed), then stagger them.
Consider a global concurrency gate on safe_outputs code-push operations (max N concurrent across the org/installation) to keep token usage under quota even during legitimate bursts.
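One way to find co-scheduled workflows is to expand each schedule into its fire times and flag pairs that land within a few minutes of each other. The sketch below is illustrative only: `fire_minutes` and `find_collisions` are hypothetical helpers, each schedule is simplified to an (offset, period) pair in minutes since midnight UTC, and wrap-around at midnight is ignored.

```python
from itertools import combinations
from typing import Dict, List, Tuple

MINUTES_PER_DAY = 24 * 60


def fire_minutes(offset: int, period: int) -> List[int]:
    """Minutes since midnight UTC at which a schedule fires during one day."""
    return list(range(offset % period, MINUTES_PER_DAY, period))


def find_collisions(schedules: Dict[str, Tuple[int, int]],
                    window: int = 5) -> List[Tuple[str, str, int, int]]:
    """Return (name_a, name_b, minute_a, minute_b) for every pair of fire
    times that fall within `window` minutes of each other."""
    collisions = []
    for (name_a, sched_a), (name_b, sched_b) in combinations(sorted(schedules.items()), 2):
        times_b = fire_minutes(*sched_b)
        for minute_a in fire_minutes(*sched_a):
            for minute_b in times_b:
                if abs(minute_a - minute_b) <= window:
                    collisions.append((name_a, name_b, minute_a, minute_b))
    return collisions


# Placeholder schedules, NOT the real workflow frontmatter:
# an hourly job at minute 40 and a daily job at 16:43 UTC.
example = {"hourly-job": (40, 60), "daily-job": (16 * 60 + 43, MINUTES_PER_DAY)}
print(find_collisions(example))
```

With these placeholder values the two jobs collide once per day (16:40 versus 16:43); moving the daily job to 16:20 clears the collision. The same check could run in CI whenever a workflow's schedule changes.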
P2 — observability
Surface the "Rate-limit headroom low: 0/15000" message in the body of the agent's report-failure-as-issue (currently it appears only in the worker logs, and the auto-failure issue is generic).
Emit a metric / OTLP span when safe_outputs_pre_check headroom drops below 10%, so that the condition is visible in the existing Sentry endpoint.
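The 10% headroom threshold can be expressed as a small pure check that feeds both the failure-issue body and the metric. The function names below are hypothetical; only the message format is taken from the log line quoted above.

```python
from typing import Optional


def headroom_fraction(remaining: int, limit: int) -> float:
    """Fraction of the installation quota still available (0.0 to 1.0)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return remaining / limit


def headroom_alert(remaining: int, limit: int, threshold: float = 0.10) -> Optional[str]:
    """Return a warning line when headroom is below `threshold`, else None."""
    if headroom_fraction(remaining, limit) < threshold:
        return f"Rate-limit headroom low: {remaining}/{limit}"
    return None


print(headroom_alert(0, 15000))
```

Returning the message, rather than logging it directly, lets the same string be attached to the failure issue and to a span attribute without duplicating the threshold logic.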
Verification / success criteria
Zero safe_outputs failures with API rate limit exceeded in any agentic workflow run for 7 consecutive days.
For any future rate-limit event, agent output is preserved (artifact or fallback issue) — not silently lost.
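The first criterion can be checked mechanically once the dates of rate-limit failures are known. The sketch below assumes those dates have already been collected by some other means; `clean_streak_days` is a hypothetical helper, not an existing tool.

```python
from datetime import date, timedelta
from typing import Iterable


def clean_streak_days(failure_dates: Iterable[date], today: date,
                      lookback: int = 30) -> int:
    """Count consecutive failure-free days ending at `today` (inclusive),
    looking back at most `lookback` days."""
    failures = set(failure_dates)
    streak = 0
    day = today
    while streak < lookback and day not in failures:
        streak += 1
        day -= timedelta(days=1)
    return streak


# The recurrence described in this report happened on 2026-05-08.
failures = [date(2026, 5, 8)]
print(clean_streak_days(failures, date(2026, 5, 15)) >= 7)
```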
Confidence & unknowns
Confidence: high that the two listed runs failed due to installation-token rate-limit exhaustion (explicit 0/15000 headroom and PR/fallback both reject with the same error).
Unknown: the agentic-workflows MCP logs tool returned only 15 runs for the 6h window, while gh api reports 30+ workflow run failures (most non-agentic CI). There may be additional agentic runs in the same burst that were filtered out — worth re-running this investigation with broader filters once rate limit recovers.
Failure clusters (rate-limit cluster only)
Documentation Unbloat: safe_outputs failed; create_pull_request rate-limited, fallback-as-issue disabled, work lost
Slide Deck Maintainer: safe_outputs failed; PR creation and fallback issue both rate-limited, work lost
Evidence
Documentation Unbloat — safe_outputs log (job §75058317395)
Slide Deck Maintainer — safe_outputs log (job §75055609089)
References
Parent: #30961