[aw-failures] P0 recurrence: GitHub App installation rate-limit exhaustion blocks safe_outputs (2026-05-08 ~16:46–17:04 UTC) #31079

### Executive Summary

The **GitHub App installation rate-limit exhaustion** failure cluster, previously closed via #29318 (2026-04-30) and #29559 (2026-05-01 jitter fix), has **recurred today**. In a roughly 20-minute window (16:46–17:04 UTC on 2026-05-08), the installation token quota was driven to **0/15000**, causing `safe_outputs` failures in at least two main-branch agentic workflows. The agent jobs in both runs completed successfully; the work was lost in the post-agent push step.

This cluster has the highest blast radius of any failure observed in the 6h window because the agent already paid full token cost and produced valid output before the safe_outputs layer dropped it.

### Failure clusters (rate-limit cluster only)

| Workflow | Run | Started (UTC) | Outcome |
|---|---|---|---|
| Documentation Unbloat | [§25568028328](https://github.com/github/gh-aw/actions/runs/25568028328) | 16:51 | `safe_outputs` failed: `create_pull_request` rate-limited, `fallback-as-issue` disabled, work lost |
| Slide Deck Maintainer | [§25567564480](https://github.com/github/gh-aw/actions/runs/25567564480) | 16:41 | `safe_outputs` failed: PR creation **and** fallback issue both rate-limited, work lost |

### Evidence

<details>
<summary>Documentation Unbloat — safe_outputs log (job §75058317395)</summary>

```
17:04:38  Processing create_pull_request: title=docs: trim bloat from multi-repo examples (-44%), bodyLength=2005
17:04:38  ##[warning]Failed to fetch repository default branch: API rate limit exceeded for installation. ... timestamp 2026-05-08 17:04:37 UTC
17:04:39  ##[warning]pushSignedCommits: GraphQL signed push failed, falling back to git push: API rate limit exceeded for installation. ...
17:04:41  ##[warning]Failed to create pull request: API rate limit exceeded for installation. ...
17:04:41  ##[error]fallback-as-issue is disabled - not creating fallback issue
17:04:41  ##[error]✗ Message 2 (create_pull_request) failed: API rate limit exceeded for installation
17:04:41  ##[warning]⚠️ Code push operation 'create_pull_request' failed — remaining safe outputs will be cancelled
17:04:41  Failed: 1
17:04:41    Types: upload_asset
17:04:41  ##[error]1 safe output(s) failed
```

</details>

<details>
<summary>Slide Deck Maintainer — safe_outputs log (job §75055609089)</summary>

```
16:46:46  ##[warning]⚠️ Rate-limit headroom low: 0/15000 requests remaining (0%) [safe_outputs_pre_check]
16:46:46  ##[warning]Failed to fetch repository default branch: API rate limit exceeded for installation
16:46:46  ##[warning]pushSignedCommits: GraphQL signed push failed, falling back to git push: API rate limit exceeded
16:46:48  ##[warning]Failed to create pull request: API rate limit exceeded for installation
16:46:48  ##[error]Failed to create both pull request and fallback issue. PR error: API rate limit exceeded ... Issue error: API rate limit exceeded
16:46:48  ##[error]✗ Message 1 (create_pull_request) failed
```

</details>

### Existing issue correlation

- **#29318** (closed 2026-04-30) — original rate-limit burst at 11:05–11:41 UTC; same failure pattern (10 workflows hit, PR creation + fallback issue both fail).
- **#29559** (closed 2026-05-01) — schedule-jitter remediation tracking issue, marked closed.
- **#29540** (still open from 2026-05-01) — failure investigation report explicitly identified Cluster A as rate-limit at 12:05–12:28 UTC.
- **#27251 / #27258** (closed 2026-04-20) — earliest known instance.

The jitter fix in #29559 reduced but did **not eliminate** burst-induced rate-limit exhaustion. Today's failing runs started at 16:41 and 16:51 UTC, which suggests that scheduled workflows still cluster within the same ~10-minute window (or that other unjittered code-push workflows piled on top).

### Proposed fix roadmap

**P0 — stop dropping work**
- When `safe_outputs` push receives an `API rate limit exceeded` error, treat it as **transient** rather than terminal:
  - Persist the prepared payload (PR body, patch, asset) to a workflow artifact so a follow-up run can recover.
  - Re-enable `fallback-as-issue` for the Documentation Unbloat workflow (currently disabled per log line `fallback-as-issue is disabled - not creating fallback issue`), or implement a generic deferred-write path.
- Add exponential back-off to the PR creation and push paths, and honor the `Retry-After` and `X-RateLimit-Reset` response headers when deciding how long to wait (the current retry is exhausted within seconds).
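
A minimal sketch of the wait calculation for the retry item above, assuming the headers of the failed response are available to the caller. `rate_limit_wait_seconds` and its parameters are hypothetical names, not part of `safe_outputs`; the header names are the ones GitHub's REST API documents for rate-limited responses, and the 60-second base and 900-second cap are illustrative defaults only.

```python
import random
import time
from typing import Mapping, Optional


def rate_limit_wait_seconds(
    headers: Mapping[str, str],
    attempt: int,
    now: Optional[float] = None,
    base: float = 60.0,   # assumed floor for the first blind retry
    cap: float = 900.0,   # assumed ceiling for blind retries
    jitter: float = 5.0,  # assumed max random spread, in seconds
) -> float:
    """Return how long to sleep before retry number `attempt` (0-based)."""
    now = time.time() if now is None else now
    h = {k.lower(): v for k, v in headers.items()}

    # Explicit server guidance wins: Retry-After is a number of seconds.
    if "retry-after" in h:
        return float(h["retry-after"])

    # Quota exhausted: wait until the reset time (UTC epoch seconds),
    # plus one second of slack for clock skew.
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        return max(0.0, float(h["x-ratelimit-reset"]) - now) + 1.0

    # No guidance: capped exponential back-off with a little jitter so
    # concurrent jobs do not retry in lockstep.
    return min(cap, base * (2 ** attempt)) + random.uniform(0.0, jitter)
```

If the returned wait is longer than the job's remaining time budget, the caller would be better off persisting the payload for a later run than sleeping.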

**P1 — reduce burst probability**
- Audit cron schedules of all daily code-push workflows; verify the jitter from #29559 is actually applied and unique per workflow.
- Identify the specific co-scheduled workflows around 16:40–17:00 UTC (Documentation Unbloat is hourly per its frontmatter; Slide Deck Maintainer is presumably daily — confirm and stagger).
- Consider a global concurrency gate on `safe_outputs` code-push operations (max N concurrent across the org/installation) to keep token usage under quota even during legitimate bursts.
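
The audit in the first item could be approximated as follows. This is a sketch under assumptions: how #29559 actually derives its jitter is not shown in this issue, so the hash-based minute offset below is a stand-in technique, and `jittered_cron` and `minute_collisions` are hypothetical helpers.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


def cron_minute(workflow_name: str) -> int:
    """Derive a stable minute offset (0-59) from the workflow name."""
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60


def jittered_cron(workflow_name: str, hour: Optional[int] = None) -> str:
    """Build a cron expression: hourly if `hour` is None, else daily."""
    hour_field = "*" if hour is None else str(hour)
    return f"{cron_minute(workflow_name)} {hour_field} * * *"


def minute_collisions(workflow_names: Iterable[str]) -> Dict[int, List[str]]:
    """Return only the minute slots shared by more than one workflow."""
    slots: Dict[int, List[str]] = defaultdict(list)
    for name in workflow_names:
        slots[cron_minute(name)].append(name)
    return {minute: names for minute, names in slots.items() if len(names) > 1}
```

A hash spreads workflows across the hour deterministically but does not guarantee uniqueness, which is why the collision check is kept separate: any slot it reports would need a manual stagger.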

**P2 — observability**
- Surface `Rate-limit headroom low: 0/15000` in the agent's `report-failure-as-issue` body (currently this warning appears only in the worker logs; the auto-failure issue is generic).
- Emit a metric / OTLP span when `safe_outputs_pre_check` headroom drops below 10% so this is visible in the existing Sentry endpoint.
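
As a sketch of the 10% threshold check, the headroom can be recovered from the warning line already present in the Slide Deck Maintainer log. The regular expression matches that line's format as quoted in the Evidence section; `headroom_fraction` and `should_alert` are hypothetical names, and wiring the result into a metric or span is left out.

```python
import re
from typing import Optional

# Matches e.g. "Rate-limit headroom low: 0/15000 requests remaining (0%)".
HEADROOM_RE = re.compile(r"Rate-limit headroom low: (\d+)/(\d+) requests remaining")


def headroom_fraction(log_line: str) -> Optional[float]:
    """Return remaining/limit as a fraction, or None if the line has no headroom info."""
    match = HEADROOM_RE.search(log_line)
    if match is None:
        return None
    remaining, limit = int(match.group(1)), int(match.group(2))
    return remaining / limit if limit else 0.0


def should_alert(log_line: str, threshold: float = 0.10) -> bool:
    """True when the line reports headroom strictly below the threshold."""
    fraction = headroom_fraction(log_line)
    return fraction is not None and fraction < threshold
```

Reading the value from the structured pre-check result rather than from log text would be more robust, if `safe_outputs_pre_check` exposes it.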

### Verification / success criteria

- Zero `safe_outputs` failures with `API rate limit exceeded` in any agentic workflow run for 7 consecutive days.
- For any future rate-limit event, agent output is preserved (artifact or fallback issue) — not silently lost.
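
One way the second criterion could be met is to write the prepared payload to disk before the push is attempted, so the file can be uploaded as a workflow artifact if the push fails. The sketch below is an assumption about shape only: the file naming, the payload fields, and the `persist_payload` / `load_payload` helpers are hypothetical and do not describe the current `safe_outputs` implementation.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict


def persist_payload(payload: Dict[str, Any], out_dir: str, run_id: str) -> Path:
    """Write the payload as JSON and return the path of the written file."""
    path = Path(out_dir) / f"deferred-safe-output-{run_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a truncated payload under the final name.
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    os.replace(tmp, path)
    return path


def load_payload(path: Path) -> Dict[str, Any]:
    """Read back a payload written by persist_payload."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A follow-up run could then call `load_payload` on the downloaded artifact and replay the `create_pull_request` message once quota has recovered.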

### Confidence & unknowns

- **Confidence: high** that the two listed runs failed due to installation-token rate-limit exhaustion (explicit `0/15000` headroom in the Slide Deck Maintainer log, and every PR-creation and fallback-issue call that was attempted failed with the same error).
- **Unknown:** the agentic-workflows MCP `logs` tool returned only 15 runs for the 6h window, while `gh api` reports 30+ workflow run failures (most non-agentic CI). There may be additional agentic runs in the same burst that were filtered out — worth re-running this investigation with broader filters once rate limit recovers.
- **Unknown:** whether the jitter fix from #29559 was fully landed; spot-checking a daily cron file would confirm.

### References

- [§25568028328](https://github.com/github/gh-aw/actions/runs/25568028328) — Documentation Unbloat (rate-limit at 17:04 UTC)
- [§25567564480](https://github.com/github/gh-aw/actions/runs/25567564480) — Slide Deck Maintainer (rate-limit at 16:46 UTC, headroom already 0)
- [§25574359989](https://github.com/github/gh-aw/actions/runs/25574359989) — this investigation run

Parent: #30961

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/25574359989/agentic_workflow) · ● 14.6M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires  on May 15, 2026, 7:22 PM UTC
