[aw-failures] Fix: claude_harness.cjs retries error_max_turns exits via --continue, always failing (3x wasted retries per DDG complex-PR run)

### Problem

`claude_harness.cjs` misclassifies a clean `error_max_turns` exit (exit code 1) as a transient overloaded/rate-limit API error. This triggers 3 unnecessary `--continue` retries, each of which fails immediately because no deferred tool marker exists in a session that ended due to max_turns.

### Evidence

From run [§25228247320](https://github.com/github/gh-aw/actions/runs/25228247320) (Design Decision Gate, PR #29581 `copilot/review-claude-agent-driver`, 2026-05-01):

**Agent exit record (attempt 1):**
```json
{"type":"result","subtype":"error_max_turns","is_error":true,"num_turns":13,
 "terminal_reason":"max_turns","errors":["Reached maximum number of turns (12)"]}
```

**Harness misclassification and retry cascade:**
```
attempt 1: process closed exitCode=1 duration=1m28s
attempt 1: overloaded_error (transient) — will retry with --continue (attempt 2/4)
retry 1/3: sleeping 5000ms before next attempt

attempt 2: process closed exitCode=1 duration=0s
Error: No deferred tool marker found in the resumed session.
attempt 2: partial execution — will retry with --continue (attempt 3/4)
retry 2/3: sleeping 10000ms before next attempt

attempt 3: same error — retry with --continue (attempt 4/4)
retry 3/3: sleeping 20000ms before next attempt

attempt 4: same error — all 3 retries exhausted (exitCode=1)
Total harness duration: 2m5s
```

Wasted overhead: **35s of retry sleep (5s + 10s + 20s) plus 3 instant-fail continuation attempts ≈ 37s** of the 2m5s total harness duration; attempt 1 itself accounts for 1m28s.
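
The sleep values in the log are consistent with a doubling backoff that starts at 5000ms. A minimal check of that arithmetic, assuming such a schedule (the helper name `backoffDelays` and its parameters are illustrative and not taken from the harness):

```js
// Hypothetical helper: delays for a doubling backoff, chosen to match the
// 5000ms / 10000ms / 20000ms sleeps reported in the log above.
function backoffDelays(baseMs, retries) {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const delays = backoffDelays(5000, 3);
// Sum of all sleeps across the three retries.
const totalSleepMs = delays.reduce((sum, ms) => sum + ms, 0);
console.log(delays, totalSleepMs);
```

Under this assumption the three sleeps sum to 35000ms, which matches the 35s of retry sleep seen in the run.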

### Root Cause

The harness checks `isOverloadedError` and `isRateLimitError` based on the exit code (1) alone, without inspecting whether the `error_max_turns` subtype was in the agent's stdout. As a result, any `max_turns` exit gets incorrectly treated as a transient API failure.

`--continue` on a `max_turns` session always fails because the session ended cleanly (not deferred mid-tool-call); the "deferred tool marker" that `--continue` looks for was never written.
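
One way the missing inspection could look is sketched below. It assumes the agent writes one JSON object per stdout line and that the final `result` record has the shape shown in the Evidence section; `findResultRecord` and `isMaxTurnsExit` are hypothetical names, not existing harness functions.

```js
// Assumption: stdout is newline-delimited JSON mixed with plain log lines,
// and the "result" record is expected near the end of the stream.
function findResultRecord(stdout) {
  const lines = stdout.split('\n');
  // Scan backwards so the last result record wins.
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i].trim();
    if (!line.startsWith('{')) continue;
    try {
      const record = JSON.parse(line);
      if (record && record.type === 'result') return record;
    } catch {
      // Not valid JSON (for example a truncated line); keep scanning.
    }
  }
  return null;
}

// True only when the session ended because the turn limit was reached.
function isMaxTurnsExit(stdout) {
  const record = findResultRecord(stdout);
  return record !== null && record.subtype === 'error_max_turns';
}
```

Parsing the record instead of matching a raw substring avoids false positives when the text `error_max_turns` merely appears inside another message, at the cost of depending on the one-object-per-line assumption.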

### Affected Runs

- [§25228247320](https://github.com/github/gh-aw/actions/runs/25228247320) — Design Decision Gate, PR #29581, 2026-05-01 (confirmed, 3 retry loops)

This run is the same scenario tracked in #29414 (DDG hits max_turns on complex PRs, and the comment is blocked by security policy). The harness retry loop is a compounding factor that prolongs the overall run time from ~4m to ~13m.

### Proposed Remediation

In `claude_harness.cjs`, before triggering the `--continue` retry path, inspect the agent's stdout for `"subtype":"error_max_turns"`. If found:

1. **Do not retry** — the session ended deterministically; `--continue` cannot recover it.
2. **Exit with the correct code** — surface the `max_turns` failure directly rather than masking it as a transient error.

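```js
// Sketch of the guard; `stdout` is assumed to hold the agent's captured output.
// The regex tolerates optional whitespace around the colon in the JSON record.
if (/"subtype"\s*:\s*"error_max_turns"/.test(stdout)) {
  console.error('max_turns exit: not retriable via --continue');
  process.exit(1);
}
```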
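
The ordering of the checks matters as much as the guard itself. The sketch below shows one possible classification step in which the deterministic `max_turns` case is tested before any transient check. `classifyExit` and `shouldRetry` are hypothetical names, and `isTransient` stands in for the harness's existing overloaded and rate-limit checks, whose real signatures are not shown in this issue.

```js
// Hypothetical classification step; not the harness's actual code.
const MAX_TURNS_RE = /"subtype"\s*:\s*"error_max_turns"/;

function classifyExit(exitCode, stdout, isTransient) {
  if (exitCode === 0) return 'success';
  // Deterministic failure first, so it can never fall through to a retry.
  if (MAX_TURNS_RE.test(stdout)) return 'max_turns';
  // `isTransient` is an injected predicate (assumed), e.g. overloaded or rate-limit detection.
  return isTransient(stdout) ? 'retry' : 'fatal';
}

function shouldRetry(exitCode, stdout, isTransient) {
  return classifyExit(exitCode, stdout, isTransient) === 'retry';
}
```

With this ordering, a `max_turns` exit is never retried even if the transient predicate would also have matched the same output.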

### Success Criteria

- DDG failures on complex PRs exit in ~4–5m (agent time) instead of ~13m
- No "No deferred tool marker found" entries appear in harness logs for `max_turns` exits
- Retry attempts are only triggered for true transient API errors (overloaded, rate-limit)
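
The second criterion could be checked mechanically against a captured harness log. A minimal sketch, assuming the log is available as plain text (the helper name `countDeferredMarkerErrors` is hypothetical):

```js
// Hypothetical log check: counts lines containing the continuation failure
// message quoted in the Evidence section.
function countDeferredMarkerErrors(logText) {
  return logText
    .split('\n')
    .filter((line) => line.includes('No deferred tool marker found')).length;
}
```

A run that satisfies the criterion would yield a count of 0 for any log whose agent exit was `error_max_turns`.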

### Related

- #29414 — root tracking issue for DDG max_turns failures on complex PRs
- PR #29581 (`copilot/review-claude-agent-driver`) — the PR that both triggered this failure and contains the harness code where the fix should land







> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/25228741235/agentic_workflow) · ● 755.3K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires  on May 8, 2026, 7:24 PM UTC
