Problem
claude_harness.cjs misclassifies a clean error_max_turns exit (exit code 1) as a transient overloaded/rate-limit API error. This triggers 3 unnecessary --continue retries, each of which fails immediately because no deferred tool marker exists in a session that ended due to max_turns.
Evidence
From run §25228247320 (Design Decision Gate, PR #29581 copilot/review-claude-agent-driver, 2026-05-01):
Agent exit record (attempt 1):
{"type":"result","subtype":"error_max_turns","is_error":true,"num_turns":13,
"terminal_reason":"max_turns","errors":["Reached maximum number of turns (12)"]}
Harness misclassification and retry cascade:
attempt 1: process closed exitCode=1 duration=1m28s
attempt 1: overloaded_error (transient) — will retry with --continue (attempt 2/4)
retry 1/3: sleeping 5000ms before next attempt
attempt 2: process closed exitCode=1 duration=0s
Error: No deferred tool marker found in the resumed session.
attempt 2: partial execution — will retry with --continue (attempt 3/4)
retry 2/3: sleeping 10000ms before next attempt
attempt 3: same error — retry with --continue (attempt 4/4)
retry 3/3: sleeping 20000ms before next attempt
attempt 4: same error — all 3 retries exhausted (exitCode=1)
Total harness duration: 2m5s
Total wasted overhead: 35s of retry sleep + ~3 instant-fail continuation attempts ≈ 2m5s total.
Root Cause
The harness checks isOverloadedError and isRateLimitError based on the exit code (1) alone, without inspecting whether the error_max_turns subtype was in the agent's stdout. As a result, any max_turns exit gets incorrectly treated as a transient API failure.
--continue on a max_turns session always fails because the session ended cleanly (not deferred mid-tool-call); the "deferred tool marker" that --continue looks for was never written.
Affected Runs
This run is also the same scenario tracked in #29414 (DDG hits max_turns on complex PRs + comment blocked by security policy). The harness retry loop is an additional compounding factor that prolongs the overall run time from ~4m to ~13m.
Proposed Remediation
In claude_harness.cjs, before triggering the --continue retry path, inspect the agent's stdout for "subtype":"error_max_turns". If found:
- Do not retry — the session ended deterministically;
--continue cannot recover it.
- Exit with the correct code — surface the
max_turns failure directly rather than masking it as a transient error.
// Pseudocode guard
if (stdout.includes('"subtype":"error_max_turns"')) {
log('max_turns exit — not retriable via --continue');
process.exit(1);
}
Success Criteria
- DDG fails on complex PRs show exit in ~4–5m (agent time) instead of ~13m
- No "No deferred tool marker found" entries appear in harness logs for
max_turns exits
- Retry attempts are only triggered for true transient API errors (overloaded, rate-limit)
Related
Generated by [aw] Failure Investigator (6h) · ● 755.3K · ◷
Problem
claude_harness.cjsmisclassifies a cleanerror_max_turnsexit (exit code 1) as a transient overloaded/rate-limit API error. This triggers 3 unnecessary--continueretries, each of which fails immediately because no deferred tool marker exists in a session that ended due to max_turns.Evidence
From run §25228247320 (Design Decision Gate, PR #29581
copilot/review-claude-agent-driver, 2026-05-01):Agent exit record (attempt 1):
{"type":"result","subtype":"error_max_turns","is_error":true,"num_turns":13, "terminal_reason":"max_turns","errors":["Reached maximum number of turns (12)"]}Harness misclassification and retry cascade:
Total wasted overhead: 35s of retry sleep + ~3 instant-fail continuation attempts ≈ 2m5s total.
Root Cause
The harness checks
isOverloadedErrorandisRateLimitErrorbased on the exit code (1) alone, without inspecting whether theerror_max_turnssubtype was in the agent's stdout. As a result, anymax_turnsexit gets incorrectly treated as a transient API failure.--continueon amax_turnssession always fails because the session ended cleanly (not deferred mid-tool-call); the "deferred tool marker" that--continuelooks for was never written.Affected Runs
This run is also the same scenario tracked in #29414 (DDG hits max_turns on complex PRs + comment blocked by security policy). The harness retry loop is an additional compounding factor that prolongs the overall run time from ~4m to ~13m.
Proposed Remediation
In
claude_harness.cjs, before triggering the--continueretry path, inspect the agent's stdout for"subtype":"error_max_turns". If found:--continuecannot recover it.max_turnsfailure directly rather than masking it as a transient error.Success Criteria
max_turnsexitsRelated
copilot/review-claude-agent-driver) — the PR that both triggered this failure and contains the harness code where the fix should land