Skip to content

[aw-failures] Fix: claude_harness.cjs retries error_max_turns exits via --continue, always failing (3x wasted retries per DDG complex-PR run) #29600

@github-actions

Description

@github-actions

Problem

claude_harness.cjs misclassifies a clean error_max_turns exit (exit code 1) as a transient overloaded/rate-limit API error. This triggers 3 unnecessary --continue retries, each of which fails immediately because no deferred tool marker exists in a session that ended due to max_turns.

Evidence

From run §25228247320 (Design Decision Gate, PR #29581 copilot/review-claude-agent-driver, 2026-05-01):

Agent exit record (attempt 1):

{"type":"result","subtype":"error_max_turns","is_error":true,"num_turns":13,
 "terminal_reason":"max_turns","errors":["Reached maximum number of turns (12)"]}

Harness misclassification and retry cascade:

attempt 1: process closed exitCode=1 duration=1m28s
attempt 1: overloaded_error (transient) — will retry with --continue (attempt 2/4)
retry 1/3: sleeping 5000ms before next attempt

attempt 2: process closed exitCode=1 duration=0s
Error: No deferred tool marker found in the resumed session.
attempt 2: partial execution — will retry with --continue (attempt 3/4)
retry 2/3: sleeping 10000ms before next attempt

attempt 3: same error — retry with --continue (attempt 4/4)
retry 3/3: sleeping 20000ms before next attempt

attempt 4: same error — all 3 retries exhausted (exitCode=1)
Total harness duration: 2m5s

Total wasted overhead: 35s of retry sleep + ~3 instant-fail continuation attempts ≈ 2m5s total.

Root Cause

The harness checks isOverloadedError and isRateLimitError based on the exit code (1) alone, without inspecting whether the error_max_turns subtype was in the agent's stdout. As a result, any max_turns exit gets incorrectly treated as a transient API failure.

--continue on a max_turns session always fails because the session ended cleanly (not deferred mid-tool-call); the "deferred tool marker" that --continue looks for was never written.

Affected Runs

This run is also the same scenario tracked in #29414 (DDG hits max_turns on complex PRs + comment blocked by security policy). The harness retry loop is an additional compounding factor that prolongs the overall run time from ~4m to ~13m.

Proposed Remediation

In claude_harness.cjs, before triggering the --continue retry path, inspect the agent's stdout for "subtype":"error_max_turns". If found:

  1. Do not retry — the session ended deterministically; --continue cannot recover it.
  2. Exit with the correct code — surface the max_turns failure directly rather than masking it as a transient error.
// Pseudocode guard
if (stdout.includes('"subtype":"error_max_turns"')) {
  log('max_turns exit — not retriable via --continue');
  process.exit(1);
}

Success Criteria

  • DDG fails on complex PRs show exit in ~4–5m (agent time) instead of ~13m
  • No "No deferred tool marker found" entries appear in harness logs for max_turns exits
  • Retry attempts are only triggered for true transient API errors (overloaded, rate-limit)

Related

Generated by [aw] Failure Investigator (6h) · ● 755.3K ·

  • expires on May 8, 2026, 7:24 PM UTC

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions