feat(harness): stop identical-call loops and hand back stuck sub-agents as incomplete#4230
feat(harness): stop identical-call loops and hand back stuck sub-agents as incomplete#4230sanil-23 wants to merge 5 commits into
Conversation
…ts as incomplete Two agent-reliability fixes surfaced by fuzz testing: - tinyhumansai#4088 (tinyhumansai#4095): add RepeatCallGuard, a circuit breaker that halts a turn after 3 consecutive identical (tool, args) calls regardless of success. Closes the gap RepeatFailureGuard (resets on success) and RepeatOutputGuard (also hashes narration) both miss. Resets on any distinct intervening call, so read->write->read and varied-arg enumeration never trip it. - tinyhumansai#4091 (tinyhumansai#4096): add a TurnStop discriminator (Final/Halted/Cap/EarlyExit) and SubagentRunStatus::Incomplete so a stuck or capped sub-agent hands the orchestrator a [SUBAGENT_INCOMPLETE] envelope (partial progress + blocker) instead of a success-framed result it could narrate as done or re-spawn. All delegation tools and status consumers updated; one orchestrator-prompt rule added to relay incomplete results rather than re-run unchanged. Rust core only; no Tauri/frontend/dependency changes. Tests: repeat-call guard units, engine TurnStop Halted/Cap tests, and an end-to-end stuck-subagent Incomplete test.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe turn engine now emits explicit stop states, including a repeat-call halt. Subagent runner and orchestration code propagate incomplete outcomes with structured reasons, partial output, and updated handling across spawned, continued, and dispatched runs. ChangesTurn stop-state propagation
Sequence Diagram(s)sequenceDiagram
participant run_turn_engine
participant RepeatCallGuard
participant TurnEngineOutcome
run_turn_engine->>RepeatCallGuard: record(canonical (tool, args) batch)
RepeatCallGuard-->>run_turn_engine: halt summary after REPEAT_CALL_THRESHOLD
run_turn_engine->>TurnEngineOutcome: stop = Halted
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches⚔️ Resolve merge conflicts
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 935e995445
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent_orchestration/tools/continue_subagent.rs`:
- Around line 315-359: The Incomplete branch in continue_subagent’s
SubagentRunStatus handling leaves a stale checkpoint file behind after the
resumed run stops short again. Mirror the cleanup done in the Completed arm by
adding a best-effort remove_file on checkpoint_path in this Incomplete case,
keeping the checkpoint lifecycle consistent in continue_subagent.
In `@src/openhuman/agent/harness/engine/core_tests.rs`:
- Around line 665-674: The repeated-call test in RepeatedCallProvider::chat is
not isolating the repeat-CALL guard because the narration text is constant and
can also trip the repeat-OUTPUT guard. Add a counter-backed varying narration to
RepeatedCallProvider, mirroring VariedCallProvider, so each ChatResponse text
changes while tool_calls stays identical; this makes the test’s assertion on the
call breaker accurate and keeps the comment aligned with behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f69aa420-dbb1-47e3-a2c8-65c7cb287175
📒 Files selected for processing (18)
src/openhuman/agent/harness/engine/core.rssrc/openhuman/agent/harness/engine/core_tests.rssrc/openhuman/agent/harness/engine/mod.rssrc/openhuman/agent/harness/session/turn/core.rssrc/openhuman/agent/harness/subagent_runner/ops/loop_.rssrc/openhuman/agent/harness/subagent_runner/ops/runner.rssrc/openhuman/agent/harness/subagent_runner/ops_tests.rssrc/openhuman/agent/harness/subagent_runner/types.rssrc/openhuman/agent/harness/tool_loop.rssrc/openhuman/agent/harness/tool_loop_tests.rssrc/openhuman/agent/prompts/sections.rssrc/openhuman/agent_memory/tools.rssrc/openhuman/agent_orchestration/subagent_sessions/types.rssrc/openhuman/agent_orchestration/tools/agent_prepare_context.rssrc/openhuman/agent_orchestration/tools/continue_subagent.rssrc/openhuman/agent_orchestration/tools/dispatch.rssrc/openhuman/agent_orchestration/tools/spawn_async_subagent.rssrc/openhuman/agent_orchestration/tools/spawn_subagent.rs
- continue_subagent: remove the stale AwaitingUser checkpoint file in the Incomplete arm too (best-effort), mirroring the Completed arm, so a resumed run that stops short again doesn't orphan the checkpoint on disk. - core_tests: vary the narration per call in RepeatedCallProvider so the repeat-OUTPUT guard keeps resetting and only the repeat-CALL guard can trip — genuinely isolating the call breaker (the prior constant narration let the output guard accumulate too, just firing later).
… it on success Addresses Codex P1 on tinyhumansai#4230: the repeat-call breaker halted any identical (tool, args) batch before dispatch, including polling tools like wait_subagent whose contract is to be re-invoked with identical args until the work finishes. A task outliving two timeout windows would have its third wait_subagent halted before collecting the result, misreporting a stuck turn. - Add is_repeat_call_exempt (wait_subagent) and reset() on both no-progress guards; exempt all-poll batches from both breakers. - Move the repeat-call breaker post-execution and gate it on success: identical *failing* calls now stay the failure guard's domain (with its per-class thresholds and richer halt message) instead of being preempted at 3. Restores run_tool_call_loop_halts_on_repeated_identical_failure. Tests: polling-exemption unit + reset test, end-to-end polling_tool_is_exempt (5 identical wait_subagent polls finish normally), and the tinyhumansai#4095 success-loop test still halts at the threshold.
…nd-incomplete-handback # Conflicts: # src/openhuman/agent/harness/engine/core.rs # src/openhuman/agent/harness/subagent_runner/ops/runner.rs
…at-call breaker doesn't pre-empt the cap The new repeat-call breaker halts identical (tool,args) loops at 3 reps, so cap/checkpoint tests that reached max_iterations via an identical call now trip the breaker first. Vary the call each turn (distinct args) so these tests still exercise the iteration-cap path: - channels IterativeToolProvider: embed the iteration index as an extra mock_price `step` arg (it already varied narration; the call itself was identical, which the narration-independent repeat-call guard caught). - agent turn_emits_checkpoint_at_max_iterations: vary the echo message by i. - run_tool_call_loop_returns_max_iterations_error: push varied forced responses.
Summary
RepeatCallGuard([Harness] Add circuit breaker for repeated identical tool calls #4088 / [Harness] Agents enter unproductive loops (repeated tool calls, no progress) #4095): a circuit breaker that halts an agent turn after 3 consecutive identical(tool, args)calls, independent of whether each call succeeds — closing the loop-detection gap the two existing guards miss.TurnStopdiscriminator +SubagentRunStatus::Incomplete([Harness] Escalate to orchestrator (Core) after repeated sub-agent failures #4091 / [Harness] Agents don't escalate / hand back control when stuck #4096): a stuck or capped sub-agent now hands the orchestrator a structured[SUBAGENT_INCOMPLETE]envelope (partial progress + blocker) instead of a success-framed result.spawn_subagent,spawn_async_subagent,continue_subagent,dispatch) and status consumers updated; one orchestrator-prompt rule added to relay incomplete results rather than re-run unchanged.Problem
Fuzz testing surfaced two agent-reliability failures:
RepeatFailureGuardresets on every success (tool_loop.rs), andRepeatOutputGuardhashes the assistant narration alongside the call, so trivially varied prose around an identical call resets its streak.SubagentRunStatus::Completedand the delegation tool wrapped it inToolResult::success(...). The partial-progress prose was delivered, but nothing marked it as incomplete — so a weak orchestrator could narrate it as done or silently re-spawn the same delegation.Solution
Loop detection (#4095). New
RepeatCallGuardintool_loop.rs, wired intoengine/core.rsbefore tool execution. It keys only on canonical(tool, args)(sorted-key JSON, so reordered keys can't evade it), is independent of success, and trips atREPEAT_CALL_THRESHOLD = 3. Any distinct intervening call resets the streak, so read→write→read and varied-arg enumeration never trip it.Stuck handback (#4096). New
TurnStopenum (Final/Halted/Cap/EarlyExit) replaces thehit_capbool onTurnEngineOutcome, giving callers an explicit reason a turn ended. The sub-agent runner mapsHalted/Capto a newSubagentRunStatus::Incomplete { reason }, carrying the partial-progress summary asoutput. Delegation tools render a[SUBAGENT_INCOMPLETE]envelope (success + partial progress, not the "nothing happened" error envelope) and aGROUNDING_BODYprompt rule tells the orchestrator to relay it and not re-run unchanged.Design notes:
TurnStopalso back-fills the new repeat-call halt site.Incompletevariant is compiler-enforced — every exhaustivematchonSubagentRunStatuswas updated.Submission Checklist
Halted+CapTurnStoptests, and an end-to-endstuck_subagent_returns_incomplete_statustest.TurnStopmapping, and the end-to-endIncompletepath.cargo-llvm-cov+diff-coverwas not run locally (heavy); the per-tool envelope arms are exercised indirectly and CI'srust-core-coverage/coverage-gateis the authoritative check.## Related— N/A: no matrix rows touched.Closes #NNNin the## Relatedsection.Impact
src/). No Tauri shell, frontend, or dependency changes.Incompletewith its partial progress instead of a success-framed result. Clean finishes and the iteration-cap checkpoint prose are unchanged.Related
REPEAT_CALL_THRESHOLDconfigurable ([Harness] Add circuit breaker for repeated identical tool calls #4088 stretch).AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
feat/harness-loop-and-incomplete-handbackValidation Run
pnpm --filter openhuman-app format:check— N/A, no frontend (app) changes.pnpm typecheck— N/A, no TypeScript changes.cargo test --libforengine::core(9),subagent_runner(57),agent::prompts(44),agent_orchestration(226),agent::harness::session(136) — all pass.cargo fmtapplied;cargo check --manifest-path Cargo.tomlclean (0 errors); full lib test build (--no-run) succeeds.app/src-taurichanges.Validation Blocked
command:pnpm rust:check(pre-push hook)error:fails in a fresh worktree (missing vendored CEF libs), unrelated to this changeimpact:pushed with--no-verify; no Tauri/CEF code touched, so the hook's check does not apply to this diff.Behavior Changes
Incompletewith partial progress.Parity Contract
AwaitingUserpauses, and the channel/CLIErrorCheckpointpath are unchanged. Existing 2-identical-call tests stay green (threshold is 3).RepeatCallGuardsits alongside the existingRepeatFailureGuard/RepeatOutputGuardwithout altering them;TurnStopreplaces thehit_capbool with equivalentCapsemantics at its single read site.Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests