[Harness] Code-execution agent: high failure rate (command loops, unreported output, sandbox/dependency dead-ends) #4119

Description

@senamakel

Summary

In large-scale aggressive fuzz testing with our internal agent-simulation harness, the code-execution agent is the worst-performing high-volume agent. Its outcomes are bimodal — it either completes cleanly or fails hard — and a large share of runs fall in the "fails hard" mode.

The failures cluster into a few concrete, fixable patterns: it re-runs the same command many times without progress, never surfaces the command output, hits a sandbox / dependency wall with no resolution, and ends the turn on a tool call instead of a summary.

Aggregate signal (fuzz-testing batch)

Across a large simulated batch of code-execution runs:

  • ~51% of runs were problematic (low score) — roughly 2× the rate of the other high-volume agents.
  • Outcome split was bimodal: ~29% clear failures, ~23% clean successes, relatively few in the middle.
  • Among the failures, the dominant gap themes were: repeated/looping commands (~28%), no final answer/summary (~21%), policy/sandbox block with no resolution (~15%), and dependency-install failures (~7%).

Failure patterns (synthetic examples)

1. Re-runs the same command repeatedly, never reports output

goal: report the row count of a data file
→ run("python count_rows.py")      # output not surfaced to the user
→ run("python count_rows.py")      # identical command
→ run("python count_rows.py")      # repeats ~10×, result never reported
(turn ends — user never sees the count)
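One way to catch this pattern is a small guard in the tool loop. The sketch below is illustrative, not the harness's actual code: `RepeatGuard`, `should_stop`, and `max_repeats` are hypothetical names, and the `"1024\n"` output is a placeholder value.

```python
from collections import Counter


class RepeatGuard:
    """Hypothetical helper: flags a command that keeps returning the same output."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def should_stop(self, command: str, output: str) -> bool:
        # An identical command with identical output means no progress was made.
        key = (command.strip(), output)
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats


guard = RepeatGuard(max_repeats=2)
# Placeholder output; in a real run this would be the captured stdout.
results = [guard.should_stop("python count_rows.py", "1024\n") for _ in range(4)]
print(results)
```

When the guard trips, the agent would stop re-running the command and report the last captured output to the user instead.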

2. Sandbox / dependency wall, no resolution or workaround

→ run("pip install <package>")     # blocked: sandbox / no network
← blocked
→ run("pip install <package>")     # same block, retried
(stops; no "blocked because X — here's an alternative / how to enable it")
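A possible mitigation is to classify the block once and turn it into a user-facing explanation instead of retrying. This is a minimal sketch under assumptions: the error substrings are examples of what a blocked install might print, not an exhaustive or verified list, and `classify_block` / `blocked_message` are hypothetical names.

```python
import re
from typing import Optional

# Illustrative patterns only; real sandbox error text may differ.
BLOCK_PATTERNS = {
    "no network": re.compile(
        r"network is unreachable|temporary failure in name resolution|connection timed out",
        re.IGNORECASE,
    ),
    "sandbox policy": re.compile(
        r"permission denied|operation not permitted|blocked", re.IGNORECASE
    ),
}


def classify_block(stderr: str) -> Optional[str]:
    """Return a short reason if the error text looks like a sandbox/network block."""
    for reason, pattern in BLOCK_PATTERNS.items():
        if pattern.search(stderr):
            return reason
    return None


def blocked_message(command: str, stderr: str) -> Optional[str]:
    """Build the closing explanation the agent should give instead of retrying."""
    reason = classify_block(stderr)
    if reason is None:
        return None
    return (
        f"`{command}` was blocked ({reason}). I did not retry it. "
        "Options: enable access for this step, or continue without the package."
    )


msg = blocked_message("pip install <package>", "Temporary failure in name resolution")
print(msg)
```

The point of the design is that a recognized block ends the retry loop immediately and always produces a "blocked because X, here is an alternative" message.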

3. Turn ends on a tool call instead of a summary

→ run("build")        → ok
→ run("test")         → ok
(last item is a tool call; no closing message — user gets no answer)
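This pattern is easy to check for mechanically. The sketch below assumes a turn is recorded as a list of dicts with a `type` field (`"tool_call"` or `"message"`) and a `text` field; that shape and the name `ends_with_summary` are assumptions for illustration, not the harness's real schema.

```python
from typing import Dict, List


def ends_with_summary(turn: List[Dict[str, str]]) -> bool:
    """True only if the last item in the turn is a non-empty assistant message."""
    if not turn:
        return False
    last = turn[-1]
    return last.get("type") == "message" and bool(last.get("text", "").strip())


# Mirrors the trace above: two successful tool calls and no closing message.
bad_turn = [
    {"type": "tool_call", "text": "build"},
    {"type": "tool_call", "text": "test"},
]
good_turn = bad_turn + [{"type": "message", "text": "Build and tests passed."}]

print(ends_with_summary(bad_turn), ends_with_summary(good_turn))
```

A check like this could gate turn completion: if it returns `False`, the agent is prompted for a closing summary before the turn is allowed to end.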

4. Doesn't use the code-navigation entry point first

goal: locate where a symbol is defined
→ read_file("a.py") → read_file("b.py") → read_file("c.py")   # blind scanning
(never calls codegraph_search / the intended navigation tool to jump directly)
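A simple trace check can flag this behavior. The sketch assumes the tool-call history is available as an ordered list of tool names; `is_blind_scan` and the `max_reads` threshold are hypothetical, while `read_file` and `codegraph_search` are the tool names from the trace above.

```python
from typing import List


def is_blind_scan(
    calls: List[str], nav_tool: str = "codegraph_search", max_reads: int = 2
) -> bool:
    """True if more than max_reads read_file calls happen before any navigation call."""
    reads = 0
    for name in calls:
        if name == nav_tool:
            # Navigation was tried first (or early enough); later reads are fine.
            return False
        if name == "read_file":
            reads += 1
            if reads > max_reads:
                return True
    return False


print(is_blind_scan(["read_file", "read_file", "read_file"]))
print(is_blind_scan(["codegraph_search", "read_file", "read_file", "read_file"]))
```

The threshold is a judgment call: a read or two before navigating may be reasonable, so the check only fires on sustained scanning.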

Manifestation of existing issues

This agent is where several tracked failure classes concentrate:

Suggested fixes / acceptance


Surfaced by our internal agent-simulation harness during large-scale, aggressive fuzz testing of agent behaviors. All examples above are synthetic and contain no real data.

Metadata

Labels

agent-reliability (Agent reliability / behavior), harness (Agent harness / orchestration)

Projects

Status
Todo
