Summary
In large-scale aggressive fuzz testing with our internal agent-simulation harness, the code-execution agent is the worst-performing high-volume agent. Its outcomes are bimodal — it either completes cleanly or fails hard — and a large share of runs fall in the "fails hard" mode.
The failures cluster into a few concrete, fixable patterns: the agent re-runs the same command many times without progress and never surfaces its output, hits a sandbox or dependency wall and stops without a resolution, ends the turn on a tool call instead of a summary, and scans files blindly instead of using the code-navigation tool first.
Aggregate signal (fuzz-testing batch)
Across a large simulated batch of code-execution runs:
- ~51% of runs were problematic (low score) — roughly 2× the rate of the other high-volume agents.
- Outcome split was bimodal: ~29% clear failures, ~23% clean successes, relatively few in the middle.
- Among the failures, the dominant gap themes were: repeated/looping commands (~28%), no final answer/summary (~21%), policy/sandbox block with no resolution (~15%), and dependency-install failures (~7%).
Failure patterns (synthetic examples)
1. Re-runs the same command repeatedly, never reports output
goal: report the row count of a data file
→ run("python count_rows.py") # output not surfaced to the user
→ run("python count_rows.py") # identical command
→ run("python count_rows.py") # repeats ~10×, result never reported
(turn ends — user never sees the count)
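One possible mitigation for this pattern is a per-turn guard that caps identical re-runs and always returns the command output in a form meant for the user. The sketch below is an assumption about how such a guard could look, not the harness's or the agent's actual code; `RepeatGuard`, `run_with_guard`, `max_repeats`, and the `execute` callback are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RepeatGuard:
    """Tracks commands issued in one turn and flags identical re-runs."""

    max_repeats: int = 2  # hypothetical threshold, not a value from the report
    counts: Counter = field(default_factory=Counter)

    def allow(self, command: str) -> bool:
        """Return True while the command is still within its repeat budget."""
        key = " ".join(command.split())  # normalize whitespace so trivial variants match
        self.counts[key] += 1
        return self.counts[key] <= self.max_repeats


def run_with_guard(guard: RepeatGuard, command: str, execute: Callable[[str], str]) -> str:
    """Run a command through the guard and return text intended for the user."""
    if not guard.allow(command):
        # Stop looping and say so, instead of silently re-running.
        return f"Stopped: '{command}' already ran {guard.max_repeats} times without progress."
    output = execute(command)
    # Surface the output so the result is never dropped.
    return f"$ {command}\n{output}"
```

With `max_repeats=2`, the third identical call returns the "Stopped" message rather than executing again, and every successful call carries its output back to the caller.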
2. Sandbox / dependency wall, no resolution or workaround
→ run("pip install <package>") # blocked: sandbox / no network
← blocked
→ run("pip install <package>") # same block, retried
(stops; no "blocked because X — here's an alternative / how to enable it")
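The missing "blocked because X" message could be produced by classifying the failure once and reporting it instead of retrying. The following is a minimal sketch under stated assumptions: `CommandResult`, `classify_block`, `blocked_message`, and the marker strings in `BLOCK_MARKERS` are all hypothetical and would need to match whatever the real sandbox actually emits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandResult:
    command: str
    ok: bool
    stderr: str = ""


# Hypothetical substrings; real sandbox error text may differ.
BLOCK_MARKERS = {
    "network": ("no network", "connection refused", "name resolution"),
    "policy": ("blocked by sandbox", "permission denied", "operation not permitted"),
}


def classify_block(result: CommandResult) -> Optional[str]:
    """Return a coarse block reason, or None if the command was not blocked."""
    if result.ok:
        return None
    text = result.stderr.lower()
    for reason, markers in BLOCK_MARKERS.items():
        if any(marker in text for marker in markers):
            return reason
    return None


def blocked_message(result: CommandResult) -> Optional[str]:
    """Build a user-facing explanation for a blocked command, or None."""
    reason = classify_block(result)
    if reason is None:
        return None
    return (
        f"Blocked: `{result.command}` failed ({reason} restriction). "
        "Not retrying. Options: use an already-installed alternative, "
        "or enable access for this sandbox and re-run."
    )
```

The design choice here is that a recognized block ends the retry loop immediately and yields a message, so the turn closes with an explanation rather than a second identical failure.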
3. Turn ends on a tool call instead of a summary
→ run("build") → ok
→ run("test") → ok
(last item is a tool call; no closing message — user gets no answer)
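A simple acceptance check for this pattern is to inspect the last item of the turn and append a closing message when it is a tool call. The sketch below assumes a transcript represented as a list of dicts with `type`, `name`, and `status` keys; that shape and the function names are illustrative assumptions, not the harness's real schema.

```python
from typing import Dict, List

Transcript = List[Dict[str, str]]


def ends_on_tool_call(transcript: Transcript) -> bool:
    """True if the turn's final item is a tool call rather than a message."""
    return bool(transcript) and transcript[-1]["type"] == "tool_call"


def ensure_closing_message(transcript: Transcript) -> Transcript:
    """Return the transcript, adding a summary message if it ends on a tool call."""
    if not ends_on_tool_call(transcript):
        return transcript
    # Summarize every tool call so the user sees what ran and how it ended.
    steps = [
        f"{item['name']}: {item['status']}"
        for item in transcript
        if item["type"] == "tool_call"
    ]
    summary = "Summary: " + "; ".join(steps)
    # Build a new list so the caller's transcript is left unmodified.
    return transcript + [{"type": "message", "text": summary}]
```

For the build/test trace above, this would append a message reading "Summary: build: ok; test: ok". In practice the agent should write a proper answer itself; a fallback like this only guarantees the user is not left with nothing.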
4. Doesn't use the code-navigation entry point first
goal: locate where a symbol is defined
→ read_file("a.py") → read_file("b.py") → read_file("c.py") # blind scanning
(never calls codegraph_search / the intended navigation tool to jump directly)
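This pattern can be detected from the ordered list of tool names in a run. The sketch below flags a run that performs more than a small number of `read_file` calls before any `codegraph_search` call. The tool names come from the trace above; the function name and the `max_blind_reads` threshold are hypothetical.

```python
from typing import Iterable


def navigation_first_violation(tool_calls: Iterable[str], max_blind_reads: int = 2) -> bool:
    """True if the run reads too many files before using the navigation tool."""
    blind_reads = 0
    for name in tool_calls:
        if name == "codegraph_search":
            # Navigation tool was used in time; later reads are targeted.
            return False
        if name == "read_file":
            blind_reads += 1
            if blind_reads > max_blind_reads:
                return True
    return False
```

The three-file scan in the trace above would be flagged with the default threshold, while a run that calls `codegraph_search` first and then reads files would not.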
Manifestation of existing issues
This agent is where several tracked failure classes concentrate:
Suggested fixes / acceptance
Surfaced by our internal agent-simulation harness during large-scale, aggressive fuzz testing of agent behaviors. All examples above are synthetic and contain no real data.