[Harness] Code-execution agent: high failure rate (command loops, unreported output, sandbox/dependency dead-ends) #4119

## Summary

In large-scale, aggressive fuzz testing with our internal agent-simulation harness, the **code-execution agent is the worst-performing high-volume agent**. Its outcomes are **bimodal**: a run either completes cleanly or fails hard, and a large share of runs land in the failing mode.

The failures cluster into a few concrete, fixable patterns: the agent **re-runs the same command many times without progress**, **never surfaces the command output**, hits a **sandbox / dependency wall with no resolution**, **ends the turn on a tool call instead of a summary**, and **scans files blindly instead of starting from the code-navigation tool**.

## Aggregate signal (fuzz-testing batch)

Across a large simulated batch of code-execution runs:

- **~51% of runs were problematic** (low score) — roughly **2× the rate** of the other high-volume agents.
- Outcome split was bimodal: ~29% clear failures, ~23% clean successes, relatively few in the middle.
- Among the failures, the dominant gap themes were: repeated/looping commands (**~28%**), no final answer/summary (**~21%**), policy/sandbox block with no resolution (**~15%**), and dependency-install failures (**~7%**).

## Failure patterns (synthetic examples)

**1. Re-runs the same command repeatedly, never reports output**

```
goal: report the row count of a data file
→ run("python count_rows.py")      # output not surfaced to the user
→ run("python count_rows.py")      # identical command
→ run("python count_rows.py")      # repeats ~10×, result never reported
(turn ends — user never sees the count)
```
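
A loop guard for this pattern could look roughly like the sketch below. It is a minimal illustration, not the harness's actual implementation: `GuardedRunner`, `CommandResult`, `LoopGuardTripped`, and the `max_repeats` threshold (the "K" in the suggested fixes) are hypothetical names, and `run` stands in for whatever function really executes a command.

```python
from collections import Counter
from typing import Callable, Dict, NamedTuple


class CommandResult(NamedTuple):
    """Hypothetical result shape: everything the user should eventually see."""
    stdout: str
    stderr: str
    exit_code: int


class LoopGuardTripped(Exception):
    """Raised when an identical command is requested more than max_repeats times."""


class GuardedRunner:
    """Wraps a command runner and refuses to re-run an identical command forever."""

    def __init__(self, run: Callable[[str], CommandResult], max_repeats: int = 3):
        self._run = run
        self._max_repeats = max_repeats
        self._seen: Counter = Counter()
        self._last: Dict[str, CommandResult] = {}

    def __call__(self, command: str) -> CommandResult:
        self._seen[command] += 1
        if self._seen[command] > self._max_repeats:
            # Stop and report the last captured result instead of re-running.
            last = self._last.get(command)
            detail = (
                "no result was captured"
                if last is None
                else f"last exit code {last.exit_code}, last stdout {last.stdout!r}"
            )
            raise LoopGuardTripped(
                f"{command!r} already ran {self._max_repeats} times; {detail}"
            )
        result = self._run(command)
        self._last[command] = result
        return result
```

The exception message carries the last captured output, so whatever handles it can relay the result to the user rather than ending the turn silently.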

**2. Sandbox / dependency wall, no resolution or workaround**

```
→ run("pip install <package>")     # blocked: sandbox / no network
← blocked
→ run("pip install <package>")     # same block, retried
(stops; no "blocked because X — here's an alternative / how to enable it")
```

**3. Turn ends on a tool call instead of a summary**

```
→ run("build")        → ok
→ run("test")         → ok
(last item is a tool call; no closing message — user gets no answer)
```

**4. Doesn't use the code-navigation entry point first**

```
goal: locate where a symbol is defined
→ read_file("a.py") → read_file("b.py") → read_file("c.py")   # blind scanning
(never calls codegraph_search / the intended navigation tool to jump directly)
```

## Manifestation of existing issues

This agent is where several tracked failure classes concentrate:

- Looping with no progress — #4095
- Runs ending without resolution / final summary — #4097
- Policy-blocked actions with no message or workaround — #4094

## Suggested fixes / acceptance

- **Always surface command output** (stdout/stderr/exit code) back to the model and into the final message.
- **Loop guard** on repeated identical commands (see #4088): after K repeats, stop and report instead of re-running.
- **Sandbox/dependency blocks** return a clear reason + the permitted alternative (e.g. preinstalled packages, offline mode), and the agent relays it (see #4094).
- **Terminal-step contract**: a code-execution turn must end with a user-facing summary of what ran and what resulted — never on a bare tool call (see #4093).
- **Navigation-first**: prefer the code-navigation tool as the first step before blind file reads.
- Harness regressions for: no-unreported-output, no >K identical-command loops, always-final-summary.
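
The three harness regressions in the last bullet could be expressed as a single trace check, sketched below. The event format (dicts with a `type` of `tool_call`, `tool_result`, or `message`) and the violation labels are assumptions made for illustration, and the unreported-output test is a crude heuristic: it only checks that the last non-empty stdout appears in the final message.

```python
from itertools import groupby
from typing import Dict, List


def check_trace(events: List[Dict[str, str]], max_repeats: int = 3) -> List[str]:
    """Return the regression labels that a (hypothetical) run trace violates."""
    violations = []

    # no >K identical-command loops: longest run of consecutive identical commands
    commands = [e["command"] for e in events if e["type"] == "tool_call"]
    longest = max((len(list(group)) for _, group in groupby(commands)), default=0)
    if longest > max_repeats:
        violations.append("identical-command-loop")

    # always-final-summary: the last event must be a non-empty user-facing message
    last = events[-1] if events else None
    has_summary = (
        last is not None and last["type"] == "message" and last["text"].strip() != ""
    )
    if not has_summary:
        violations.append("missing-final-summary")

    # no-unreported-output: the last non-empty stdout must appear in the summary
    outputs = [
        e["stdout"].strip()
        for e in events
        if e["type"] == "tool_result" and e["stdout"].strip()
    ]
    if outputs:
        summary = last["text"] if has_summary else ""
        if outputs[-1] not in summary:
            violations.append("unreported-output")

    return violations
```

For example, a trace that ends with the message "The file has 1042 rows." after a single `tool_result` whose stdout is `1042` passes all three checks, while the ten-fold repeat from pattern 1 with no closing message fails all three.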

---
_Surfaced by our internal agent-simulation harness during large-scale, aggressive fuzz testing of agent behaviors. All examples above are synthetic and contain no real data._