Merged
2 changes: 1 addition & 1 deletion .github/workflows/terminal-bench.yml
@@ -120,7 +120,7 @@ jobs:
cat "$RESULTS_FILE" | jq '.' || cat "$RESULTS_FILE"
echo ""
echo "Per-task summary:"
cat "$RESULTS_FILE" | jq -r '.trials[] | "\(.task_id): \(if .resolved then "✓ PASS" else "✗ FAIL" end)"' 2>/dev/null || echo "Failed to parse task details"
cat "$RESULTS_FILE" | jq -r '.trials[] | "\(.task_id): \(if .is_resolved then "✓ PASS" else "✗ FAIL" end)"' 2>/dev/null || echo "Failed to parse task details"
else
echo "No results.json found in runs/"
ls -la runs/
5 changes: 4 additions & 1 deletion Makefile
@@ -295,8 +295,9 @@ chromatic: node_modules/.installed ## Run Chromatic for visual regression testin
@bun x chromatic --exit-zero-on-changes

## Benchmarks
benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_ARGS to customize)
benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_TIMEOUT/TB_ARGS to customize)
@TB_DATASET=$${TB_DATASET:-terminal-bench-core==0.1.1}; \
TB_TIMEOUT=$${TB_TIMEOUT:-1800}; \
CONCURRENCY_FLAG=$${TB_CONCURRENCY:+--n-concurrent $$TB_CONCURRENCY}; \
LIVESTREAM_FLAG=$${TB_LIVESTREAM:+--livestream}; \
TASK_ID_FLAGS=""; \
@@ -317,10 +318,12 @@ benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB
done; \
echo "Selected task IDs: $$TASK_IDS"; \
fi; \
echo "Using timeout: $$TB_TIMEOUT seconds"; \
echo "Running Terminal-Bench with dataset $$TB_DATASET"; \
uvx terminal-bench run \
--dataset "$$TB_DATASET" \
--agent-import-path benchmarks.terminal_bench.cmux_agent:CmuxAgent \
--global-agent-timeout-sec $$TB_TIMEOUT \
$$CONCURRENCY_FLAG \
$$LIVESTREAM_FLAG \
$$TASK_ID_FLAGS \
107 changes: 107 additions & 0 deletions benchmarks/terminal_bench/README.md
@@ -0,0 +1,107 @@
# Terminal-Bench Integration

This directory contains the cmux agent adapter for [Terminal-Bench](https://github.com/benediktstroebl/terminal-bench), a benchmarking framework for evaluating agentic CLI/terminal capabilities.

## Quick Start

```bash
# Run full benchmark suite (80 tasks, ~2.5 hours)
make benchmark-terminal

# Run with sample of 5 tasks
TB_SAMPLE_SIZE=5 make benchmark-terminal

# Run specific tasks
make benchmark-terminal TB_ARGS="--task-id hello-world --task-id chess-best-move"

# Run with specific model
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-4"
```

## Configuration

### Environment Variables

- `TB_DATASET`: Dataset to use (default: `terminal-bench-core==0.1.1`)
- `TB_SAMPLE_SIZE`: Number of random tasks to run (default: all 80 tasks)
- `TB_CONCURRENCY`: Number of concurrent tasks (default: 4)
- `TB_LIVESTREAM`: Enable livestream mode (set to `1` to enable)
- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes)
- `TB_ARGS`: Additional arguments passed to terminal-bench
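
As a rough illustration of how these variables reach the command line, the Python sketch below mirrors the flag mapping in the `benchmark-terminal` Makefile recipe. `build_tb_command` is a hypothetical helper, not part of this repository. It ignores `TB_SAMPLE_SIZE` (which the recipe handles by selecting task IDs), and appending `TB_ARGS` last is an assumption.

```python
import shlex


def build_tb_command(env):
    """Map TB_* variables onto a terminal-bench argv list (illustrative only)."""
    cmd = [
        "uvx", "terminal-bench", "run",
        # Unset or empty values fall back to the documented defaults.
        "--dataset", env.get("TB_DATASET") or "terminal-bench-core==0.1.1",
        "--agent-import-path", "benchmarks.terminal_bench.cmux_agent:CmuxAgent",
        "--global-agent-timeout-sec", env.get("TB_TIMEOUT") or "1800",
    ]
    # Optional flags are emitted only when the variable is set and non-empty.
    if env.get("TB_CONCURRENCY"):
        cmd += ["--n-concurrent", env["TB_CONCURRENCY"]]
    if env.get("TB_LIVESTREAM"):
        cmd.append("--livestream")
    # Assumption: extra arguments are split shell-style and appended last.
    cmd += shlex.split(env.get("TB_ARGS", ""))
    return cmd


example = build_tb_command({"TB_TIMEOUT": "600", "TB_ARGS": "--task-id hello-world"})
print(shlex.join(example))
```

Passing a plain dictionary instead of `os.environ` keeps the mapping easy to inspect without touching the real environment.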

### Timeout Handling

The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents.

**Design Rationale:**

Based on analysis of Oct 30, 2025 nightly runs:
- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes
- 95th percentile: ~15 minutes
- Mean duration: ~6 minutes

The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts.
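
The headroom can be checked with simple arithmetic. The sketch below is only illustrative: `timeout_headroom` is a hypothetical helper, and the inputs are the figures quoted in this section.

```python
def timeout_headroom(timeout_sec: int, longest_success_sec: int) -> float:
    """Ratio of the configured timeout to the longest observed successful run."""
    if longest_success_sec <= 0:
        raise ValueError("longest_success_sec must be positive")
    return timeout_sec / longest_success_sec


# Figures quoted above: 1800 s default timeout, 1200 s longest successful task.
ratio = timeout_headroom(1800, 1200)
print(f"headroom: {ratio:.1f}x")  # prints "headroom: 1.5x"
```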

**Override timeout:**

```bash
# Run with 60 minute timeout for very complex tasks
TB_TIMEOUT=3600 make benchmark-terminal

# Run with shorter 10 minute timeout for quick iteration
TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5
```

**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration.

## Agent Configuration

The cmux agent supports the following kwargs (passed via `--agent-kwarg`):

- `model_name`: Model to use (e.g., `anthropic:claude-sonnet-4-5`, `openai:gpt-5-codex`)
- `thinking_level`: Thinking level (`off`, `low`, `medium`, `high`)
- `mode`: Agent mode (`plan`, `exec`)

**Example:**

```bash
make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai:gpt-5-codex --agent-kwarg thinking_level=high"
```
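
When several kwargs are involved, the `TB_ARGS` string can be generated rather than typed by hand. This is a minimal sketch; `agent_kwargs_to_tb_args` is a hypothetical helper, and it assumes each kwarg is passed as a repeated `--agent-kwarg key=value` pair, as in the example above.

```python
import shlex


def agent_kwargs_to_tb_args(kwargs):
    """Render a dict of agent kwargs as a shell-safe TB_ARGS string."""
    parts = []
    for key, value in kwargs.items():
        parts += ["--agent-kwarg", f"{key}={value}"]
    # shlex.join quotes any value that contains shell-special characters.
    return shlex.join(parts)


tb_args = agent_kwargs_to_tb_args(
    {"model_name": "openai:gpt-5-codex", "thinking_level": "high"}
)
print(tb_args)
```

The resulting string can then be passed as `make benchmark-terminal TB_ARGS="..."`.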

## Results

Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`:

- `results.json`: Aggregate results with pass/fail rates
- `run_metadata.json`: Run configuration and metadata
- `<task-id>/`: Per-task directories containing:
  - `sessions/agent.log`: Full agent execution log
  - `sessions/agent.cast`: Asciinema recording of agent session
  - `sessions/tests.log`: Test execution output
  - `results.json`: Per-trial results
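
For a quick per-task summary without `jq`, a small Python sketch along the following lines may help. It assumes the aggregate `results.json` holds a `trials` array whose entries carry `task_id` and `is_resolved`, which are the fields the CI workflow's `jq` filter reads. `summarize_trials` and `load_results` are hypothetical helpers.

```python
import json
from pathlib import Path


def summarize_trials(results):
    """Return one 'task_id: PASS/FAIL' line per trial in a parsed results dict."""
    lines = []
    for trial in results.get("trials", []):
        # Assumption: a missing or false is_resolved counts as a failure.
        status = "PASS" if trial.get("is_resolved") else "FAIL"
        lines.append(f"{trial.get('task_id', '<unknown>')}: {status}")
    return lines


def load_results(path):
    """Parse a results.json file from a run directory."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# In-memory example; real data would come from load_results("runs/<timestamp>/results.json").
sample = {"trials": [{"task_id": "hello-world", "is_resolved": True},
                     {"task_id": "chess-best-move", "is_resolved": False}]}
print("\n".join(summarize_trials(sample)))
```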

## CI/CD Integration

See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration.

**Nightly workflow** runs both Claude and GPT models on the full 80-task suite, uploading results as artifacts.

## Timeout Analysis (2025-10-30 Nightly Run)

Based on analysis of the Oct 30 nightly run (15-minute timeout):

- **27-35% of tasks hit the timeout**, which suggests the 15-minute limit was too aggressive
- **5-6 tasks passed their tests but were still flagged as timed out** (false negatives)
- **Mean duration**: 356s (Anthropic) / 438s (OpenAI)
- **Median duration**: 272s (Anthropic) / 299s (OpenAI)
- **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard`

**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%).
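
Statistics like those above can be recomputed from per-task durations. The sketch below shows one way to do so. `duration_stats` is a hypothetical helper, the sample durations are made up for illustration, and treating any duration at or above the limit as a timeout is an assumption.

```python
from statistics import mean, median


def duration_stats(durations_sec, timeout_sec):
    """Summarize task durations relative to a timeout (illustrative only)."""
    if not durations_sec:
        raise ValueError("durations_sec must not be empty")
    # Assumption: a task whose duration reaches the limit counts as timed out.
    timed_out = sum(1 for d in durations_sec if d >= timeout_sec)
    completed = [d for d in durations_sec if d < timeout_sec]
    return {
        "mean": mean(durations_sec),
        "median": median(durations_sec),
        "timeout_rate": timed_out / len(durations_sec),
        "longest_completed": max(completed, default=None),
    }


# Made-up durations in seconds, evaluated against a 15-minute (900 s) limit.
stats = duration_stats([120, 300, 900, 900], timeout_sec=900)
print(stats)
```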

## Files

- `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface
- `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI
- `cmux_payload.py`: Helper to package cmux app for containerized execution
- `cmux_setup.sh.j2`: Jinja2 template for agent installation script
- `sample_tasks.py`: Utility to randomly sample tasks from dataset
3 changes: 2 additions & 1 deletion benchmarks/terminal_bench/cmux-run.sh
@@ -102,7 +102,8 @@ cmd=(bun src/debug/agentSessionCli.ts
--workspace-path "${project_path}"
--workspace-id "${CMUX_WORKSPACE_ID}"
--model "${CMUX_MODEL}"
--mode "${CMUX_MODE}")
--mode "${CMUX_MODE}"
--json-streaming)

if [[ -n "${CMUX_TIMEOUT_MS}" ]]; then
cmd+=(--timeout "${CMUX_TIMEOUT_MS}")