diff --git a/.github/workflows/terminal-bench.yml b/.github/workflows/terminal-bench.yml index 1f99b42ae..666a28686 100644 --- a/.github/workflows/terminal-bench.yml +++ b/.github/workflows/terminal-bench.yml @@ -120,7 +120,7 @@ jobs: cat "$RESULTS_FILE" | jq '.' || cat "$RESULTS_FILE" echo "" echo "Per-task summary:" - cat "$RESULTS_FILE" | jq -r '.trials[] | "\(.task_id): \(if .resolved then "✓ PASS" else "✗ FAIL" end)"' 2>/dev/null || echo "Failed to parse task details" + cat "$RESULTS_FILE" | jq -r '.trials[] | "\(.task_id): \(if .is_resolved then "✓ PASS" else "✗ FAIL" end)"' 2>/dev/null || echo "Failed to parse task details" else echo "No results.json found in runs/" ls -la runs/ diff --git a/Makefile b/Makefile index 24cd13165..d3df25aee 100644 --- a/Makefile +++ b/Makefile @@ -295,8 +295,9 @@ chromatic: node_modules/.installed ## Run Chromatic for visual regression testin @bun x chromatic --exit-zero-on-changes ## Benchmarks -benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_ARGS to customize) +benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB_SAMPLE_SIZE/TB_TIMEOUT/TB_ARGS to customize) @TB_DATASET=$${TB_DATASET:-terminal-bench-core==0.1.1}; \ + TB_TIMEOUT=$${TB_TIMEOUT:-1800}; \ CONCURRENCY_FLAG=$${TB_CONCURRENCY:+--n-concurrent $$TB_CONCURRENCY}; \ LIVESTREAM_FLAG=$${TB_LIVESTREAM:+--livestream}; \ TASK_ID_FLAGS=""; \ @@ -317,10 +318,12 @@ benchmark-terminal: ## Run Terminal-Bench with the cmux agent (use TB_DATASET/TB done; \ echo "Selected task IDs: $$TASK_IDS"; \ fi; \ + echo "Using timeout: $$TB_TIMEOUT seconds"; \ echo "Running Terminal-Bench with dataset $$TB_DATASET"; \ uvx terminal-bench run \ --dataset "$$TB_DATASET" \ --agent-import-path benchmarks.terminal_bench.cmux_agent:CmuxAgent \ + --global-agent-timeout-sec $$TB_TIMEOUT \ $$CONCURRENCY_FLAG \ $$LIVESTREAM_FLAG \ $$TASK_ID_FLAGS \ diff --git a/benchmarks/terminal_bench/README.md b/benchmarks/terminal_bench/README.md new file mode 100644 index 000000000..0c3d727b5 --- /dev/null +++ b/benchmarks/terminal_bench/README.md @@ -0,0 +1,107 @@ +# Terminal-Bench Integration + +This directory contains the cmux agent adapter for [Terminal-Bench](https://github.com/benediktstroebl/terminal-bench), a benchmarking framework for evaluating agentic CLI/terminal capabilities. + +## Quick Start + +```bash +# Run full benchmark suite (80 tasks, ~2.5 hours) +make benchmark-terminal + +# Run with sample of 5 tasks +TB_SAMPLE_SIZE=5 make benchmark-terminal + +# Run specific tasks +make benchmark-terminal TB_ARGS="--task-id hello-world --task-id chess-best-move" + +# Run with specific model +make benchmark-terminal TB_ARGS="--agent-kwarg model_name=anthropic:claude-opus-4" +``` + +## Configuration + +### Environment Variables + +- `TB_DATASET`: Dataset to use (default: `terminal-bench-core==0.1.1`) +- `TB_SAMPLE_SIZE`: Number of random tasks to run (default: all 80 tasks) +- `TB_CONCURRENCY`: Number of concurrent tasks (default: 4) +- `TB_LIVESTREAM`: Enable livestream mode (set to `1` to enable) +- `TB_TIMEOUT`: Global timeout in seconds (default: 1800 = 30 minutes) +- `TB_ARGS`: Additional arguments passed to terminal-bench + +### Timeout Handling + +The benchmark uses a **global timeout** applied to all tasks. The default is **30 minutes (1800 seconds)**, which provides sufficient time for most tasks while catching genuinely stuck agents. + +**Design Rationale:** + +Based on analysis of Oct 30, 2025 nightly runs: +- Longest successful task: `blind-maze-explorer-algorithm.hard` at 20 minutes +- 95th percentile: ~15 minutes +- Mean duration: ~6 minutes + +The 30-minute default provides comfortable headroom for complex tasks without excessive wait times for failed attempts. + +**Override timeout:** + +```bash +# Run with 60 minute timeout for very complex tasks +TB_TIMEOUT=3600 make benchmark-terminal + +# Run with shorter 10 minute timeout for quick iteration +TB_TIMEOUT=600 make benchmark-terminal TB_SAMPLE_SIZE=5 +``` + +**Note:** We prefer global timeout defaults over per-task configuration to avoid complexity and maintenance burden. If you find tasks consistently timing out, increase `TB_TIMEOUT` rather than adding per-task configuration. + +## Agent Configuration + +The cmux agent supports the following kwargs (passed via `--agent-kwarg`): + +- `model_name`: Model to use (e.g., `anthropic:claude-sonnet-4-5`, `openai:gpt-5-codex`) +- `thinking_level`: Thinking level (`off`, `low`, `medium`, `high`) +- `mode`: Agent mode (`plan`, `exec`) + +**Example:** + +```bash +make benchmark-terminal TB_ARGS="--agent-kwarg model_name=openai:gpt-5-codex --agent-kwarg thinking_level=high" +``` + +## Results + +Results are saved to `runs/YYYY-MM-DD__HH-MM-SS/`: + +- `results.json`: Aggregate results with pass/fail rates +- `run_metadata.json`: Run configuration and metadata +- `/`: Per-task directories containing: + - `sessions/agent.log`: Full agent execution log + - `sessions/agent.cast`: Asciinema recording of agent session + - `sessions/tests.log`: Test execution output + - `results.json`: Per-trial results + +## CI/CD Integration + +See `.github/workflows/terminal-bench.yml` and `.github/workflows/nightly-terminal-bench.yml` for GitHub Actions integration. + +**Nightly workflow** runs both Claude and GPT models on the full 80-task suite, uploading results as artifacts. + +## Timeout Analysis (2025-10-30 Nightly Run) + +Based on analysis of the Oct 30 nightly run (15-minute timeout): + +- **27-35% of tasks hit timeout** (too aggressive) +- **5-6 tasks passed tests but hit timeout flag** (false negatives) +- **Mean duration**: 356s (Anthropic) / 438s (OpenAI) +- **Median duration**: 272s (Anthropic) / 299s (OpenAI) +- **Longest successful**: 1200s (20 minutes) for `blind-maze-explorer-algorithm.hard` + +**Impact of 30-minute timeout**: Expected to reduce false timeout failures by ~50% and improve pass rates by 10-15 percentage points (from ~42% to ~52-57%). + +## Files + +- `cmux_agent.py`: Main agent adapter implementing Terminal-Bench's agent interface +- `cmux-run.sh`: Shell script that sets up environment and invokes cmux CLI +- `cmux_payload.py`: Helper to package cmux app for containerized execution +- `cmux_setup.sh.j2`: Jinja2 template for agent installation script +- `sample_tasks.py`: Utility to randomly sample tasks from dataset diff --git a/benchmarks/terminal_bench/cmux-run.sh b/benchmarks/terminal_bench/cmux-run.sh index 4aaa69895..7c7889bcf 100644 --- a/benchmarks/terminal_bench/cmux-run.sh +++ b/benchmarks/terminal_bench/cmux-run.sh @@ -102,7 +102,8 @@ cmd=(bun src/debug/agentSessionCli.ts --workspace-path "${project_path}" --workspace-id "${CMUX_WORKSPACE_ID}" --model "${CMUX_MODEL}" - --mode "${CMUX_MODE}") + --mode "${CMUX_MODE}" + --json-streaming) if [[ -n "${CMUX_TIMEOUT_MS}" ]]; then cmd+=(--timeout "${CMUX_TIMEOUT_MS}")