Goal
Let the ClawBench harness load and execute tasks from xlang-ai/OSWorld, an OS-level agent benchmark covering macOS / Linux / Windows desktop tasks (file management, terminal commands, productivity apps, browser-in-a-VM). The environment scope differs from ClawBench (full OS vs. browser), but the trace bundle and scoring pipeline pattern transfers cleanly.
Scope
- VM provisioning: OSWorld ships VM disk images per OS. Package them as an opt-in container/VM layer alongside our existing harness containers (Dockerfile.openclaw pattern); clawbench-osworld-runner orchestrates VM boot per task.
- Task loader: clawbench.corpus.adapters.osworld ingests their task JSONs (evaluation_examples/) and maps their instruction + setup_steps + post_steps + evaluator schema onto our task schema.
- Deterministic scoring passthrough: their metric field references Python evaluator functions per task; invoke them and emit the results in run-meta.json. Skip our LLM judge for this corpus (OSWorld tasks have rule-based evaluators, similar to ClawMark).
- Trace adaptation: VM screen recording → recording.mp4; their step logs → actions.jsonl; we keep our standard 5-layer schema.
- CLI: clawbench run --corpus osworld --os ubuntu --model <m> runs end-to-end.
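A rough sketch of how the task loader and scoring passthrough items above might fit together. Everything here is an assumption for discussion: the ClawBenchTask dataclass, the id and expected keys, and the EVALUATORS registry are hypothetical stand-ins, and the real OSWorld JSON layout and evaluator module should be checked before relying on any of these names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


# Hypothetical ClawBench-side task record; the real task schema may differ.
@dataclass
class ClawBenchTask:
    task_id: str
    corpus: str
    prompt: str
    setup: List[Dict[str, Any]] = field(default_factory=list)
    teardown: List[Dict[str, Any]] = field(default_factory=list)
    evaluator: Dict[str, Any] = field(default_factory=dict)
    use_llm_judge: bool = False  # rule-based corpus, so the LLM judge stays off


def adapt_osworld_task(raw: Dict[str, Any]) -> ClawBenchTask:
    """Map one OSWorld-style task dict onto the hypothetical ClawBenchTask."""
    return ClawBenchTask(
        task_id=str(raw["id"]),  # assumed key name
        corpus="osworld",
        prompt=raw["instruction"],
        setup=list(raw.get("setup_steps", [])),
        teardown=list(raw.get("post_steps", [])),
        evaluator=dict(raw.get("evaluator", {})),
    )


# Hypothetical registry standing in for OSWorld's Python evaluator functions.
EVALUATORS: Dict[str, Callable[[Any, Any], float]] = {
    "exact_match": lambda result, expected: 1.0 if result == expected else 0.0,
}


def score(task: ClawBenchTask, result: Any) -> Dict[str, Any]:
    """Dispatch on the task's metric name and build a run-meta.json entry."""
    metric = task.evaluator["metric"]
    fn = EVALUATORS[metric]
    value = fn(result, task.evaluator.get("expected"))
    return {
        "task_id": task.task_id,
        "metric": metric,
        "score": value,
        "judge": "skipped",
    }


# Made-up task used only to exercise the sketch.
raw_task = json.loads("""
{"id": "demo-001",
 "instruction": "Rename report.txt to report-final.txt",
 "setup_steps": [{"type": "create_file", "path": "report.txt"}],
 "post_steps": [],
 "evaluator": {"metric": "exact_match", "expected": "report-final.txt"}}
""")
task = adapt_osworld_task(raw_task)
run_meta_entry = score(task, "report-final.txt")
print(json.dumps(run_meta_entry))
```

Keeping the evaluator config as an opaque dict means the adapter never has to interpret it; only the dispatch step needs to know which function a metric name refers to.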
Why now
Browser-only agent benchmarks are converging; the next differentiation axis is whether agents can leave the browser (open a terminal, use a file manager, drive a desktop app). Hosting OSWorld under our harness lets us tell a unified "browser + OS" story without forcing users to install two separate evaluation stacks.
Acceptance
- clawbench run --corpus osworld --os ubuntu --limit 3 --model claude-opus-4-7 boots the VM, runs 3 tasks, and produces 5-layer trace bundles.
- Scores match mm_agents/eval.py exactly on a sampled subset.
- eval/adapters/osworld.md + VM-provisioning README.
Out of scope
- Reimplementing their VM provisioning from scratch (defer to their snapshots).
- Cross-OS execution in a single batch (one OS per clawbench run invocation is fine).
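One possible way to check the acceptance item about matching mm_agents/eval.py exactly on a sampled subset, assuming it means per-task score parity between our harness and OSWorld's reference evaluation. The function names and the score dictionaries below are hypothetical; how reference scores actually get collected is left open.

```python
import random
from typing import Dict, List


def sample_task_ids(task_ids: List[str], k: int, seed: int = 0) -> List[str]:
    """Pick a reproducible subset of task ids to re-score."""
    rng = random.Random(seed)
    pool = sorted(task_ids)  # sort first so the sample does not depend on input order
    return sorted(rng.sample(pool, min(k, len(pool))))


def score_mismatches(
    ours: Dict[str, float],
    reference: Dict[str, float],
    task_ids: List[str],
) -> List[str]:
    """Return ids whose scores differ or that are missing on either side."""
    return [
        t
        for t in task_ids
        if t not in ours or t not in reference or ours[t] != reference[t]
    ]


# Made-up scores for illustration only.
ours = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 0.0}
reference = {"t1": 1.0, "t2": 0.0, "t3": 0.0, "t4": 0.0}

subset = sample_task_ids(list(ours), k=3, seed=7)
mismatches = score_mismatches(ours, reference, subset)
print(subset, mismatches)
```

Exact equality is used on purpose: rule-based evaluators should produce identical scores on both sides, so any non-empty mismatch list would fail the criterion.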