feat: adapter — run OSWorld (OS-level agent eval, macOS/Linux/Windows) under the ClawBench harness #187

@reacher-z

Goal

Let the ClawBench harness load and execute tasks from xlang-ai/OSWorld, an OS-level agent benchmark covering macOS / Linux / Windows desktop tasks (file management, terminal commands, productivity apps, browser-in-a-VM). The environment scope differs from ClawBench (full OS vs. browser), but the trace bundle + scoring pipeline pattern transfers cleanly.

Scope

  • VM provisioning: OSWorld ships VM disk images per OS. Package these as an opt-in container/VM layer alongside our existing harness containers (Dockerfile.openclaw pattern); clawbench-osworld-runner orchestrates one VM boot per task.
  • Task loader: clawbench.corpus.adapters.osworld ingests their task JSONs (evaluation_examples/) and maps their instruction + setup_steps + post_steps + evaluator schema onto our task schema.
  • Deterministic scoring passthrough: their metric field references Python evaluator functions per task; invoke those and emit the results in run-meta.json. Skip our LLM judge for this corpus (OSWorld tasks have rule-based evaluators, similar to ClawMark).
  • Trace adaptation: VM screen recording → recording.mp4; their step logs → actions.jsonl; we keep our standard 5-layer schema.
  • CLI: clawbench run --corpus osworld --os ubuntu --model <m> runs end-to-end.
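
A rough sketch of the task-loader mapping described above, to make the shape of the adapter concrete. The source field names (instruction, setup_steps, post_steps, evaluator) are the ones listed in this issue; the `ClawTask` dataclass, its field names, the `id` key, and the sample task are all hypothetical and stand in for whatever our real task schema looks like.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ClawTask:
    """Hypothetical ClawBench task record; the real schema may differ."""
    task_id: str
    corpus: str
    prompt: str
    setup: list[dict[str, Any]] = field(default_factory=list)
    teardown: list[dict[str, Any]] = field(default_factory=list)
    evaluator: dict[str, Any] = field(default_factory=dict)
    use_llm_judge: bool = False  # rule-based corpus, so the LLM judge is skipped


def map_osworld_task(raw: dict[str, Any]) -> ClawTask:
    """Map one OSWorld-style task dict onto the hypothetical ClawTask."""
    if "instruction" not in raw:
        raise ValueError("task is missing 'instruction'")
    return ClawTask(
        task_id=str(raw.get("id", "")),  # 'id' key is an assumption
        corpus="osworld",
        prompt=raw["instruction"],
        setup=list(raw.get("setup_steps", [])),
        teardown=list(raw.get("post_steps", [])),
        evaluator=dict(raw.get("evaluator", {})),  # passed through untouched
    )


# Made-up task for illustration only; not taken from the OSWorld corpus.
example = {
    "id": "demo-001",
    "instruction": "Rename report.txt to report-final.txt",
    "setup_steps": [{"type": "create_file", "path": "report.txt"}],
    "evaluator": {"metric": "file_exists"},
}
task = map_osworld_task(example)
print(task.prompt, task.use_llm_judge)
```

Passing the evaluator block through unchanged keeps the adapter out of the scoring path, which is what the deterministic-scoring bullet asks for.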

Why now

Browser-only agent benchmarks are converging; the next differentiation axis is whether agents can leave the browser (open a terminal, use a file manager, drive a desktop app). Hosting OSWorld under our harness lets us tell a unified "browser + OS" story without forcing users to install two separate evaluation stacks.

Acceptance

  • clawbench run --corpus osworld --os ubuntu --limit 3 --model claude-opus-4-7 boots the VM, runs 3 tasks, produces 5-layer trace bundles
  • Deterministic scores match upstream mm_agents/eval.py exactly on a sampled subset
  • Docs: eval/adapters/osworld.md + VM-provisioning README
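
One possible way to check the "scores match exactly" criterion is a small parity helper like the sketch below. It assumes both sides can be reduced to a mapping of task id to numeric score; the function name, the dict shape, and the sample scores are hypothetical and not part of either codebase.

```python
import math


def score_mismatches(ours: dict[str, float],
                     upstream: dict[str, float],
                     tol: float = 0.0) -> list[str]:
    """Return task ids whose scores differ or are missing on either side."""
    mismatches = []
    for task_id in sorted(set(ours) | set(upstream)):
        a, b = ours.get(task_id), upstream.get(task_id)
        # tol=0.0 means the two scores must be exactly equal.
        if a is None or b is None or not math.isclose(
            a, b, rel_tol=0.0, abs_tol=tol
        ):
            mismatches.append(task_id)
    return mismatches


# Made-up scores for illustration only.
ours = {"t1": 1.0, "t2": 0.0, "t3": 1.0}
upstream = {"t1": 1.0, "t2": 1.0}
print(score_mismatches(ours, upstream))  # ['t2', 't3']
```

Treating a task that is missing on either side as a mismatch keeps a silently skipped task from passing the acceptance check.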

Out of scope

  • Reimplementing their VM provisioning from scratch (defer to their snapshots).
  • Cross-OS execution in a single batch (one OS per clawbench run invocation is fine).

Labels: enhancement