feat: adapter — run OSWorld (OS-level agent eval, macOS/Linux/Windows) under the ClawBench harness

## Goal

Let the ClawBench harness load and execute tasks from **[xlang-ai/OSWorld](https://github.com/xlang-ai/OSWorld)**, an OS-level agent benchmark covering macOS / Linux / Windows desktop tasks (file management, terminal commands, productivity apps, browser-in-a-VM). The environment scope differs from ClawBench (full OS vs. browser), but the trace-bundle and scoring-pipeline pattern transfers cleanly.

## Scope

- **VM provisioning**: OSWorld ships VM disk images per OS. Package these as an opt-in container/VM layer alongside our existing harness containers (`Dockerfile.openclaw` pattern); `clawbench-osworld-runner` orchestrates the VM boot for each task.
- **Task loader**: `clawbench.corpus.adapters.osworld` ingests their task JSONs (`evaluation_examples/`) and maps their `instruction + setup_steps + post_steps + evaluator` schema onto our task schema.
- **Deterministic scoring passthrough**: each task's `metric` field references a Python evaluator function; invoke it and emit the result in `run-meta.json`. **Skip** our LLM judge for this corpus (OSWorld tasks have rule-based evaluators, similar to ClawMark).
- **Trace adaptation**: VM screen recording → `recording.mp4`; their `step` logs → `actions.jsonl`; our standard 5-layer schema stays unchanged.
- **CLI**: `clawbench run --corpus osworld --os ubuntu --model <m>` runs end-to-end.
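
To make the task-loader and trace-adaptation bullets concrete, here is a minimal sketch of the mapping. Everything in it is an assumption for discussion: `ClawTask`, `from_osworld`, and `to_action_line` are hypothetical names, the input keys (`instruction`, `setup_steps`, `post_steps`, `evaluator`) are taken from this issue's wording rather than a verified upstream schema, and the real ClawBench task schema may look different.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ClawTask:
    """Hypothetical ClawBench task record; the real schema may differ."""
    task_id: str
    corpus: str
    prompt: str
    setup: list[dict[str, Any]] = field(default_factory=list)
    teardown: list[dict[str, Any]] = field(default_factory=list)
    evaluator: dict[str, Any] = field(default_factory=dict)
    # Rule-based evaluators score this corpus, so the LLM judge stays off.
    use_llm_judge: bool = False


def from_osworld(raw: dict[str, Any]) -> ClawTask:
    """Map one OSWorld-style task dict onto the hypothetical ClawTask."""
    # Key names follow the issue text and are assumptions, not upstream facts.
    if "instruction" not in raw or "evaluator" not in raw:
        raise ValueError("task is missing 'instruction' or 'evaluator'")
    return ClawTask(
        task_id=str(raw.get("id", "")),
        corpus="osworld",
        prompt=raw["instruction"],
        setup=list(raw.get("setup_steps", [])),
        teardown=list(raw.get("post_steps", [])),
        evaluator=dict(raw["evaluator"]),
    )


def to_action_line(step: dict[str, Any], index: int) -> str:
    """Render one step-log entry as a single actions.jsonl line."""
    # The step-log shape is assumed; we only add a positional index.
    return json.dumps({"index": index, **step}, sort_keys=True)
```

Keeping the evaluator block as an opaque dict means the adapter does not need to understand each evaluator; it only hands the block to the passthrough scorer.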

## Why now

Browser-only agent benchmarks are converging; the *next* differentiation axis is **whether agents can leave the browser** (open a terminal, use a file manager, drive a desktop app). Hosting OSWorld under our harness lets us tell a unified "browser + OS" story without forcing users to install two separate evaluation stacks.

## Acceptance

- [ ] `clawbench run --corpus osworld --os ubuntu --limit 3 --model claude-opus-4-7` boots the VM, runs 3 tasks, produces 5-layer trace bundles
- [ ] Deterministic scores match upstream `mm_agents/eval.py` exactly on a sampled subset
- [ ] Docs: `eval/adapters/osworld.md` + VM-provisioning README
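
The score-parity criterion above could be checked with something like the following sketch. It assumes both sides can be reduced to a `{task_id: score}` mapping; `sample_task_ids` and `score_mismatches` are hypothetical helpers, not existing ClawBench or OSWorld functions.

```python
import random


def sample_task_ids(task_ids, k, seed=0):
    """Pick a reproducible subset of task ids to re-score upstream."""
    rng = random.Random(seed)
    ids = sorted(task_ids)  # sort first so the sample does not depend on input order
    return rng.sample(ids, min(k, len(ids)))


def score_mismatches(ours, upstream, task_ids):
    """Return {task_id: (ours, upstream)} for each sampled task whose scores differ."""
    mismatches = {}
    for tid in task_ids:
        a, b = ours.get(tid), upstream.get(tid)
        # Exact comparison on purpose: the criterion asks for an exact match.
        # A score missing on either side also counts as a mismatch.
        if a is None or b is None or a != b:
            mismatches[tid] = (a, b)
    return mismatches
```

An empty result from `score_mismatches` on the sampled ids would satisfy the checkbox; a non-empty one lists exactly which tasks to investigate.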

## Out of scope

- Reimplementing their VM provisioning from scratch (defer to their snapshots).
- Cross-OS execution in a single batch (one OS per `clawbench run` invocation is fine).
