feat: adapter — run WebVoyager (live-website, screenshot+LLM-judge browser agent) under the ClawBench harness

## Goal

Let the ClawBench harness load and execute tasks from **[MinorJerry/WebVoyager](https://github.com/MinorJerry/WebVoyager)** (1.1k★), the closest scope-peer of ClawBench: live websites, multimodal browser agent, screenshot+LLM-judge scoring. Paper: "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models."

Hosting their tasks under our 5-layer trace pipeline lets the community directly compare two scoring paradigms on the same agent runs: their screenshot judge vs our HTTP-interception + payload judge.

## Scope

- **Task loader**: `clawbench.corpus.adapters.webvoyager` ingests their `data/WebVoyager_data.jsonl` task list (URL + natural-language goal + reference answer) and normalizes each entry into our task schema.
- **Scoring passthrough**: wire in their screenshot-judge prompt so we can emit *both* scores (ours: interception+payload judge; theirs: screenshot judge). A side-by-side comparison surfaces what each metric catches and misses.
- **Trace bundle conformance**: capture `recording.mp4`, `actions.jsonl`, `agent-messages.jsonl`, `requests.jsonl`, `interception.json`, `run-meta.json` per task (already standard).
- **CLI**: `clawbench run --corpus webvoyager --model <m>` runs end-to-end.
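
A minimal sketch of the task-loader step, to make the normalization concrete. Everything here is an assumption for illustration: the `Task` dataclass is a hypothetical stand-in for our task schema, the upstream keys `id`, `web`, and `ques` (plus an optional `answer`) should be verified against the actual JSONL, and `references` is a hypothetical fallback mapping in case reference answers ship separately from the task list.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical normalized task record; the real ClawBench schema may differ."""
    task_id: str
    corpus: str
    start_url: str
    goal: str
    reference_answer: Optional[str] = None


def load_webvoyager_tasks(
    lines: Iterable[str],
    references: Optional[Dict[str, str]] = None,
) -> Iterator[Task]:
    """Yield one Task per non-empty JSONL line.

    Assumes each row carries `id`, `web` (start URL) and `ques` (goal).
    The reference answer is read from an assumed `answer` key if present,
    otherwise looked up by task id in `references`.
    """
    references = references or {}
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the JSONL file
        row = json.loads(raw)
        missing = [key for key in ("id", "web", "ques") if key not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        yield Task(
            task_id=row["id"],
            corpus="webvoyager",
            start_url=row["web"],
            goal=row["ques"],
            reference_answer=row.get("answer", references.get(row["id"])),
        )


# Made-up sample row, not taken from the upstream dataset.
sample = ['{"id": "Example--0", "web": "https://example.com/", "ques": "Find the contact page."}']
tasks = list(load_webvoyager_tasks(sample))
```

In the adapter, `lines` would be the opened `data/WebVoyager_data.jsonl` file handle; failing loudly on a missing key keeps upstream schema drift from silently producing empty tasks.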

## Why now

WebVoyager is one of the most-cited live-website agent benchmarks. A clean comparison ("WebVoyager screenshot-judge 59% vs ClawBench interception-judge 54.7%") would meaningfully strengthen our /compare page (#180) and the related-work narrative.

## Acceptance

- [ ] `clawbench run --corpus webvoyager --limit 5 --model claude-opus-4-7` produces a 5-layer trace bundle for each task
- [ ] Both scoring outputs appear in `run-meta.json`: `webvoyager_judge_score` and our `intercepted`/`judge_match`
- [ ] Screenshot-judge score within ±3pp of the upstream `gpt-4-1106-preview-runs.zip` reproduction on a sampled subset
- [ ] Docs: `eval/adapters/webvoyager.md` walkthrough
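
The second and third acceptance items can be sketched as a small check over parsed `run-meta.json` contents. This is only a sketch under stated assumptions: it treats `webvoyager_judge_score` as a 0/1 success flag, and the helper names `has_both_scores`, `success_rate`, and `within_tolerance` are hypothetical, not existing ClawBench functions.

```python
from typing import Dict, List

# Keys the acceptance criteria expect in every run-meta.json.
REQUIRED_KEYS = ("webvoyager_judge_score", "intercepted", "judge_match")


def has_both_scores(run_meta: Dict) -> bool:
    """True when the record carries both scoring outputs."""
    return all(key in run_meta for key in REQUIRED_KEYS)


def success_rate(run_metas: List[Dict]) -> float:
    """Percentage of runs the screenshot judge marked successful.

    Assumes `webvoyager_judge_score` is 1 for success and 0 for failure.
    """
    if not run_metas:
        raise ValueError("no runs to score")
    passed = sum(1 for meta in run_metas if meta["webvoyager_judge_score"] >= 1)
    return 100.0 * passed / len(run_metas)


def within_tolerance(ours: float, upstream: float, tol_pp: float = 3.0) -> bool:
    """True when two percentage scores differ by at most `tol_pp` percentage points."""
    return abs(ours - upstream) <= tol_pp
```

Note that on a 5-task sample each task moves the rate by 20pp, so the ±3pp comparison is only meaningful on a much larger sampled subset.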
