Goal
Let the ClawBench harness load + execute tasks from MinorJerry/WebVoyager (1.1k★) — the closest scope-peer of ClawBench: live websites, multimodal browser agent, screenshot+LLM-judge scoring. Paper: "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models."
Hosting their tasks under our 5-layer trace pipeline lets the community directly compare two scoring paradigms on the same agent runs: their screenshot judge vs our HTTP-interception + payload judge.
Scope
- Task loader: clawbench.corpus.adapters.webvoyager ingests their data/WebVoyager_data.jsonl task list (URL + natural-language goal + reference answer); normalizes into our task schema.
- Scoring passthrough: wire to their screenshot-judge prompt so we can emit both scores (ours: interception+payload judge; theirs: screenshot judge). Side-by-side comparison surfaces what each metric catches/misses.
- Trace bundle conformance: capture recording.mp4, actions.jsonl, agent-messages.jsonl, requests.jsonl, interception.json, run-meta.json per task (already standard).
- CLI: clawbench run --corpus webvoyager --model <m> end-to-end.
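A minimal sketch of what the task-loader step might look like, for discussion only. The Task dataclass is a hypothetical stand-in for our task schema, and the JSONL field names (id, web_name, web, ques) are assumptions about the upstream file that should be checked against the WebVoyager repo. The reference answer is treated as optional, since where it is stored is not settled here.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical normalized task record; the real ClawBench schema may differ."""
    task_id: str
    site: str
    start_url: str
    goal: str
    reference_answer: Optional[str] = None


def parse_tasks(lines: Iterable[str]) -> Iterator[Task]:
    """Yield one Task per non-blank JSONL line.

    Field names below are assumed, not confirmed against the upstream data.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        try:
            task = Task(
                task_id=row["id"],
                site=row.get("web_name", ""),
                start_url=row["web"],
                goal=row["ques"],
                reference_answer=row.get("reference_answer"),
            )
        except KeyError as exc:
            raise ValueError(f"line {line_no}: missing field {exc}") from exc
        yield task


def load_tasks(path: str, limit: Optional[int] = None) -> list:
    """Read a JSONL task file and return at most `limit` normalized tasks."""
    with open(path, encoding="utf-8") as fh:
        tasks = list(parse_tasks(fh))
    return tasks if limit is None else tasks[:limit]
```

Keeping parse_tasks separate from file I/O means the normalization can be unit-tested on in-memory lines, and a --limit style option only needs to slice the result.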
Why now
WebVoyager is the most-cited live-website agent benchmark. A clean comparison ("WebVoyager screenshot-judge 59% vs ClawBench interception-judge 54.7%") would meaningfully strengthen our /compare page (#180) and the related-work narrative.
Acceptance
- clawbench run --corpus webvoyager --limit 5 --model claude-opus-4-7 produces 5-layer trace bundles per task
- run-meta.json: webvoyager_judge_score and our intercepted/judge_match
- gpt-4-1106-preview-runs.zip reproduction on a sampled subset
- eval/adapters/webvoyager.md walkthrough
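The trace-bundle and run-meta.json criteria could be checked mechanically with something like the sketch below. It assumes each task's bundle is a flat directory, and that webvoyager_judge_score, intercepted, and judge_match are top-level keys in run-meta.json. Both points are guesses about the layout, and check_bundle is a hypothetical helper, not an existing ClawBench function.

```python
import json
from pathlib import Path

# File names taken from the trace bundle list in this issue.
REQUIRED_FILES = (
    "recording.mp4",
    "actions.jsonl",
    "agent-messages.jsonl",
    "requests.jsonl",
    "interception.json",
    "run-meta.json",
)

# Assumed to be top-level keys of run-meta.json; the real layout may nest them.
REQUIRED_META_KEYS = ("webvoyager_judge_score", "intercepted", "judge_match")


def check_bundle(bundle_dir) -> list:
    """Return a list of human-readable problems; an empty list means conformant."""
    bundle = Path(bundle_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (bundle / name).is_file()
    ]

    meta_path = bundle / "run-meta.json"
    if meta_path.is_file():
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"run-meta.json is not valid JSON: {exc}")
        else:
            if not isinstance(meta, dict):
                problems.append("run-meta.json is not a JSON object")
            else:
                problems.extend(
                    f"run-meta.json missing key: {key}"
                    for key in REQUIRED_META_KEYS
                    if key not in meta
                )
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every gap in a 5-task smoke run at once.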