
feat: adapter — run WebVoyager (live-website, screenshot+LLM-judge browser agent) under the ClawBench harness #190

@reacher-z


Goal

Let the ClawBench harness load and execute tasks from MinorJerry/WebVoyager (1.1k★), the closest scope peer of ClawBench: live websites, a multimodal browser agent, and screenshot + LLM-judge scoring. Paper: "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models."

Hosting their tasks under our 5-layer trace pipeline lets the community directly compare two scoring paradigms on the same agent runs: their screenshot judge vs our HTTP-interception + payload judge.
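
To make the comparison concrete, here is a minimal sketch of a per-run agreement tally between the two judges. It assumes each run record carries one truthy/falsy verdict per judge; the function name agreement_table and the default key names are illustrative, not existing ClawBench API.

```python
from collections import Counter


def agreement_table(runs, theirs_key="webvoyager_judge_score", ours_key="judge_match"):
    """Count runs by (screenshot-judge verdict, interception-judge verdict).

    Both key names are assumptions; adjust them to whatever run-meta.json
    actually stores.
    """
    table = Counter()
    for run in runs:
        table[(bool(run[theirs_key]), bool(run[ours_key]))] += 1
    return table


runs = [
    {"webvoyager_judge_score": 1, "judge_match": True},
    {"webvoyager_judge_score": 1, "judge_match": False},
    {"webvoyager_judge_score": 0, "judge_match": False},
]
table = agreement_table(runs)
# The off-diagonal cells, (True, False) and (False, True), are the runs
# where the two judges disagree and are the ones worth inspecting by hand.
```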

Scope

  • Task loader: clawbench.corpus.adapters.webvoyager ingests their data/WebVoyager_data.jsonl task list (URL + natural-language goal + reference answer) and normalizes it into our task schema.
  • Scoring passthrough: wire in their screenshot-judge prompt so we can emit both scores (ours: interception + payload judge; theirs: screenshot judge). A side-by-side comparison surfaces what each metric catches or misses.
  • Trace bundle conformance: capture recording.mp4, actions.jsonl, agent-messages.jsonl, requests.jsonl, interception.json, run-meta.json per task (already standard).
  • CLI: clawbench run --corpus webvoyager --model <m> works end-to-end.
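
A rough sketch of what the task loader could look like. Everything here is an assumption to be checked against the upstream repo and our schema: the Task dataclass is a hypothetical stand-in for the ClawBench task schema, the JSONL field names (id, web, ques, answer) need verifying against data/WebVoyager_data.jsonl, and the optional reference_answers mapping covers the case where reference answers ship in a separate file.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for the ClawBench task schema."""
    task_id: str
    start_url: str
    goal: str
    reference_answer: Optional[str] = None
    corpus: str = "webvoyager"


def load_webvoyager_tasks(
    lines: Iterable[str],
    reference_answers: Optional[dict] = None,
) -> Iterator[Task]:
    """Normalize WebVoyager JSONL lines into Task objects.

    Field names ("id", "web", "ques", "answer") are assumed, not confirmed.
    """
    refs = reference_answers or {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(raw)
        task_id = record["id"]
        yield Task(
            task_id=task_id,
            start_url=record["web"],
            goal=record["ques"],
            # Prefer an inline answer; fall back to the separate mapping.
            reference_answer=record.get("answer", refs.get(task_id)),
        )


sample_lines = [
    '{"id": "Example--0", "web": "https://example.com/", "ques": "Find X"}',
    "",
    '{"id": "Example--1", "web": "https://example.com/", "ques": "Find Y", "answer": "Z"}',
]
tasks = list(load_webvoyager_tasks(sample_lines, {"Example--0": "ref"}))
```

Taking an iterable of lines rather than a path keeps the loader easy to unit-test; the real adapter would pass an open file handle.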

Why now

WebVoyager is the most-cited live-website agent benchmark. A clean comparison ("WebVoyager screenshot-judge 59% vs ClawBench interception-judge 54.7%") would meaningfully strengthen our /compare page (#180) and the related-work narrative.

Acceptance

  • clawbench run --corpus webvoyager --limit 5 --model claude-opus-4-7 produces 5-layer trace bundles per task
  • Both scoring outputs in run-meta.json: webvoyager_judge_score and our intercepted/judge_match
  • Score within ±3pp of the upstream gpt-4-1106-preview-runs.zip reproduction on a sampled subset
  • Docs: eval/adapters/webvoyager.md walkthrough
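
The ±3pp criterion could be checked with something like the sketch below. It assumes webvoyager_judge_score is stored per task as 0/1 (or a boolean) in run-meta.json and that the upstream success rate for the same sampled subset is computed separately from their published runs; the helper names are illustrative.

```python
def success_rate_pct(run_metas, key="webvoyager_judge_score"):
    """Mean of per-task 0/1 scores, expressed as a percentage."""
    scores = [float(meta[key]) for meta in run_metas]
    if not scores:
        raise ValueError("no runs to score")
    return 100.0 * sum(scores) / len(scores)


def within_tolerance(ours_pct, upstream_pct, tolerance_pp=3.0):
    """True when the two rates differ by at most tolerance_pp percentage points."""
    return abs(ours_pct - upstream_pct) <= tolerance_pp


# Illustrative data only: four runs, three judged successful.
metas = [
    {"webvoyager_judge_score": 1},
    {"webvoyager_judge_score": 0},
    {"webvoyager_judge_score": 1},
    {"webvoyager_judge_score": 1},
]
ours = success_rate_pct(metas)
```

Note that with --limit 5 a single task flips the rate by 20pp, so the ±3pp check only makes sense on a sampled subset large enough for that granularity.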
