feat: adapter — run AssistantBench (realistic time-consuming live-web tasks) under the ClawBench harness #188

@reacher-z

Goal

Let the ClawBench harness load and execute tasks from oriyor/assistantbench ("Can Web Agents Solve Realistic and Time-Consuming Tasks?"). These are live-web tasks intentionally chosen to be long-horizon (e.g. multi-step research, comparison shopping). The HF dataset is at AssistantBench/AssistantBench.

Scope

  • Task loader: clawbench.corpus.adapters.assistantbench ingests their HF dataset and normalizes their (task, answer_type, gold_answer, reference_links) schema into our task format.
  • Answer grading: their scoring is answer-string-based (with an answer-type rubric for numeric/date/list answers), not HTTP-interception-based. Surface their grader as a third stage in run-meta.json (after our Stage 1 + Stage 2) so adopters who care about answer-correctness alongside endpoint-correctness see both.
  • Time-limit handling: their tasks are designed to be long-horizon (many minutes). Honor each task's recommended max_steps instead of applying our default 30-min wallclock limit.
  • CLI: clawbench run --corpus assistantbench --model <m> works end-to-end.
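
As a rough illustration of the task-loader bullet, the normalization step might look like the sketch below. Everything here is an assumption for discussion: `ClawTask`, `normalize_record`, the `task_id` format, and the `corpus` field are hypothetical names, and the record keys are taken from the schema tuple above, which may not match the actual HF column names.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ClawTask:
    """Hypothetical stand-in for the ClawBench task format."""
    task_id: str
    prompt: str
    answer_type: str
    gold_answer: str
    reference_links: list[str] = field(default_factory=list)
    corpus: str = "assistantbench"


def normalize_record(record: dict[str, Any], index: int) -> ClawTask:
    """Map one AssistantBench-style record onto the hypothetical ClawTask.

    Assumes the keys (task, answer_type, gold_answer, reference_links)
    named in this issue; adjust if the dataset uses different columns.
    """
    links = record.get("reference_links") or []
    if isinstance(links, str):
        # Assumption: a bare string is treated as a single link.
        links = [links]
    return ClawTask(
        task_id=f"assistantbench-{index:04d}",  # hypothetical ID scheme
        prompt=record["task"].strip(),
        answer_type=record.get("answer_type", "string"),
        gold_answer=str(record["gold_answer"]),
        reference_links=list(links),
    )
```

Keeping the gold answer as a string leaves type-specific parsing (numeric/date/list) to the grading stage, so the loader does not duplicate the upstream rubric.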

Why now

AssistantBench targets long-horizon live web research — a slice ClawBench V1+V2 doesn't directly cover. Hosting it under our harness extends the difficulty curve (most V2 tasks are ≤5 min; AssistantBench tasks routinely exceed 15 min) and gives us a third independent metric (answer-correctness vs interception vs LLM payload judge).

Acceptance

  • clawbench run --corpus assistantbench --limit 5 --model <m> produces 5-layer bundles
  • run-meta.json includes assistantbench_answer_score alongside our intercepted/judge_match
  • Match upstream eval/eval_answer.py outputs within ±2pp on a sampled subset
  • Docs: eval/adapters/assistantbench.md
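
One possible reading of the ±2pp criterion is that the mean score over the sampled subset, expressed in percentage points, should differ from the upstream mean by at most 2. The sketch below checks that under this assumption. `mean_score_pp` and `within_tolerance` are hypothetical helpers, and per-task scores are assumed to lie in [0, 1]. A per-task comparison would be a stricter alternative.

```python
from typing import Sequence


def mean_score_pp(scores: Sequence[float]) -> float:
    """Mean of per-task scores in [0, 1], expressed in percentage points."""
    return 100.0 * sum(scores) / len(scores)


def within_tolerance(
    ours: Sequence[float],
    upstream: Sequence[float],
    tol_pp: float = 2.0,
) -> bool:
    """Return True if the two aggregate scores differ by at most tol_pp.

    Both sequences must cover the same sampled tasks in the same order.
    """
    if not ours or len(ours) != len(upstream):
        raise ValueError("score lists must be non-empty and the same length")
    return abs(mean_score_pp(ours) - mean_score_pp(upstream)) <= tol_pp
```

For example, `within_tolerance([1.0, 0.5, 0.0, 1.0], [1.0, 0.5, 0.0, 0.95])` compares 62.5 against 61.25 and passes, while a 5-point gap would fail.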

Labels: enhancement (New feature or request)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions