Goal
Let the ClawBench harness load and execute tasks from oriyor/assistantbench ("Can Web Agents Solve Realistic and Time-Consuming Tasks?"). These are live-web tasks intentionally chosen to be long-horizon (e.g. multi-step research, comparison shopping); the HF dataset is at AssistantBench/AssistantBench.
Scope
- Task loader: clawbench.corpus.adapters.assistantbench ingests their HF dataset; normalizes their (task, answer_type, gold_answer, reference_links) schema into our task format.
- Answer-grading: their scoring is answer-string-based (with answer-type rubric for numeric/date/list), not HTTP-interception. Surface their grader as a third stage in run-meta.json (after our Stage 1 + Stage 2) so adopters who care about answer-correctness alongside endpoint-correctness see both.
- Time-limit handling: their tasks are designed long-horizon (many minutes). Honor each task's recommended max_steps instead of our default 30-min wallclock.
- CLI: clawbench run --corpus assistantbench --model <m> end-to-end.
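A minimal sketch of the normalization step in the task-loader item, to make the shape of the work concrete. The source field names (task, answer_type, gold_answer, reference_links, max_steps) are copied from this issue and have not been checked against the published dataset. ClawTask, normalize_record, and the task_id format are hypothetical names for illustration, not existing ClawBench APIs.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple

# The 30-min wallclock default mentioned in the Scope list, in seconds.
DEFAULT_WALLCLOCK_S = 30 * 60


@dataclass(frozen=True)
class ClawTask:
    """Hypothetical internal task record; the real task format may differ."""
    task_id: str
    prompt: str
    answer_type: str
    gold_answer: Any
    reference_links: Tuple[str, ...] = ()
    max_steps: Optional[int] = None
    wallclock_s: Optional[int] = DEFAULT_WALLCLOCK_S


def normalize_record(record: Mapping[str, Any], index: int) -> ClawTask:
    """Map one AssistantBench-style record (assumed schema) to a ClawTask."""
    required = ("task", "answer_type", "gold_answer")
    missing = [key for key in required if key not in record]
    if missing:
        raise KeyError(f"record {index} is missing fields: {missing}")

    max_steps = record.get("max_steps")
    return ClawTask(
        task_id=f"assistantbench-{index:04d}",
        prompt=str(record["task"]).strip(),
        answer_type=str(record["answer_type"]).lower(),
        gold_answer=record["gold_answer"],
        reference_links=tuple(record.get("reference_links") or ()),
        max_steps=int(max_steps) if max_steps is not None else None,
        # A per-task step budget, when present, replaces the wallclock default.
        wallclock_s=None if max_steps is not None else DEFAULT_WALLCLOCK_S,
    )
```

Operating on plain mappings keeps the sketch independent of how the HF dataset is fetched; a loader could feed it rows from any iterable of dicts.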
Why now
AssistantBench targets long-horizon live web research — a slice ClawBench V1+V2 doesn't directly cover. Hosting it under our harness extends the difficulty curve (most V2 tasks are ≤5 min; AssistantBench tasks routinely exceed 15 min) and gives us a third independent metric (answer-correctness vs interception vs LLM payload judge).
Acceptance
- clawbench run --corpus assistantbench --limit 5 --model <m> produces 5-layer bundles
- run-meta.json includes assistantbench_answer_score alongside our intercepted / judge_match
- assistantbench_answer_score matches their eval/eval_answer.py outputs within ±2pp on a sampled subset
- eval/adapters/assistantbench.md
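The two score-related criteria could be checked along these lines. This is a sketch under stated assumptions: run-meta.json is treated as a flat mapping, scores are assumed to lie in [0, 1], and "±2pp" is read as the gap between mean scores on the sampled subset, expressed in percentage points. add_answer_score and within_pp_tolerance are hypothetical helper names.

```python
from statistics import mean
from typing import Any, Dict, Sequence


def add_answer_score(run_meta: Dict[str, Any], score: float) -> Dict[str, Any]:
    """Return a copy of run_meta with the third-stage answer score attached."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    merged = dict(run_meta)  # Stage 1 / Stage 2 fields are left untouched
    merged["assistantbench_answer_score"] = score
    return merged


def within_pp_tolerance(
    ours: Sequence[float],
    reference: Sequence[float],
    tolerance_pp: float = 2.0,
) -> bool:
    """Compare mean scores on the same sampled tasks, in percentage points."""
    if not ours or len(ours) != len(reference):
        raise ValueError("score lists must be non-empty and the same length")
    gap_pp = abs(mean(ours) - mean(reference)) * 100.0
    return gap_pp <= tolerance_pp
```

For example, add_answer_score({"intercepted": True, "judge_match": False}, 0.75) returns a mapping with all three keys, and within_pp_tolerance takes our per-task scores and the reference grader's scores for the same tasks. If the intended reading of ±2pp is per-task rather than aggregate, the comparison would need to change accordingly.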