feat: adapter — run AssistantBench (realistic time-consuming live-web tasks) under the ClawBench harness

## Goal

Let the ClawBench harness load and execute tasks from **[oriyor/assistantbench](https://github.com/oriyor/assistantbench)**, the benchmark behind "Can Web Agents Solve Realistic and Time-Consuming Tasks?". Its live-web tasks are intentionally long-horizon (e.g. multi-step research, comparison shopping). The HF dataset is at [AssistantBench/AssistantBench](https://huggingface.co/datasets/AssistantBench/AssistantBench).

## Scope

- **Task loader**: `clawbench.corpus.adapters.assistantbench` ingests the upstream HF dataset and normalizes its `(task, answer_type, gold_answer, reference_links)` schema into our task format.
- **Answer-grading**: upstream scoring is answer-string-based (with an answer-type rubric for numeric/date/list answers), not HTTP-interception-based. Surface the upstream grader as a third stage in `run-meta.json` (after our Stage 1 + Stage 2) so adopters who care about answer-correctness alongside endpoint-correctness see both.
- **Time-limit handling**: upstream tasks are long-horizon by design (many minutes each). Honor each task's recommended `max_steps` instead of our default 30-min wallclock limit.
- **CLI**: `clawbench run --corpus assistantbench --model <m>` runs end-to-end.
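
As a rough illustration of the loader and `run-meta.json` bullets above, here is a minimal sketch. `ClawTask`, `normalize_record`, `attach_answer_score`, the `task_id` format, and the idea that a record may carry a `max_steps` key are all assumptions made for illustration. Only the four schema fields and the `assistantbench_answer_score` key come from this issue, and the real ClawBench task format may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

DEFAULT_WALLCLOCK_S = 30 * 60  # the harness's default 30-min wallclock


@dataclass
class ClawTask:
    """Hypothetical stand-in for the ClawBench task format."""
    task_id: str
    prompt: str
    answer_type: str
    gold_answer: str
    reference_links: list = field(default_factory=list)
    max_steps: Optional[int] = None
    wallclock_s: Optional[int] = DEFAULT_WALLCLOCK_S


def normalize_record(idx: int, rec: dict) -> ClawTask:
    """Map one upstream record onto the (assumed) ClawBench task shape."""
    max_steps = rec.get("max_steps")  # assumption: step budget may be absent
    return ClawTask(
        task_id=f"assistantbench-{idx:04d}",  # hypothetical id scheme
        prompt=rec["task"].strip(),
        answer_type=rec["answer_type"],
        gold_answer=str(rec["gold_answer"]),
        reference_links=list(rec.get("reference_links") or []),
        max_steps=int(max_steps) if max_steps is not None else None,
        # A step budget, when present, replaces the wallclock default.
        wallclock_s=None if max_steps is not None else DEFAULT_WALLCLOCK_S,
    )


def attach_answer_score(run_meta: dict, score: float) -> dict:
    """Return a copy of run-meta with the third-stage score added."""
    out = dict(run_meta)
    out["assistantbench_answer_score"] = score
    return out
```

Returning a copy from `attach_answer_score` keeps the Stage 1 and Stage 2 fields untouched, so the answer score is purely additive for existing consumers of `run-meta.json`.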

## Why now

AssistantBench targets *long-horizon* live web research, a slice ClawBench V1+V2 doesn't directly cover. Hosting it under our harness extends the difficulty curve (most V2 tasks are ≤5 min; AssistantBench tasks routinely exceed 15 min) and gives us a third independent metric: answer-correctness, alongside interception and the LLM payload judge.

## Acceptance

- [ ] `clawbench run --corpus assistantbench --limit 5 --model <m>` produces 5-layer bundles
- [ ] `run-meta.json` includes `assistantbench_answer_score` alongside our `intercepted`/`judge_match`
- [ ] Answer scores match upstream `eval/eval_answer.py` outputs within ±2pp on a sampled subset
- [ ] Docs: `eval/adapters/assistantbench.md`
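
One possible reading of the ±2pp criterion above is a comparison of mean answer scores over the same sampled tasks. The sketch below assumes that reading, and assumes per-task scores lie in [0, 1]. The function names are hypothetical, and a per-task comparison may be what is actually intended.

```python
from statistics import mean


def agreement_gap_pp(ours, upstream):
    """Signed gap between mean scores, in percentage points."""
    if not ours or len(ours) != len(upstream):
        raise ValueError("score lists must be non-empty and the same length")
    return (mean(ours) - mean(upstream)) * 100.0


def within_tolerance(ours, upstream, tol_pp=2.0):
    """True when the adapter's mean score is within tol_pp of upstream's."""
    return abs(agreement_gap_pp(ours, upstream)) <= tol_pp
```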
