## Overview
Integrate AstaBench from AllenAI to evaluate and benchmark agent performance.
## Why AstaBench
- Open-source agent evaluation framework (built on InspectAI)
- Not limited to scientific research: it works for any agent
- Provides baseline comparisons
- Tracks cost efficiency alongside quality
## GitHub Repos
- `allenai/asta-bench` - Main evaluation framework
- `allenai/agent-baselines` - Reference implementations
- `allenai/agent-eval` - Eval toolkit with cost tracking
## Custom Benchmarks to Create
| Benchmark | What It Measures |
|---|---|
| Handoff Latency | Time from Senku delegation → Yusuke completion |
| Context Preservation | Does handoff lose important context? |
| Parallel Efficiency | Do 3 agents finish 3x faster than 1? |
| Error Recovery | Can agents recover when one fails? |
| Cost Efficiency | Tokens used vs task complexity |
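The first two rows can be sketched as a plain-Python harness. This is an illustration only: `senku_delegate` and `yusuke_complete` are hypothetical stand-ins for the real agents, and an actual benchmark would wrap the delegation calls in AstaBench/InspectAI solvers.

```python
import time

# Hypothetical stand-ins for the real agents; the actual benchmark
# would route these calls through AstaBench/InspectAI.
def senku_delegate(task: str) -> dict:
    # Senku packages the task plus context for handoff.
    return {"task": task, "context": {"repo": "asta-bench", "priority": "high"}}

def yusuke_complete(handoff: dict) -> dict:
    # Yusuke completes the task; a lossy handoff would drop context keys.
    return {"result": f"done: {handoff['task']}", "context": handoff["context"]}

def measure_handoff(task: str) -> tuple[float, bool]:
    """Return (latency_seconds, context_preserved) for one delegation."""
    start = time.perf_counter()
    handoff = senku_delegate(task)
    sent_context = dict(handoff["context"])
    outcome = yusuke_complete(handoff)
    latency = time.perf_counter() - start
    # Context preservation: every key/value sent must survive the handoff.
    preserved = all(outcome["context"].get(k) == v for k, v in sent_context.items())
    return latency, preserved

latency, preserved = measure_handoff("summarize eval results")
print(f"latency={latency:.4f}s preserved={preserved}")
```

The same shape extends to the parallel-efficiency row: run N such measurements concurrently and compare wall-clock time against N sequential runs.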
## Acceptance Criteria
- Install AstaBench: `pip install astabench`
- Create custom benchmark suite for handoffs
- Create benchmark for context preservation
- Create benchmark for parallel efficiency
- Run baseline evaluation on current agent setup
- Document results and improvement areas
- Integrate with CI for regression testing
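For the CI regression criterion, one option is a small gate script that compares current benchmark scores against recorded baselines and fails the build on a drop. The results dict and baseline numbers below are hypothetical; the real astabench output schema may differ.

```python
# Hypothetical baseline scores recorded from a prior run; the real
# astabench results format may differ from this flat dict.
BASELINE = {"handoff-latency": 0.80, "context-preservation": 0.95}

def check_regression(results: dict, baseline: dict, tolerance: float = 0.02) -> list[str]:
    """Return benchmarks whose score fell below baseline minus tolerance."""
    return [
        name
        for name, floor in baseline.items()
        if results.get(name, 0.0) < floor - tolerance
    ]

# Example: context-preservation dropped from 0.95 to 0.90, beyond tolerance.
current = {"handoff-latency": 0.82, "context-preservation": 0.90}
failures = check_regression(current, BASELINE)
if failures:
    print(f"regression in: {', '.join(failures)}")
```

In CI this would exit nonzero when `failures` is non-empty, blocking the merge until the benchmark recovers or the baseline is deliberately updated.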
## Example Usage
```shell
# Run evaluation
astabench eval --agent senku --benchmark handoff-latency

# View results
astabench view results/
```
## Assigned Agent
@quinn - Testing and evaluation are your domain
## Related
- Langfuse integration (for real-time monitoring)
- NATS KV (for memory that benchmarks test)