## Overview
Integrate AstaBench from AllenAI to evaluate and benchmark agent performance.
## Why AstaBench
- Open-source agent evaluation framework (built on InspectAI)
- Not limited to scientific research: it works for any agent
- Provides baseline comparisons
- Tracks cost efficiency alongside quality
## GitHub Repos
- `allenai/asta-bench` - Main evaluation framework
- `allenai/agent-baselines` - Reference implementations
- `allenai/agent-eval` - Eval toolkit with cost tracking
## Custom Benchmarks to Create
| Benchmark | What It Measures |
|---|---|
| Handoff Latency | Time from Senku delegation → Yusuke completion |
| Context Preservation | Does handoff lose important context? |
| Parallel Efficiency | Do 3 agents finish 3x faster than 1? |
| Error Recovery | Can agents recover when one fails? |
| Cost Efficiency | Tokens used vs task complexity |
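The first two rows can be sketched as a plain-Python harness. This is an illustration only: `senku_delegate` and `yusuke_complete` are hypothetical stand-ins for the real agents, and an actual benchmark would wrap the delegation calls in AstaBench/InspectAI solvers.

```python
import time

# Hypothetical stand-ins for the real agents; the actual benchmark
# would route these calls through AstaBench/InspectAI.
def senku_delegate(task: str) -> dict:
    # Senku packages the task plus context for handoff.
    return {"task": task, "context": {"repo": "asta-bench", "priority": "high"}}

def yusuke_complete(handoff: dict) -> dict:
    # Yusuke completes the task; a lossy handoff would drop context keys.
    return {"result": f"done: {handoff['task']}", "context": handoff["context"]}

def measure_handoff(task: str) -> tuple[float, bool]:
    """Return (latency_seconds, context_preserved) for one delegation."""
    start = time.perf_counter()
    handoff = senku_delegate(task)
    sent_context = dict(handoff["context"])
    outcome = yusuke_complete(handoff)
    latency = time.perf_counter() - start
    # Context preservation: every key/value sent must survive the handoff.
    preserved = all(outcome["context"].get(k) == v for k, v in sent_context.items())
    return latency, preserved

latency, preserved = measure_handoff("summarize eval results")
print(f"latency={latency:.4f}s preserved={preserved}")
```

The same shape extends to the parallel-efficiency row: run N such measurements concurrently and compare wall-clock time against N sequential runs.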
## Acceptance Criteria
- Install AstaBench: `pip install astabench`
- Create custom benchmark suite for handoffs
- Create benchmark for context preservation
- Create benchmark for parallel efficiency
- Run baseline evaluation on current agent setup
- Document results and improvement areas
- Integrate with CI for regression testing
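For the CI regression criterion, one option is a small gate script that compares current benchmark scores against recorded baselines and fails the build on a drop. The results dict and baseline numbers below are hypothetical; the real astabench output schema may differ.

```python
# Hypothetical baseline scores recorded from a prior run; the real
# astabench results format may differ from this flat dict.
BASELINE = {"handoff-latency": 0.80, "context-preservation": 0.95}

def check_regression(results: dict, baseline: dict, tolerance: float = 0.02) -> list[str]:
    """Return benchmarks whose score fell below baseline minus tolerance."""
    return [
        name
        for name, floor in baseline.items()
        if results.get(name, 0.0) < floor - tolerance
    ]

# Example: context-preservation dropped from 0.95 to 0.90, beyond tolerance.
current = {"handoff-latency": 0.82, "context-preservation": 0.90}
failures = check_regression(current, BASELINE)
if failures:
    print(f"regression in: {', '.join(failures)}")
```

In CI this would exit nonzero when `failures` is non-empty, blocking the merge until the benchmark recovers or the baseline is deliberately updated.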
## Example Usage
```shell
# Run evaluation
astabench eval --agent senku --benchmark handoff-latency

# View results
astabench view results/
```
## Assigned Agent
@quinn - Testing and evaluation are your domain
## Related
- Langfuse integration (for real-time monitoring)
- NATS KV (for memory that benchmarks test)