
Integrate AstaBench for Agent Evaluation #12

@wonderwomancode


Overview

Integrate AstaBench from AllenAI to evaluate and benchmark agent performance.

Why AstaBench

  • Open-source agent evaluation framework (built on InspectAI)
  • Not limited to scientific research; it works for any agent
  • Provides baseline comparisons
  • Tracks cost efficiency alongside quality


Custom Benchmarks to Create

Benchmark              What It Measures
---------------------  ----------------------------------------------
Handoff Latency        Time from Senku delegation → Yusuke completion
Context Preservation   Does the handoff lose important context?
Parallel Efficiency    Do 3 agents finish 3x faster than 1?
Error Recovery         Can agents recover when one fails?
Cost Efficiency        Tokens used vs. task complexity
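
As a starting point, here is a minimal sketch of the Context Preservation benchmark, assuming AstaBench tasks follow the underlying InspectAI Task API (Task/Sample/solver/scorer). The sample text, the target fact, and the Senku → Yusuke framing are placeholders for this issue, not a final dataset:

# Sketch only: assumes AstaBench exposes the InspectAI Task API.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def context_preservation() -> Task:
    # Each sample briefs the delegating agent with a key fact; the target
    # is the fact that must survive the handoff to the receiving agent.
    samples = [
        Sample(
            input=(
                "Senku delegates to Yusuke: 'Summarize the deploy plan. "
                "Hard constraint: rollout must stay under 500ms p99.'"
            ),
            target="500ms",
        ),
    ]
    return Task(
        dataset=samples,
        solver=generate(),   # stand-in for the real handoff pipeline
        scorer=includes(),   # passes if the key fact survives verbatim
    )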

Acceptance Criteria

  • Install AstaBench: pip install astabench
  • Create custom benchmark suite for handoffs
  • Create benchmark for context preservation
  • Create benchmark for parallel efficiency
  • Run baseline evaluation on current agent setup
  • Document results and improvement areas
  • Integrate with CI for regression testing (see the regression-test sketch below)
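
For the CI criterion, one option is to gate merges on a pytest check that runs the suite programmatically. This is a hedged sketch: it assumes InspectAI's eval() API is available through the AstaBench install, and the model string and 0.9 threshold are placeholders to be replaced by our measured baseline:

# Sketch of a CI regression gate; context_preservation() is the task
# defined above. Model name and threshold are placeholders.
from inspect_ai import eval as run_eval

def test_context_preservation_regression():
    log = run_eval(context_preservation(), model="openai/gpt-4o")[0]
    accuracy = log.results.scores[0].metrics["accuracy"].value
    assert accuracy >= 0.9  # fail the build if quality drops below baseline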

Example Usage

# Run evaluation
astabench eval --agent senku --benchmark handoff-latency

# View results
astabench view results/
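
The handoff-latency benchmark needs timing around the agent pipeline itself. One possible shape, using InspectAI's solver-composition pattern; the wrapper name and metadata key are hypothetical, and AstaBench may already record timing, in which case this is unnecessary:

# Hypothetical timing wrapper around the handoff pipeline.
import time
from inspect_ai.solver import Generate, Solver, TaskState, solver

@solver
def timed(inner: Solver) -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        start = time.perf_counter()
        state = await inner(state, generate)
        # Record delegation -> completion wall-clock time for reporting.
        state.metadata["handoff_latency_s"] = time.perf_counter() - start
        return state
    return solve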

Assigned Agent

@quinn - Testing and evaluation are your domain

Related

  • Langfuse integration (for real-time monitoring)
  • NATS KV (for the agent memory these benchmarks exercise)
