Skip to content

Latest commit

 

History

History
377 lines (290 loc) · 11.7 KB

File metadata and controls

377 lines (290 loc) · 11.7 KB

Python API

The Rollout/Scene API is the primary way to run agent benchmarks programmatically.

Install

uv tool install benchflow

Quick Start

import asyncio
import benchflow as bf

result = asyncio.run(bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview"))

print(f"Reward: {result.rewards}")
print(f"Tool calls: {result.n_tool_calls}")

Core Types

RolloutConfig

Declarative configuration for a rollout — a sequence of Scenes in a shared sandbox.

from pathlib import Path
from benchflow import RolloutConfig, Scene, Role, Turn

# Single-agent (simplest)
config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    environment="daytona",
    sandbox_setup_timeout=120,
)

# Multi-scene BYOS (skill-gen → solve)
config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="prep", roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Generate a skill for this task...")]),
        Scene(name="solve", roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
    sandbox_setup_timeout=120,
)

Set sandbox_setup_timeout when sandbox user setup needs more than the default 120 seconds. The same field is also available on JobConfig and RuntimeConfig.

Scene

One interaction region — roles take turns executing prompts.

# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")

# Multi-role with turn order (coder-reviewer pattern)
# Agents communicate via outbox: write /app/.outbox/{recipient}.json
# Scheduler reads outbox after each turn, injects into next role's prompt
scene = Scene(
    name="coder-reviewer",
    roles=[
        Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
        Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
    ],
    turns=[
        Turn("coder"),                    # None prompt = instruction.md
        Turn("reviewer", "Review the code. Write feedback to "
             '/app/.outbox/coder.json as {"to":"coder","content":"..."}'),
        Turn("coder", "Fix the issues."), # reviewer's feedback auto-injected
    ],
)

Rollout

The execution engine — decomposed into independently-callable phases.

from benchflow import Rollout

rollout = await Rollout.create(config)

# Full lifecycle (most common)
result = await rollout.run()

# Manual composition (for custom flows)
await rollout.setup()
await rollout.start()
await rollout.install_agent()
await rollout.connect()
await rollout.execute(prompts=["custom prompt"])
await rollout.disconnect()
await rollout.verify()
await rollout.cleanup()

RuntimeConfig

Runtime-level configuration for the Agent + Environment execution path.

from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig

config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", sandbox="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()

bf.run()

Convenience function — multiple calling conventions:

import benchflow as bf

# 1. RolloutConfig (full control)
result = await bf.run(config)

# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", sandbox="daytona")
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)

# 3. String shortcut (simplest)
result = await bf.run(
    "gemini",
    task_path="tasks/X",
    model="gemini-3.1-flash-lite-preview",
    config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)

Rollout Lifecycle

Rollout.run()
  │
  ├─ setup()          — resolve config, create env object
  ├─ start()          — spin up sandbox, upload task files, start services
  ├─ install_agent()  — install agent binary, credentials, sandbox user
  │                    (sandbox user setup: create non-root user, prepare
  │                     small config/auth dirs, chown the workspace — no
  │                     recursive copy of /root tool trees; agent binaries
  │                     must live on shared prefixes like /usr/local/bin)
  ├─ for scene in scenes:
  │    └─ _run_scene(scene)
  │         ├─ setup /app/.outbox/ — (multi-role scenes only)
  │         └─ for turn in scene.turns:
  │              ├─ read outbox     — inject messages into prompt
  │              ├─ connect_as(role) — open ACP session for this role
  │              ├─ execute(prompts) — send prompts, collect trajectory
  │              └─ disconnect()    — kill agent process, clean up
  ├─ verify()         — run verifier, collect rewards
  └─ cleanup()        — stop sandbox

Key: disconnect() kills the agent process between scenes to prevent context bleed. Each scene gets a fresh agent session.

Multi-Turn vs Multi-Round

Pattern Roles Turns Communication Example
Single-turn 1 1 Baseline benchmark
Multi-turn 1 2+ Same session, sequential prompts Self-review
Multi-round 2+ 2+ Outbox files between roles Coder + Reviewer

Multi-turn = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.

Multi-round = different agents exchange turns. Use when tasks need multiple perspectives (code review, client-advisor). The scheduler reads outbox files and injects messages.

Both use the same API — RolloutConfig with different Scene configurations.

Multi-Agent Patterns

Coder + Reviewer (followup-bench)

config = RolloutConfig(
    task_path=task_path,
    scenes=[Scene(
        roles=[Role("coder", "gemini", "flash"), Role("reviewer", "gemini", "flash")],
        turns=[
            Turn("coder"),
            Turn("reviewer", "Review /app/. Write feedback to /app/.outbox/coder.json"),
            Turn("coder", "Read feedback and fix."),
        ],
    )],
    environment="daytona",
)

Skill Generation + Solve (BYOS)

config = RolloutConfig(
    task_path=task_path,
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "flash")],
              turns=[Turn("gen", "Generate a skill document to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "flash")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
)

User-Driven Loops

Use BaseUser or FunctionUser when one agent should run multiple rounds and Python should decide the next prompt from verifier feedback. This is the progressive-disclosure path: the user callback can stop early, read RoundResult after each soft_verify(), and optionally receive the oracle solution during setup() when oracle_access=True.

from pathlib import Path

from benchflow import FunctionUser, RolloutConfig, RoundResult, Scene


def user(round: int, instruction: str, rr: RoundResult | None) -> str | None:
    if round == 0:
        return instruction.splitlines()[0]
    if rr and (rr.rewards or {}).get("reward") == 1.0:
        return None
    return f"Tests failed:\n{rr.verifier_output}\n\nUse the full spec:\n{instruction}"


config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    user=FunctionUser(user),
    max_user_rounds=3,
    environment="daytona",
)
result = await bf.run(config)

Use multi-role Scenes when another LLM should act as the reviewer or simulated user. Use BaseUser when the loop is deterministic or verifier-driven. See progressive-disclosure.md and docs/examples/scene-patterns.ipynb.

YAML Rollout Configs

from benchflow._utils.yaml_loader import rollout_config_from_yaml

config = rollout_config_from_yaml("rollout.yaml")
result = await bf.run(config)

Registered Agents

Agent Protocol Auth Aliases
gemini ACP GEMINI_API_KEY
claude-agent-acp ACP ANTHROPIC_API_KEY claude
codex-acp ACP OPENAI_API_KEY, CODEX_API_KEY, CODEX_ACCESS_TOKEN, or host login codex
opencode ACP inferred from model/provider
openhands ACP LLM_API_KEY oh
pi-acp ACP ANTHROPIC_API_KEY pi
openclaw ACP inferred from model

Any agent can be prefixed with acpx/ to run via ACPX (e.g. acpx/gemini, acpx/claude). ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent's install, env, credentials, and skill paths are preserved.

Retry and Error Handling

Rollout.run() catches common errors:

  • TimeoutError — agent exceeded timeout
  • ConnectionError — SSH/ACP pipe closed (retried 3x with exponential backoff)
  • ACPError — agent protocol error

Evaluation-level retry with RetryConfig:

from benchflow.evaluation import Evaluation, EvaluationConfig, RetryConfig

config = EvaluationConfig(
    retry=RetryConfig(
        max_retries=2,
        wait_multiplier=2.0,
        min_wait_sec=1.0,
        max_wait_sec=30.0,
    ),
)

v0.4 Types

Sandbox Protocol

The Sandbox protocol defines the interface any sandbox backend must implement. Docker and Daytona are built-in; you can bring your own (Modal, Firecracker, E2B, etc.).

from benchflow import Sandbox, ImageBuilder, ImageConfig, ImageRef

# Sandbox is a runtime-checkable Protocol
class MySandbox:
    async def exec(self, cmd: list[str], ...) -> SandboxExecResult: ...
    async def read_file(self, path: str) -> str: ...
    async def write_file(self, path: str, content: str) -> None: ...
    async def stop(self) -> None: ...
    # ... see sandbox/ package for full protocol

assert isinstance(my_sandbox, Sandbox)  # works at runtime

Rubric + RewardFunc (Composable Rewards)

Declarative scoring via composable reward functions.

from benchflow import Rubric, RewardFunc, RewardEvent, VerifyResult
from benchflow import TestRewardFunc, StringMatchRewardFunc, LLMJudgeRewardFunc

# Built-in reward functions
test_reward = TestRewardFunc()          # runs pytest, binary pass/fail
match_reward = StringMatchRewardFunc(expected="hello world")

# Compose into a weighted Rubric
rubric = Rubric(
    reward_funcs=[test_reward, match_reward],
    weights=[0.7, 0.3],
)

# Score a workspace
result: VerifyResult = await rubric.score(rollout_dir=my_rollout_dir)
print(result.reward)      # weighted float [0.0, 1.0]
print(result.events)      # list[RewardEvent] — per-function breakdown

Adapters (Inspect AI + ORS)

Convert between BenchFlow types and external frameworks.

from benchflow import InspectAdapter, ORSAdapter, to_inspect_task, to_ors_reward

# BenchFlow Scene → Inspect AI task format
inspect_task = to_inspect_task(scene, rubric=rubric)

# BenchFlow VerifyResult → ORS reward format
ors_payload = to_ors_reward(verify_result)

Evaluation

Batch orchestration with concurrency and retries.

from benchflow import Evaluation, EvaluationConfig, EvaluationResult

# EvaluationConfig wraps multiple RolloutConfigs
config = EvaluationConfig(
    rollouts=[rollout_config_1, rollout_config_2, ...],
    concurrency=8,
    retry=RetryConfig(max_retries=2),
)
eval_result: EvaluationResult = await Evaluation.run(config)