The Rollout/Scene API is the primary way to run agent benchmarks programmatically.
uv tool install benchflow

import asyncio
import benchflow as bf
result = asyncio.run(bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview"))
print(f"Reward: {result.rewards}")
print(f"Tool calls: {result.n_tool_calls}")

Declarative configuration for a rollout — a sequence of Scenes in a shared sandbox.
from pathlib import Path
from benchflow import RolloutConfig, Scene, Role, Turn
# Single-agent (simplest)
config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    environment="daytona",
    sandbox_setup_timeout=120,
)

# Multi-scene BYOS (skill-gen → solve)
config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[
        Scene(name="prep", roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("gen", "Generate a skill for this task...")]),
        Scene(name="solve", roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
    sandbox_setup_timeout=120,
)

Set sandbox_setup_timeout when sandbox user setup needs more than the default 120 seconds.
The same field is also available on JobConfig and RuntimeConfig.
One interaction region — roles take turns executing prompts.
# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")

# Multi-role with turn order (coder-reviewer pattern)
# Agents communicate via outbox: write /app/.outbox/{recipient}.json
# Scheduler reads outbox after each turn, injects into next role's prompt
scene = Scene(
    name="coder-reviewer",
    roles=[
        Role("coder", "gemini", "gemini-3.1-flash-lite-preview"),
        Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"),
    ],
    turns=[
        Turn("coder"),  # None prompt = instruction.md
        Turn("reviewer", "Review the code. Write feedback to "
             '/app/.outbox/coder.json as {"to":"coder","content":"..."}'),
        Turn("coder", "Fix the issues."),  # reviewer's feedback auto-injected
    ],
)

The execution engine — decomposed into independently-callable phases.
from benchflow import Rollout
rollout = await Rollout.create(config)
# Full lifecycle (most common)
result = await rollout.run()
# Manual composition (for custom flows)
await rollout.setup()
await rollout.start()
await rollout.install_agent()
await rollout.connect()
await rollout.execute(prompts=["custom prompt"])
await rollout.disconnect()
await rollout.verify()
await rollout.cleanup()

Runtime-level configuration for the Agent + Environment execution path.
from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig
config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", sandbox="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()

Convenience function — multiple calling conventions:
import benchflow as bf
# 1. RolloutConfig (full control)
result = await bf.run(config)
# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", sandbox="daytona")
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)
# 3. String shortcut (simplest)
result = await bf.run(
    "gemini",
    task_path="tasks/X",
    model="gemini-3.1-flash-lite-preview",
    config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)

Rollout.run()
│
├─ setup()            — resolve config, create env object
├─ start()            — spin up sandbox, upload task files, start services
├─ install_agent()    — install agent binary, credentials, sandbox user
│                       (sandbox user setup: create non-root user, prepare
│                       small config/auth dirs, chown the workspace — no
│                       recursive copy of /root tool trees; agent binaries
│                       must live on shared prefixes like /usr/local/bin)
├─ for scene in scenes:
│   └─ _run_scene(scene)
│       ├─ setup /app/.outbox/     — (multi-role scenes only)
│       └─ for turn in scene.turns:
│           ├─ read outbox         — inject messages into prompt
│           ├─ connect_as(role)    — open ACP session for this role
│           ├─ execute(prompts)    — send prompts, collect trajectory
│           └─ disconnect()        — kill agent process, clean up
├─ verify()           — run verifier, collect rewards
└─ cleanup()          — stop sandbox
Key: disconnect() kills the agent process between scenes to prevent context bleed. Each scene gets a fresh agent session.
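The phase ordering in the diagram can be mimicked with a small stand-in. This is only a sketch under stated assumptions: `FakeRollout` and its phase log are hypothetical, nothing here calls BenchFlow, the loop follows the diagram literally by disconnecting after every turn, and the `try/finally` around cleanup is a guess about how a sandbox would be released on failure.

```python
import asyncio


class FakeRollout:
    """Hypothetical stand-in that records phase names; not BenchFlow's Rollout."""

    def __init__(self, scenes):
        # scenes: list of (scene_name, [role_name, ...]) pairs, one role per turn
        self.scenes = scenes
        self.log = []

    async def _phase(self, name):
        self.log.append(name)
        await asyncio.sleep(0)  # yield control, as a real async phase would

    async def run(self):
        try:
            await self._phase("setup")
            await self._phase("start")
            await self._phase("install_agent")
            for scene_name, turns in self.scenes:
                for role in turns:
                    await self._phase(f"connect_as:{scene_name}/{role}")
                    await self._phase("execute")
                    await self._phase("disconnect")
            await self._phase("verify")
        finally:
            # Assumption: the sandbox is stopped even when an earlier phase fails.
            await self._phase("cleanup")
        return self.log


log = asyncio.run(FakeRollout([("prep", ["gen"]), ("solve", ["solver"])]).run())
print(log)
```

Recording the phases in a list makes it easy to check that verification runs once, after the last disconnect, and that cleanup always comes last.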
| Pattern | Roles | Turns | Communication | Example |
|---|---|---|---|---|
| Single-turn | 1 | 1 | — | Baseline benchmark |
| Multi-turn | 1 | 2+ | Same session, sequential prompts | Self-review |
| Multi-round | 2+ | 2+ | Outbox files between roles | Coder + Reviewer |
Multi-turn = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.
Multi-round = different agents exchange turns. Use when tasks need multiple perspectives (code review, client-advisor). The scheduler reads outbox files and injects messages.
Both use the same API — RolloutConfig with different Scene configurations.
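To make the outbox hand-off between roles more concrete, here is a stdlib-only sketch. The helper names `drain_outbox` and `build_prompt`, the delete-after-read behaviour, and the prompt layout are all assumptions for illustration; only the file location pattern and the `{"to": ..., "content": ...}` shape come from the text above.

```python
import json
import tempfile
from pathlib import Path


def drain_outbox(outbox: Path, recipient: str) -> list[str]:
    """Return and remove the message addressed to `recipient`, if any."""
    path = outbox / f"{recipient}.json"
    if not path.exists():
        return []
    message = json.loads(path.read_text())
    path.unlink()  # assumption: a message is delivered once, then cleared
    return [message["content"]]


def build_prompt(base_prompt: str, messages: list[str]) -> str:
    """Prepend any delivered messages to the role's prompt."""
    if not messages:
        return base_prompt
    inbox = "\n".join(f"- {m}" for m in messages)
    return f"Messages from other roles:\n{inbox}\n\n{base_prompt}"


with tempfile.TemporaryDirectory() as tmp:
    outbox = Path(tmp) / ".outbox"
    outbox.mkdir()
    # The reviewer's turn leaves feedback addressed to the coder.
    (outbox / "coder.json").write_text(
        json.dumps({"to": "coder", "content": "Handle the empty-input case."})
    )
    # Before the coder's next turn, a scheduler would drain the outbox.
    prompt = build_prompt("Fix the issues.", drain_outbox(outbox, "coder"))
    leftover = drain_outbox(outbox, "coder")

print(prompt)
```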
config = RolloutConfig(
    task_path=task_path,
    scenes=[Scene(
        roles=[Role("coder", "gemini", "flash"), Role("reviewer", "gemini", "flash")],
        turns=[
            Turn("coder"),
            Turn("reviewer", "Review /app/. Write feedback to /app/.outbox/coder.json"),
            Turn("coder", "Read feedback and fix."),
        ],
    )],
    environment="daytona",
)

config = RolloutConfig(
    task_path=task_path,
    scenes=[
        Scene(name="skill-gen",
              roles=[Role("gen", "gemini", "flash")],
              turns=[Turn("gen", "Generate a skill document to /app/generated-skill.md")]),
        Scene(name="solve",
              roles=[Role("solver", "gemini", "flash")],
              turns=[Turn("solver")]),
    ],
    environment="daytona",
)

Use BaseUser or FunctionUser when one agent should run multiple rounds and
Python should decide the next prompt from verifier feedback. This is the
progressive-disclosure path: the user callback can stop early, read
RoundResult after each soft_verify(), and optionally receive the oracle
solution during setup() when oracle_access=True.
from pathlib import Path

import benchflow as bf
from benchflow import FunctionUser, RolloutConfig, RoundResult, Scene


def user(round: int, instruction: str, rr: RoundResult | None) -> str | None:
    if round == 0:
        return instruction.splitlines()[0]  # first round: send only the first line
    if rr and (rr.rewards or {}).get("reward") == 1.0:
        return None  # verifier passed: stop early
    return f"Tests failed:\n{rr.verifier_output}\n\nUse the full spec:\n{instruction}"


config = RolloutConfig(
    task_path=Path("tasks/my-task"),
    scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
    user=FunctionUser(user),
    max_user_rounds=3,
    environment="daytona",
)
result = await bf.run(config)

Use multi-role Scenes when another LLM should act as the reviewer or simulated
user. Use BaseUser when the loop is deterministic or verifier-driven. See
progressive-disclosure.md and
docs/examples/scene-patterns.ipynb.
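The verifier-driven loop can be pictured with a toy driver. Everything below is a hypothetical stand-in: `FakeRoundResult`, `fake_soft_verify`, and `drive` are invented for illustration and only mirror the callback shape shown earlier (round index, instruction, previous result). How BenchFlow schedules rounds internally is not described in this document.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRoundResult:
    """Hypothetical stand-in exposing the two fields the callback reads."""
    rewards: dict = field(default_factory=dict)
    verifier_output: str = ""


def user(round_no, instruction, rr):
    if round_no == 0:
        return instruction.splitlines()[0]  # terse first prompt
    if rr and (rr.rewards or {}).get("reward") == 1.0:
        return None  # tests pass: stop early
    return f"Tests failed:\n{rr.verifier_output}\n\nUse the full spec:\n{instruction}"


def fake_soft_verify(prompt):
    # Assumption: the pretend agent only succeeds once it has seen the full spec.
    if "full spec" in prompt:
        return FakeRoundResult({"reward": 1.0}, "all tests passed")
    return FakeRoundResult({"reward": 0.0}, "test_edge_case FAILED")


def drive(instruction, max_user_rounds=3):
    """Ask the callback for a prompt each round until it returns None."""
    prompts, rr = [], None
    for round_no in range(max_user_rounds):
        prompt = user(round_no, instruction, rr)
        if prompt is None:
            break
        prompts.append(prompt)
        rr = fake_soft_verify(prompt)
    return prompts, rr


prompts, last = drive("Sort the list.\nEmpty input must return [].")
print(len(prompts), last.rewards)
```

In this toy run the callback discloses the full spec only after the first failure, then stops one round early because the reward reached 1.0.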
from benchflow._utils.yaml_loader import rollout_config_from_yaml
config = rollout_config_from_yaml("rollout.yaml")
result = await bf.run(config)

| Agent | Protocol | Auth | Aliases |
|---|---|---|---|
| gemini | ACP | GEMINI_API_KEY | — |
| claude-agent-acp | ACP | ANTHROPIC_API_KEY | claude |
| codex-acp | ACP | OPENAI_API_KEY, CODEX_API_KEY, CODEX_ACCESS_TOKEN, or host login | codex |
| opencode | ACP | inferred from model/provider | — |
| openhands | ACP | LLM_API_KEY | oh |
| pi-acp | ACP | ANTHROPIC_API_KEY | pi |
| openclaw | ACP | inferred from model | — |
Any agent can be prefixed with acpx/ to run via ACPX (e.g. acpx/gemini, acpx/claude). ACPX is a headless ACP client with persistent sessions and crash recovery. The underlying agent's install, env, credentials, and skill paths are preserved.
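As a rough illustration of how an agent string such as `acpx/claude` might be interpreted, the sketch below strips the optional prefix and expands aliases. The alias map is copied from the table above, but `resolve_agent` is a hypothetical helper, not a BenchFlow function.

```python
# Aliases copied from the agent table; the helper itself is hypothetical.
ALIASES = {
    "claude": "claude-agent-acp",
    "codex": "codex-acp",
    "oh": "openhands",
    "pi": "pi-acp",
}


def resolve_agent(name: str) -> tuple[bool, str]:
    """Split an optional 'acpx/' prefix and expand a known alias."""
    via_acpx = name.startswith("acpx/")
    bare = name[len("acpx/"):] if via_acpx else name
    return via_acpx, ALIASES.get(bare, bare)


print(resolve_agent("acpx/claude"))  # (True, 'claude-agent-acp')
print(resolve_agent("gemini"))       # (False, 'gemini')
```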
Rollout.run() catches common errors:
- TimeoutError — agent exceeded timeout
- ConnectionError — SSH/ACP pipe closed (retried 3x with exponential backoff)
- ACPError — agent protocol error
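The connection retry behaviour listed above could look roughly like the following. This is a generic sketch, not BenchFlow's code: `with_retries`, the base delay, and the doubling factor are assumptions, and the source does not say whether "3x" counts attempts or retries, so the sketch treats it as three retries after the first attempt.

```python
import asyncio


async def with_retries(op, retries=3, base_delay=0.01, sleep=asyncio.sleep):
    """Call `op`, retrying ConnectionError up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return await op()
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            await sleep(base_delay * 2 ** attempt)


calls = []


async def flaky():
    """Fail twice with ConnectionError, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("pipe closed")
    return "ok"


result = asyncio.run(with_retries(flaky))
print(result, len(calls))
```

Other exception types, such as a timeout, pass straight through because only `ConnectionError` is caught.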
Evaluation-level retry with RetryConfig:
from benchflow.evaluation import Evaluation, EvaluationConfig, RetryConfig
config = EvaluationConfig(
    retry=RetryConfig(
        max_retries=2,
        wait_multiplier=2.0,
        min_wait_sec=1.0,
        max_wait_sec=30.0,
    ),
)

The Sandbox protocol defines the interface any sandbox backend must implement.
Docker and Daytona are built-in; you can bring your own (Modal, Firecracker, E2B, etc.).
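To show how a structural protocol lets a custom backend plug in without inheriting from anything, here is a stdlib-only sketch. `MiniSandbox` is a hypothetical, trimmed-down protocol and `InMemorySandbox` a toy backend; neither is part of BenchFlow, and the real interface has more methods.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class MiniSandbox(Protocol):
    """Hypothetical, trimmed-down protocol used only for this illustration."""

    async def read_file(self, path: str) -> str: ...
    async def write_file(self, path: str, content: str) -> None: ...
    async def stop(self) -> None: ...


class InMemorySandbox:
    """Toy backend that keeps files in a dict; it never subclasses MiniSandbox."""

    def __init__(self):
        self.files = {}

    async def read_file(self, path: str) -> str:
        return self.files[path]

    async def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    async def stop(self) -> None:
        self.files.clear()


async def demo():
    sandbox = InMemorySandbox()
    await sandbox.write_file("/app/hello.txt", "hi")
    return await sandbox.read_file("/app/hello.txt")


# isinstance() only checks that the method names exist, not their signatures.
is_sandbox = isinstance(InMemorySandbox(), MiniSandbox)
content = asyncio.run(demo())
print(is_sandbox, content)
```

Because the runtime check looks at method names only, a type checker is still needed to catch signature mismatches in a custom backend.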
from benchflow import Sandbox, ImageBuilder, ImageConfig, ImageRef
# Sandbox is a runtime-checkable Protocol
class MySandbox:
    async def exec(self, cmd: list[str], ...) -> SandboxExecResult: ...
    async def read_file(self, path: str) -> str: ...
    async def write_file(self, path: str, content: str) -> None: ...
    async def stop(self) -> None: ...
    # ... see sandbox/ package for full protocol

assert isinstance(my_sandbox, Sandbox)  # works at runtime

Declarative scoring via composable reward functions.
from benchflow import Rubric, RewardFunc, RewardEvent, VerifyResult
from benchflow import TestRewardFunc, StringMatchRewardFunc, LLMJudgeRewardFunc
# Built-in reward functions
test_reward = TestRewardFunc() # runs pytest, binary pass/fail
match_reward = StringMatchRewardFunc(expected="hello world")
# Compose into a weighted Rubric
rubric = Rubric(
    reward_funcs=[test_reward, match_reward],
    weights=[0.7, 0.3],
)
# Score a workspace
result: VerifyResult = await rubric.score(rollout_dir=my_rollout_dir)
print(result.reward)  # weighted float [0.0, 1.0]
print(result.events)  # list[RewardEvent] — per-function breakdown

Convert between BenchFlow types and external frameworks.
from benchflow import InspectAdapter, ORSAdapter, to_inspect_task, to_ors_reward
# BenchFlow Scene → Inspect AI task format
inspect_task = to_inspect_task(scene, rubric=rubric)
# BenchFlow VerifyResult → ORS reward format
ors_payload = to_ors_reward(verify_result)

Batch orchestration with concurrency and retries.
from benchflow import Evaluation, EvaluationConfig, EvaluationResult
from benchflow.evaluation import RetryConfig  # same import path as the retry example above

# EvaluationConfig wraps multiple RolloutConfigs
config = EvaluationConfig(
    rollouts=[rollout_config_1, rollout_config_2, ...],
    concurrency=8,
    retry=RetryConfig(max_retries=2),
)
eval_result: EvaluationResult = await Evaluation.run(config)
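A bounded-concurrency batch like the one configured above can be approximated with an `asyncio.Semaphore`. This is a generic sketch, not how `Evaluation.run` is implemented: `run_batch` and `make_job` are hypothetical, and each job is just a short sleep standing in for one rollout.

```python
import asyncio


async def run_batch(jobs, concurrency):
    """Run zero-argument coroutine functions with at most `concurrency` in flight."""
    gate = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)  # track the highest observed parallelism
            try:
                return await job()
            finally:
                in_flight -= 1

    # gather() returns results in submission order, whatever the finish order.
    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


def make_job(i):
    async def job():
        await asyncio.sleep(0.01)  # stands in for one rollout
        return i * i
    return job


results, peak = asyncio.run(run_batch([make_job(i) for i in range(10)], concurrency=3))
print(results, peak)
```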