Releases: benchflow-ai/benchflow
v0.4.0 — Rollout/Sandbox architecture release
What changed
BenchFlow v0.4.0 is the architecture release. It is much larger than the hosted-environment adapter alone: the release moves BenchFlow onto the Rollout/Sandbox core, removes the old Harbor-centered framing, adds composable rewards and external adapters, refreshes CLI/docs, and closes a set of dogfood regressions found while preparing the v0.4 boundary.
Compared with v0.3.3, this release changes 210 files with roughly 18k lines added and 6k removed.
Install
pip install benchflow==0.4.0
PyPI is published and verified as benchflow==0.4.0.
Core architecture
- Makes `Rollout` the canonical execution unit and `Evaluation` the canonical batch/eval accounting path. Legacy names such as `Trial`, `TrialConfig`, `Job`, and `JobConfig` remain as compatibility aliases, but docs and examples are now Rollout-first.
- Adds a unified Scene/Role/Turn type layer with per-role configuration, `parallel_group`, and cleaner role/scene modeling.
- Introduces the BenchFlow-native `Sandbox` and `ImageBuilder` protocols, with Docker, Daytona, and Modal implementations. Harbor is no longer a core dependency or the conceptual center of the runtime.
- Consolidates the public API and module layout around `benchflow.rollout`, `benchflow.evaluation`, `benchflow.sandbox`, `benchflow.task`, `benchflow.traces`, and provider/agent utility modules.
- Adds reusable task path/config/verifier helpers and trace import/generation infrastructure.
Public PRs: #274, #294. Refactor sub-PRs: #261, #262, #268.
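As a rough illustration of what a protocol-based sandbox layer can look like, the sketch below defines a structural interface and an in-memory backend that satisfies it. The method names (`start`, `exec`, `stop`), `FakeSandbox`, and `run_in` are assumptions made for this example; they are not taken from the BenchFlow source, whose actual `Sandbox` protocol may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Sandbox(Protocol):
    """Hypothetical sandbox interface; the method names are assumptions."""

    def start(self) -> None: ...
    def exec(self, command: str) -> int: ...
    def stop(self) -> None: ...


class FakeSandbox:
    """In-memory stand-in that records commands instead of running them."""

    def __init__(self) -> None:
        self.running = False
        self.history: list[str] = []

    def start(self) -> None:
        self.running = True

    def exec(self, command: str) -> int:
        if not self.running:
            raise RuntimeError("sandbox is not running")
        self.history.append(command)
        return 0  # pretend every command exits cleanly

    def stop(self) -> None:
        self.running = False


def run_in(sandbox: Sandbox, commands: list[str]) -> list[int]:
    """Run commands in any backend that satisfies the protocol."""
    sandbox.start()
    try:
        return [sandbox.exec(command) for command in commands]
    finally:
        sandbox.stop()  # always release the sandbox, even on failure
```

Because the interface is structural, a Docker, Daytona, or Modal backend would only need to provide the same methods; no shared base class is required.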
Rewards and verifiers
- Adds a composable rewards package with `Rubric`, `RewardFunc`, `RewardEvent`, built-in reward functions, rubric config loading, and file-reader helpers.
- Adds first-class LLM-as-judge verifier support, including dense reward events and rubric-style judging flows.
- Fixes reward/metrics edge cases such as `None` rewards and non-finite ORS JSON values.
Public PRs: #274, #277, #294. Refactor sub-PR: #266.
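The idea of composing reward functions while guarding against `None` and non-finite values can be sketched as follows. This is a minimal illustration, not the BenchFlow rewards package: `RewardFn`, `sanitize`, `combine`, and the example reward functions are hypothetical names, and mapping bad values to 0.0 is an assumption about one reasonable policy.

```python
import math
from typing import Callable, Optional

# Hypothetical signature: a reward function scores an agent output.
RewardFn = Callable[[str], Optional[float]]


def sanitize(value: Optional[float]) -> float:
    """Map None, NaN, and infinities to 0.0 so aggregation stays finite."""
    if value is None or not math.isfinite(value):
        return 0.0
    return float(value)


def combine(weighted: list[tuple[float, RewardFn]], output: str) -> float:
    """Weighted average of sanitized reward values."""
    total_weight = sum(weight for weight, _ in weighted)
    if total_weight == 0:
        return 0.0
    score = sum(weight * sanitize(fn(output)) for weight, fn in weighted)
    return score / total_weight


def exact_match(output: str) -> Optional[float]:
    return 1.0 if output.strip() == "42" else 0.0


def broken_judge(output: str) -> Optional[float]:
    return float("nan")  # simulates a judge that returned a non-finite value


def missing_reward(output: str) -> Optional[float]:
    return None  # simulates a verifier that produced no reward


rubric = [(2.0, exact_match), (1.0, broken_judge), (1.0, missing_reward)]
score = combine(rubric, "42")
```

Here the two faulty components contribute zero instead of poisoning the aggregate with NaN.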
External adapters and hosted environments
- Adds external framework adapters for Inspect AI and OpenAI Reinforcement Store / ORS-style JSON.
- Adds first-class hosted environment support for PrimeIntellect / Verifiers:
  - `bench eval create --source-env ...`
  - `bench environment list --hub primeintellect`
  - `bench environment show`
  - `bench environment inspect`
- Hosted environments keep native identity (`env_uid`, `hub_url`) and provider-owned harness/sandbox behavior instead of being treated as BenchFlow task directories or compatibility shims.
- Hosted-env runs install the versioned provider package into an isolated venv, run through `vf-eval`, and record logs, command, reward, tool calls, provider errors, and metadata.
Public PRs: #274, #290, #294. Refactor sub-PR: #271.
CLI and docs
- Promotes `bench eval create` as the first-class execution command and deprecates old `bench run` style flows.
- Standardizes on `--sandbox` terminology and removes short one-letter flags in favor of explicit long flags.
- Refreshes README, concepts, getting-started, running-benchmarks, Python API, CLI reference, task-authoring, skill-eval, and integration-test docs for the v0.4 surface.
- Removes stale Trial/Harbor migration language from the docs and examples.
- Adds the `skills/citation-management` eval asset and a v0.4 skill-eval report.
Public PRs: #274, #281, #282, #284, #294.
Agents, auth, and sandbox hardening
- Adds Codex subscription/access-token auth support in Daytona and native `codex-acp` flows:
  - auto-inherits `CODEX_ACCESS_TOKEN` / `CODEX_API_KEY`
  - maps `CODEX_API_KEY` to `OPENAI_API_KEY` for native Codex auth writing
  - keeps subscription/auth-token paths out of custom OpenAI-compatible endpoints
- Auto-loads `.env` for CLI execution so users do not need `set -a && source .env && set +a` for common provider variables.
- Improves provider/base-url forwarding for agent containers and custom endpoints.
- Fixes skill double-deploy behavior for Dockerfile-injected tasks.
- Hardens shell usage in scene/snapshot helpers with quoted paths and path traversal validation.
- Fixes Modal/Daytona/runtime edge cases found in dogfood, including Modal optional dependencies, sandbox-user normalization, remote repo helper imports, and eval-list/skill-eval summary handling.
Public PRs: #285, #286, #294, #296. Original hardening/fix PRs: #230, #242.
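The credential inheritance and key mapping described above might look roughly like the sketch below. `build_agent_env` is a hypothetical helper written for this example, and the precedence rule (an explicit `OPENAI_API_KEY` wins over the mapped `CODEX_API_KEY`) is an assumption; the release notes do not state how conflicts are resolved.

```python
from typing import Mapping

# Variables named in the release notes as auto-inherited.
CODEX_VARS = ("CODEX_ACCESS_TOKEN", "CODEX_API_KEY")


def build_agent_env(host_env: Mapping[str, str]) -> dict[str, str]:
    """Forward Codex credentials and mirror CODEX_API_KEY to OPENAI_API_KEY."""
    agent_env = {name: host_env[name] for name in CODEX_VARS if host_env.get(name)}
    if host_env.get("OPENAI_API_KEY"):
        # Assumption: an explicitly set OPENAI_API_KEY takes precedence.
        agent_env["OPENAI_API_KEY"] = host_env["OPENAI_API_KEY"]
    elif "CODEX_API_KEY" in agent_env:
        agent_env["OPENAI_API_KEY"] = agent_env["CODEX_API_KEY"]
    return agent_env
```

Only the listed variables are copied, so unrelated host variables such as `PATH` never reach the agent environment through this helper.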
Benchmarks and integrations
- Carries forward the Harvey LAB adapter/converter work and aligns benchmark runners with the v0.4 Rollout/Sandbox API.
- Updates ProgramBench integration to the new source layout and v0.4 runner conventions.
- Adds integration-suite release readiness checks and adapter evidence tooling.
- Updates conformance scripts and example scripts for the new auth, sandbox, and CLI conventions.
Public PRs: #239, #237, #274, #294, #296.
Validation
Release artifact validation from the clean v0.4.0 tag worktree:
uv build
uv run --extra dev python -m pytest tests/test_hosted_env.py tests/test_reexport.py tests/test_yaml_config.py
uv run ruff check .
uv run ty check src/
Result: build succeeded, focused release tests passed (31 passed), ruff passed, and ty passed.
GitHub Actions on main for the merge commit passed.
Hosted PrimeIntellect / Verifiers smoke passed with explicit provider sampling args:
uv run bench eval create \
--source-env primeintellect/general-agent \
--source-env-version 0.1.1 \
--source-env-arg task=calendar_scheduling_t0 \
--agent gemini \
--model google/gemini-2.5-flash-lite \
--source-env-sampling-arg reasoning_effort=minimal
Result: reward 1.0, tool calls 2, no Verifiers error.
A smoke without reasoning_effort=minimal still completed the harness path but scored 0.0; provider/model-specific sampling options are now explicit and should be passed when needed.
Merged PRs
- #274: v0.4 architecture consolidation: unified types, Rollout execution path, Sandbox protocol, rewards, agent capabilities, Inspect/ORS adapters, docs cleanup. Consolidates #261, #262, #265, #266, #268, and #271.
- #285: backports skills double-deploy and shell-injection hardening from #230 and #242.
- #294: v0.4 refactor integration on `main`: module consolidation, package extras, `.env` loading, trace import, task helpers, docs, skill eval assets, sandbox backends.
- #296: Codex subscription/access-token auth support in Daytona and native Codex flows.
- #290: hosted PrimeIntellect / Verifiers environment source adapter and `bench environment` hosted-hub commands.
Compatibility notes
- The old Trial/Job names are still available as aliases, but new code should use Rollout/Evaluation terminology.
- `--sandbox` is the canonical sandbox flag.
- Short CLI flags were removed; use explicit long flags.
- Hosted environments are provider-owned harnesses. BenchFlow records and orchestrates them, but `--sandbox` remains for native BenchFlow task sources.
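For readers unfamiliar with compatibility aliases, the pattern is simply a second name bound to the same class. The sketch below uses a stand-in `Rollout` dataclass with made-up fields; it shows the general Python technique and does not reproduce BenchFlow's actual class definitions.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Stand-in for the canonical class; the fields are illustrative only."""

    task: str
    agent: str


# Legacy name kept as an alias: both names refer to the same class object,
# so isinstance checks and equality behave identically for old and new code.
Trial = Rollout

legacy = Trial(task="demo-task", agent="demo-agent")
current = Rollout(task="demo-task", agent="demo-agent")
```

Old code that constructs `Trial(...)` keeps working, while new code can use the `Rollout` name directly.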
v0.3.2 — BaseUser, verifier hardening, DinD compose
Highlights
- `BaseUser` progressive-disclosure abstraction (#194): Python callback drives multi-round agent runs. Built for SWE-bench Pro use case (Josh @ GitHub) and as parity answer to Harbor #1316 in the no-second-LLM case. See `docs/progressive-disclosure.md`.
- Per-task `[verifier.hardening]` opt-outs in `task.toml` (#194): tasks with legitimate `conftest.py` setups (qutebrowser-style) opt out of specific cleanup steps. Achieves 5/5 SWE-bench Pro oracle on hardened verifier.
- DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
- `--rootdir=/app` in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
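A callback-driven multi-round loop of the kind described for progressive disclosure can be sketched as below. Everything here is hypothetical: `UserCallback`, `run_rounds`, and the toy agent are invented for illustration and are not the `BaseUser` API, which is documented in `docs/progressive-disclosure.md`.

```python
from typing import Callable, Optional

# Hypothetical callback: given the round index and the agent's last output,
# return the next prompt, or None to end the run.
UserCallback = Callable[[int, str], Optional[str]]


def run_rounds(
    agent: Callable[[str], str],
    user: UserCallback,
    first_prompt: str,
    max_rounds: int = 5,
) -> list[str]:
    """Alternate agent turns with user callbacks until the user stops."""
    outputs: list[str] = []
    prompt = first_prompt
    for round_index in range(max_rounds):
        output = agent(prompt)
        outputs.append(output)
        next_prompt = user(round_index, output)
        if next_prompt is None:
            break
        prompt = next_prompt
    return outputs


hints = ["hint 1", "hint 2"]


def disclose(round_index: int, last_output: str) -> Optional[str]:
    """Reveal one more hint per round, then stop."""
    return hints[round_index] if round_index < len(hints) else None


def toy_agent(prompt: str) -> str:
    return prompt.upper()  # stand-in for a real agent call


outputs = run_rounds(toy_agent, disclose, "start")
```

No second LLM is involved: the "user" side is plain Python that decides what to disclose next.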
Fixes
- `cfg.agent_env` reaches `connect_as()` (#191, closes #190): YAML-supplied provider creds now reach the agent.
- DinD env-file path mismatch (#198): `shlex.join()` was quoting `$$` literally so written/read paths diverged; switched to `uuid.uuid4()` for unique paths.
- OpenHands sandbox launch + ACP CLI path (#182).
- Stop copying root tool installs into sandbox home (#181, closes #178).
- `sandbox_setup_timeout` wired through configs (#180).
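The `$$` quoting issue can be reproduced with the standard library alone. In the sketch below, `shlex.join` wraps the path in single quotes, so a shell would see the literal text `$$` rather than its PID; the `unique_env_path` helper and its `benchflow-env-` file prefix are hypothetical and only illustrate the uuid-based alternative.

```python
import shlex
import uuid

# shlex.join quotes every argument, so "$$" is passed to the shell as
# literal text instead of being expanded to the process ID.
quoted = shlex.join(["cat", "/tmp/env.$$"])


def unique_env_path(directory: str = "/tmp") -> str:
    """Build a unique path in Python, with no reliance on shell expansion."""
    return f"{directory}/benchflow-env-{uuid.uuid4().hex}"


path_a = unique_env_path()
path_b = unique_env_path()
```

Generating the name once in Python and passing the same string to both the writer and the reader avoids the mismatch entirely.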
Chores
SWE-bench Pro validation
Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.
Install
pip install benchflow==0.3.2
v0.2.3 — verifier hardening follow-ups
Added
- `benchmarks/tb2_multiturn-claude-haiku45.yaml`: shipped config for the README's TB2 multi-turn Claude result.
- Daytona resource clamping via `BENCHFLOW_DAYTONA_MAX_CPUS` / `MAX_MEMORY_MB`.
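Environment-driven clamping can be sketched as follows. Only the `BENCHFLOW_DAYTONA_MAX_CPUS` variable name comes from the notes above; the `clamp_cpus` helper and its handling of missing or invalid values (fall back to the requested amount) are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def clamp_cpus(requested: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Cap a task's CPU request at the limit given in the environment."""
    env = os.environ if env is None else env
    raw = env.get("BENCHFLOW_DAYTONA_MAX_CPUS")
    if raw is None:
        return requested  # no limit configured
    try:
        limit = int(raw)
    except ValueError:
        return requested  # assumption: ignore unparsable limits
    if limit <= 0:
        return requested  # assumption: ignore non-positive limits
    return min(requested, limit)
```

A task requesting 8 CPUs with the limit set to 4 would be created with 4; a task requesting 2 is left unchanged.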
Changed
- Renamed `skillsbench-claude-glm5.yaml` → `skillsbench-claude-glm51.yaml` to match the model ID.
- `codex --login` correction in `docs/getting-started.md`.
- Restricted sdist build to `src/`, `tests/`, and metadata.
Fixed
- Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
- Preserve trusted verifier path entries and workspace answer files.
- Redirect oracle output to container log.
- Align YAML path resolution to config file location.
v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks
Added
- Sandbox hardening tiers 1–4: layered defense blocking F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected `conftest.py` / `.pth` / `sitecustomize.py` files outside `/tests` plus `*.py` drops in `/tmp` and `/var/tmp` before every verifier run.
- `labs/reward-hack-matrix`: end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in `sweep_0.2.0_vs_0.2.2.json`. Includes per-trial timeout and a long-lived worker pool that keeps local RAM ~1 GB regardless of trial count.
Fixed
- Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
- `PYTHONHOME=""` crashing `Py_Initialize`: empty value is NOT equivalent to unset; dropped from `VERIFIER_ENV`.
- `PYTHONSAFEPATH=1` breaking matplotlib `setupext.py` imports: dropped from `VERIFIER_ENV`.
- `pytest_plugins` AttributeError during hardened verify: guarded with `getattr(...)`.
- matplotlib LFS `EOVERFLOW` on qhull build artifacts: replaced `rmtree + copytree` fallback with `shutil.copytree(dirs_exist_ok=True)` merge-copy in `harden_before_verify` so inert overlay stragglers no longer block workspace restore.
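The merge-copy behavior of `shutil.copytree(dirs_exist_ok=True)` can be demonstrated in isolation. The `restore_workspace` helper and the file names below are illustrative assumptions; the real logic lives in `harden_before_verify` and is not reproduced here.

```python
import shutil
import tempfile
from pathlib import Path


def restore_workspace(snapshot: Path, workspace: Path) -> None:
    """Copy snapshot files over the workspace without deleting it first.

    Same-named files are overwritten; files that exist only in the
    workspace are left in place.
    """
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp, "snapshot")
    workspace = Path(tmp, "workspace")
    (snapshot / "src").mkdir(parents=True)
    (snapshot / "src" / "main.py").write_text("original\n")
    (workspace / "src").mkdir(parents=True)
    (workspace / "src" / "main.py").write_text("tampered\n")
    (workspace / "build.log").write_text("straggler\n")

    restore_workspace(snapshot, workspace)

    restored = (workspace / "src" / "main.py").read_text()
    straggler_kept = (workspace / "build.log").exists()
```

Because nothing is removed up front, a file that cannot be deleted no longer aborts the restore; the trade-off is that leftover files stay in the workspace.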
Results
BenchJack-shaped exploit success rate on 666 real tasks:
| benchmark | tasks | 0.2.0 EXPLT | 0.2.2 EXPLT | Δ |
|---|---|---|---|---|
| `skillsbench` | 77 | 16 (20.8%) | 0 (0%) | −20.8 pp |
| `swebench-verified` | 500 | 119 (23.8%) | 1 (0.2%)¹ | −23.6 pp |
| `terminal-bench-2` | 89 | 82 (92.1%) | 0 (0%) | −92.1 pp |
| total | 666 | 217 (32.6%) | 1 (0.15%) | −32.4 pp |
¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.
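The totals and percentages in the table can be recomputed from the per-benchmark counts. The short script below uses only the numbers reported above; the variable names are arbitrary.

```python
# (tasks, exploits on 0.2.0, exploits on 0.2.2), as reported in the table.
results = {
    "skillsbench": (77, 16, 0),
    "swebench-verified": (500, 119, 1),
    "terminal-bench-2": (89, 82, 0),
}


def rate(hits: int, tasks: int) -> float:
    """Exploit success rate as a percentage."""
    return 100 * hits / tasks


total_tasks = sum(tasks for tasks, _, _ in results.values())
total_before = sum(before for _, before, _ in results.values())
total_after = sum(after for _, _, after in results.values())

rate_before = round(rate(total_before, total_tasks), 1)
rate_after = round(rate(total_after, total_tasks), 2)
```

Running each task once per version gives 2 × 666 = 1332 trials, matching the sweep size quoted for `labs/reward-hack-matrix`.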
Install: pip install benchflow==0.2.2
Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md
v0.2.1 — Sandbox hardening on by default
Added
- Sandbox hardening on by default: `sandbox_user` now defaults to `"agent"` (was `None`/root). Blocks conftest-hook and answer-lookup exploit patterns.
- Path lockdown: new `sandbox_locked_paths` parameter makes `/solution` and `/tests` read-only before the verifier runs, blocking `.pth`-injection and similar pre-verify tampering.
- Verifier failure isolation: agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
- `labs/benchjack-sandbox-hardening`: cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 `.pth`-injection) and their defenses.
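Verifier failure isolation amounts to recording the two failure sources in separate fields. The sketch below is one way to express that idea; `RunResult` and `run_isolated` are hypothetical names and do not mirror BenchFlow's result types.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Hypothetical result record with separate error slots."""

    agent_error: Optional[str] = None
    verifier_error: Optional[str] = None
    reward: Optional[float] = None


def run_isolated(agent: Callable[[], None], verifier: Callable[[], float]) -> RunResult:
    """Run agent then verifier, capturing each failure independently."""
    result = RunResult()
    try:
        agent()
    except Exception as exc:
        result.agent_error = repr(exc)
    try:
        result.reward = verifier()
    except Exception as exc:
        result.verifier_error = repr(exc)
    return result


def ok_agent() -> None:
    return None


def crashing_verifier() -> float:
    raise RuntimeError("verifier crashed")


outcome = run_isolated(ok_agent, crashing_verifier)
```

With this layout, a crashed verifier leaves `agent_error` empty, so the record still shows that the agent itself completed.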
Fixed
- Oracle runs as `sandbox_user`: oracle agent now respects path lockdown instead of running as root and bypassing it.
- Multi-endpoint provider routing: providers with multiple endpoints now route by the agent's native API protocol.
- Stale API key shadowing subscription auth: emits a warning when `ANTHROPIC_API_KEY` env var is present alongside `claude login` credentials.
- pytest `ini`-injection bypass: closed a verifier hardening edge case.
Changed
- Version is now single-sourced via `importlib.metadata`; no more duplicate version string in `__init__.py`.
- User-facing docs: new `docs/` directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to `docs/`.
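Single-sourcing a version through `importlib.metadata` usually looks like the snippet below. The `package_version` wrapper and its fallback string are assumptions for this example; the exact code in BenchFlow's `__init__.py` is not shown in these notes.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(dist_name: str, fallback: str = "0.0.0+unknown") -> str:
    """Read the installed distribution's version from its metadata."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when the package is not installed, for example when
        # running from a source checkout without an editable install.
        return fallback


# A package would typically expose: __version__ = package_version("benchflow")
```

The version then lives only in the build metadata, so a release bump cannot leave a stale string behind in the source tree.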
Install: pip install benchflow==0.2.1
v0.2.0 — First Public Release
First public release 🎉
Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.
Install
pip install benchflow
Highlights
- Multi-agent, multi-provider, multi-auth matrix: 12 end-to-end tested agent × model × provider × auth combinations (`docs/tested-agents.md`)
- Subscription auth: use `claude login`, `codex --login`, `gemini` OAuth directly, no API keys required
- Vertex AI support: ADC auth for `google-vertex/`, `anthropic-vertex/`, `vertex-zai/` prefixed models
- Provider registry: data-driven custom LLM endpoints via `ProviderConfig`. Adding a new provider = single registry entry
- vLLM support: `benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...`
- SDK refactor: `SDK.run()` decomposed into focused private methods; core modules extracted (`_models.py`, `_trajectory.py`, `_env_setup.py`, `_scoring.py`)
- Harbor switched to PyPI: `harbor==0.3.0` pin, no more git URL dependency
- `benchmarks/` directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
- `benchflow tasks init` / `tasks check` commands for scaffolding and validating new tasks
- Oracle agent support: run `solution/solve.sh` directly for task validation
- 232 unit tests (up from 66 in 0.1.x)
Benchmark Results
| Benchmark | Agent | Model | Score |
|---|---|---|---|
| TB2 single-turn | `codex-acp` | GPT-5.4 | 69.7% (62/89) |
| TB2 single-turn | `claude-agent-acp` | Sonnet 4.6 | 58.4% (52/89) |
| TB2 multi-turn | `codex-acp` | GPT-5.4 | 62.9% (56/89) |
| TB2 multi-turn | `claude-agent-acp` | Haiku 4.5 | 37.1% (33/89) |
| SkillsBench | `codex-acp` | GPT-5.4 | 37.2% (32/86) |
Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).
Security fixes
- API keys no longer leak in `ps aux`
- ADC credentials fixed for `sandbox_user` setups (#111)
- Daytona sandbox orphan cleanup with `--max-age` filter (#102)
- `litellm` upgraded to 1.83.0 for CVE-2026-35030
- `cryptography` upgraded to 46.0.7 for CVE-2026-39892
- 13 transitive Dependabot alerts resolved
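Secrets typically show up in `ps aux` when they are passed as command-line arguments. One common remedy, sketched below, is to hand them to the child process through its environment instead. The notes do not say how BenchFlow fixed the leak, so this is a general illustration under that assumption; `run_with_secret` and `DEMO_API_KEY` are made-up names.

```python
import os
import subprocess
import sys


def run_with_secret(secret: str) -> str:
    """Pass a secret via the child's environment, not its argument list."""
    child_env = {**os.environ, "DEMO_API_KEY": secret}
    completed = subprocess.run(
        # The argv below contains no secret, so process listings that show
        # command lines do not reveal it.
        [sys.executable, "-c", "import os; print(len(os.environ['DEMO_API_KEY']))"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


key_length = run_with_secret("sk-test-123")
```

The child prints only the key's length, which confirms it received the value without echoing it.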
Contributors
- @kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
- @xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth
Full changelog
https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md
Migration from 0.1.x
0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.
v0.1.12
What's Changed
- add crag benchmark by @danielfang001 in #12
- fix: remove binary code by @kk-xuhj in #13
- Fix/crag by @xdotli in #14
- Feat/hot load by @xdotli in #16
- Docs/readme by @kk-xuhj in #17
- docs: udpate readme by @kk-xuhj in #18
- Fix/demo agents by @kk-xuhj in #19
New Contributors
- @danielfang001 made their first contribution in #12
Full Changelog: v0.1.8...v0.1.12
v0.1.8
What's Changed
- Feat/v0.1.5 by @kk-xuhj in #1
- Feat/new_interface_for_benchmark by @kk-xuhj in #2
- add swebench by @kk-xuhj in #3
- Readme suggestion by @tom-doerr in #4
- Feat/bff integrate by @kk-xuhj in #6
- Docs/better docs by @kk-xuhj in #7
- feat: add MMLU-PRO by @kk-xuhj in #9
- Feat/type_checking by @kk-xuhj in #10
New Contributors
- @kk-xuhj made their first contribution in #1
- @tom-doerr made their first contribution in #4
Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8