Releases · benchflow-ai/benchflow

20 May 06:55

xdotli

v0.4.0

748301f

v0.4.0 — Rollout/Sandbox architecture release Latest


What changed

BenchFlow v0.4.0 is the architecture release. It is much larger than the hosted-environment adapter alone: the release moves BenchFlow onto the Rollout/Sandbox core, removes the old Harbor-centered framing, adds composable rewards and external adapters, refreshes the CLI and docs, and closes a set of regressions found through dogfooding while preparing the v0.4 boundary.

Compared with v0.3.3, this release changes 210 files with roughly 18k lines added and 6k removed.

Install

pip install benchflow==0.4.0

PyPI is published and verified as benchflow==0.4.0.

Core architecture

Makes Rollout the canonical execution unit and Evaluation the canonical batch/eval accounting path. Legacy names such as Trial, TrialConfig, Job, and JobConfig remain as compatibility aliases, but docs and examples are now Rollout-first.
Adds a unified Scene/Role/Turn type layer with per-role configuration, parallel_group, and cleaner role/scene modeling.
Introduces the BenchFlow-native Sandbox and ImageBuilder protocols, with Docker, Daytona, and Modal implementations. Harbor is no longer a core dependency or the conceptual center of the runtime.
Consolidates the public API and module layout around benchflow.rollout, benchflow.evaluation, benchflow.sandbox, benchflow.task, benchflow.traces, and provider/agent utility modules.
Adds reusable task path/config/verifier helpers and trace import/generation infrastructure.
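
One common way to keep legacy names working is to bind them to the new class, optionally with a deprecation warning. The sketch below illustrates that general pattern only: the `Rollout` class body and the `legacy_alias` helper are hypothetical stand-ins, and whether BenchFlow's aliases emit a warning is an assumption, not something the release notes state.

```python
import warnings


class Rollout:
    """Hypothetical stand-in for the canonical execution unit."""

    def __init__(self, task: str) -> None:
        self.task = task


def legacy_alias(new_cls: type, old_name: str) -> type:
    """Return a subclass that behaves like new_cls but warns when constructed."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            f"{old_name} is a compatibility alias; use {new_cls.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        new_cls.__init__(self, *args, **kwargs)

    return type(old_name, (new_cls,), {"__init__": __init__})


# Old code that still constructs Trial(...) keeps working and gets a Rollout instance.
Trial = legacy_alias(Rollout, "Trial")
```

Because the alias is a subclass, `isinstance(Trial("demo"), Rollout)` holds, so code written against the new name accepts objects created through the old one.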

Public PRs: #274, #294. Refactor sub-PRs: #261, #262, #268.

Rewards and verifiers

Adds a composable rewards package with Rubric, RewardFunc, RewardEvent, built-in reward functions, rubric config loading, and file-reader helpers.
Adds first-class LLM-as-judge verifier support, including dense reward events and rubric-style judging flows.
Fixes reward/metrics edge cases such as None rewards and non-finite ORS JSON values.
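
The None and non-finite cases matter because Python's `json` module emits `NaN` and `Infinity` tokens by default, and those are not valid JSON. A minimal sketch of one way to guard against this follows; `sanitize_reward` and `dump_rewards` are hypothetical helper names, not BenchFlow functions.

```python
import json
import math
from typing import Optional


def sanitize_reward(value: Optional[float]) -> Optional[float]:
    """Map None, NaN, and +/-inf to None so the result serializes as JSON null."""
    if value is None:
        return None
    number = float(value)
    return number if math.isfinite(number) else None


def dump_rewards(rewards: dict) -> str:
    """Serialize rewards strictly; allow_nan=False rejects NaN/Infinity tokens."""
    cleaned = {name: sanitize_reward(v) for name, v in rewards.items()}
    return json.dumps(cleaned, allow_nan=False, sort_keys=True)


print(dump_rewards({"a": 1.0, "b": float("nan"), "c": None}))
# {"a": 1.0, "b": null, "c": null}
```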

Public PRs: #274, #277, #294. Refactor sub-PR: #266.

External adapters and hosted environments

Adds external framework adapters for Inspect AI and OpenAI Reinforcement Store / ORS-style JSON.
Adds first-class hosted environment support for PrimeIntellect / Verifiers:
- bench eval create --source-env ...
- bench environment list --hub primeintellect
- bench environment show
- bench environment inspect
Hosted environments keep native identity (env_uid, hub_url) and provider-owned harness/sandbox behavior instead of being treated as BenchFlow task directories or compatibility shims.
Hosted-env runs install the versioned provider package into an isolated venv, run through vf-eval, and record logs, command, reward, tool calls, provider errors, and metadata.

Public PRs: #274, #290, #294. Refactor sub-PR: #271.

CLI and docs

Promotes bench eval create to the first-class execution command and deprecates the old bench run style of invocation.
Standardizes on --sandbox terminology and removes short one-letter flags in favor of explicit long flags.
Refreshes README, concepts, getting-started, running-benchmarks, Python API, CLI reference, task-authoring, skill-eval, and integration-test docs for the v0.4 surface.
Removes stale Trial/Harbor migration language from the docs and examples.
Adds the skills/citation-management eval asset and a v0.4 skill-eval report.

Public PRs: #274, #281, #282, #284, #294.

Agents, auth, and sandbox hardening

Adds Codex subscription/access-token auth support in Daytona and native codex-acp flows:
- auto-inherits CODEX_ACCESS_TOKEN / CODEX_API_KEY
- maps CODEX_API_KEY to OPENAI_API_KEY for native Codex auth writing
- keeps subscription/auth-token paths out of custom OpenAI-compatible endpoints
Auto-loads .env for CLI execution so users do not need set -a && source .env && set +a for common provider variables.
Improves provider/base-url forwarding for agent containers and custom endpoints.
Fixes skill double-deploy behavior for Dockerfile-injected tasks.
Hardens shell usage in scene/snapshot helpers with quoted paths and path traversal validation.
Fixes Modal/Daytona/runtime edge cases found in dogfood, including Modal optional dependencies, sandbox-user normalization, remote repo helper imports, and eval-list/skill-eval summary handling.
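
The shell hardening item above combines two standard techniques: quoting every path that reaches a shell, and rejecting paths that resolve outside an allowed root. A rough sketch of both, assuming hypothetical helper names (`safe_join`, `copy_command`) that are not taken from the BenchFlow codebase:

```python
import shlex
from pathlib import Path


def safe_join(root: str, user_path: str) -> Path:
    """Resolve user_path under root and reject anything that escapes it."""
    base = Path(root).resolve()
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {user_path!r}") from None
    return candidate


def copy_command(root: str, user_path: str, dest: str) -> str:
    """Build a shell command string with every path quoted."""
    src = safe_join(root, user_path)
    return f"cp -r {shlex.quote(str(src))} {shlex.quote(dest)}"
```

With this shape, `safe_join(root, "../etc/passwd")` raises `ValueError`, and a destination such as `/tmp/out dir` reaches the shell as a single quoted argument.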

Public PRs: #285, #286, #294, #296. Original hardening/fix PRs: #230, #242.

Benchmarks and integrations

Carries forward the Harvey LAB adapter/converter work and aligns benchmark runners with the v0.4 Rollout/Sandbox API.
Updates ProgramBench integration to the new source layout and v0.4 runner conventions.
Adds integration-suite release readiness checks and adapter evidence tooling.
Updates conformance scripts and example scripts for the new auth, sandbox, and CLI conventions.

Public PRs: #239, #237, #274, #294, #296.

Validation

Release artifact validation from the clean v0.4.0 tag worktree:

uv build
uv run --extra dev python -m pytest tests/test_hosted_env.py tests/test_reexport.py tests/test_yaml_config.py
uv run ruff check .
uv run ty check src/

Result: build succeeded, focused release tests passed (31 passed), ruff passed, and ty passed.

GitHub Actions on main for the merge commit passed.

Hosted PrimeIntellect / Verifiers smoke passed with explicit provider sampling args:

uv run bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite \
  --source-env-sampling-arg reasoning_effort=minimal

Result: reward 1.0, tool calls 2, no Verifiers error.

A smoke without reasoning_effort=minimal still completed the harness path but scored 0.0; provider/model-specific sampling options are now explicit and should be passed when needed.
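
For scripted runs, the smoke command above can be assembled programmatically so that sampling options are always passed explicitly. The sketch below uses only the flags shown in that command; the `build_eval_command` helper is hypothetical, and passing one flag occurrence per key=value pair is an assumption about the CLI.

```python
from typing import Dict, List, Optional


def build_eval_command(
    source_env: str,
    version: str,
    agent: str,
    model: str,
    env_args: Optional[Dict[str, str]] = None,
    sampling_args: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble argv for `bench eval create` using the flags shown above."""
    argv = [
        "uv", "run", "bench", "eval", "create",
        "--source-env", source_env,
        "--source-env-version", version,
    ]
    # Assumption: each key=value pair gets its own flag occurrence.
    for key, value in (env_args or {}).items():
        argv += ["--source-env-arg", f"{key}={value}"]
    argv += ["--agent", agent, "--model", model]
    for key, value in (sampling_args or {}).items():
        argv += ["--source-env-sampling-arg", f"{key}={value}"]
    return argv


argv = build_eval_command(
    "primeintellect/general-agent",
    "0.1.1",
    agent="gemini",
    model="google/gemini-2.5-flash-lite",
    env_args={"task": "calendar_scheduling_t0"},
    sampling_args={"reasoning_effort": "minimal"},
)
```

The resulting list can be handed to `subprocess.run(argv, check=True)` without going through a shell.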

Merged PRs

#274 — v0.4 architecture consolidation: unified types, Rollout execution path, Sandbox protocol, rewards, agent capabilities, Inspect/ORS adapters, docs cleanup. Consolidates #261, #262, #265, #266, #268, and #271.
#285 — backports skills double-deploy and shell-injection hardening from #230 and #242.
#294 — v0.4 refactor integration on main: module consolidation, package extras, .env loading, trace import, task helpers, docs, skill eval assets, sandbox backends.
#296 — Codex subscription/access-token auth support in Daytona and native Codex flows.
#290 — hosted PrimeIntellect / Verifiers environment source adapter and bench environment hosted-hub commands.

Compatibility notes

The old Trial/Job names are still available as aliases, but new code should use Rollout/Evaluation terminology.
--sandbox is the canonical sandbox flag.
Short CLI flags were removed; use explicit long flags.
Hosted environments are provider-owned harnesses. BenchFlow records and orchestrates them, but --sandbox remains for native BenchFlow task sources.


25 Apr 10:55

xdotli

v0.3.2

23b4de4

v0.3.2 — BaseUser, verifier hardening, DinD compose

Highlights

BaseUser progressive-disclosure abstraction (#194): a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See docs/progressive-disclosure.md.
Per-task [verifier.hardening] opt-outs in task.toml (#194): tasks with legitimate conftest.py setups (qutebrowser-style) can opt out of specific cleanup steps. Achieves a 5/5 SWE-bench Pro oracle score on the hardened verifier.
DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
--rootdir=/app in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.

Fixes

cfg.agent_env reaches connect_as() (#191, closes #190): YAML-supplied provider creds now reach the agent.
DinD env-file path mismatch (#198): shlex.join() was quoting $$ literally so written/read paths diverged; switched to uuid.uuid4() for unique paths.
OpenHands sandbox launch + ACP CLI path (#182).
Stop copying root tool installs into sandbox home (#181, closes #178).
sandbox_setup_timeout wired through configs (#180).
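
The DinD env-file fix above turns on a quoting detail: `shlex.join()` single-quotes any argument containing `$`, so the shell never expands `$$` to its PID. The snippet below is a small illustration of that behavior and of the UUID alternative; the `/tmp/env.*` paths and the `tee`/`cat` commands are made up for the example and are not the paths BenchFlow uses.

```python
import shlex
import uuid

# shlex.join quotes every argument, so "$$" stays a literal string
# instead of being expanded to the shell's PID.
quoted = shlex.join(["cat", "/tmp/env.$$"])
print(quoted)  # cat '/tmp/env.$$'

# A UUID is fixed on the Python side, so the writer and the reader
# refer to exactly the same path regardless of shell quoting.
env_path = f"/tmp/env.{uuid.uuid4().hex}"
write_cmd = shlex.join(["tee", env_path])
read_cmd = shlex.join(["cat", env_path])
```

If one command string is built with `shlex.join()` and another is written by hand with an unquoted `$$`, the two end up naming different files, which is consistent with the divergence described in the fix.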

Chores

Repo-wide ruff lint debt cleanup (#197): 126 errors → 0.
Docs: uv tool install (#176).

SWE-bench Pro validation

Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.

Install

pip install benchflow==0.3.2


16 Apr 00:14

xdotli

v0.2.3

0e5c9e9

v0.2.3 — verifier hardening follow-ups

Added

benchmarks/tb2_multiturn-claude-haiku45.yaml — shipped config for the README's TB2 multi-turn Claude result.
Daytona resource clamping via BENCHFLOW_DAYTONA_MAX_CPUS / MAX_MEMORY_MB.
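
As a rough illustration of what clamping could look like, the sketch below caps a requested CPU count at the value of BENCHFLOW_DAYTONA_MAX_CPUS. The `clamp_cpus` helper is hypothetical, and treating the variable as an upper bound applied with `min` is an assumption about the semantics.

```python
import os


def clamp_cpus(requested: int, environ=os.environ) -> int:
    """Cap a CPU request at BENCHFLOW_DAYTONA_MAX_CPUS when the variable is set."""
    raw = environ.get("BENCHFLOW_DAYTONA_MAX_CPUS")
    if not raw:
        return requested  # no cap configured
    return min(requested, int(raw))


print(clamp_cpus(8, {"BENCHFLOW_DAYTONA_MAX_CPUS": "4"}))  # 4
```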

Changed

Renamed skillsbench-claude-glm5.yaml → skillsbench-claude-glm51.yaml to match the model ID.
Corrected the codex --login instructions in docs/getting-started.md.
Restricted sdist build to src/, tests/, and metadata.

Fixed

Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
Preserve trusted verifier path entries and workspace answer files.
Redirect oracle output to container log.
Align YAML path resolution to config file location.


14 Apr 18:49

xdotli

v0.2.2

540b011

v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks

Added

Sandbox hardening tiers 1–4 — layered defense blocking F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected conftest.py / .pth / sitecustomize.py files outside /tests plus *.py drops in /tmp and /var/tmp before every verifier run.
labs/reward-hack-matrix — end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in sweep_0.2.0_vs_0.2.2.json. Includes per-trial timeout and a long-lived worker pool that keeps local RAM ~1 GB regardless of trial count.
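
To make the filesystem scrub concrete, here is a minimal sketch of the first half of that step: deleting injected hook files everywhere except a trusted tests directory. The `scrub_injected` function is hypothetical and simplified; it does not cover the `*.py` cleanup in /tmp and /var/tmp, and it is not BenchFlow's implementation.

```python
from pathlib import Path
from typing import List

# File names the release notes list as injection vectors.
INJECTED_NAMES = {"conftest.py", "sitecustomize.py"}


def scrub_injected(root: Path, keep: Path) -> List[Path]:
    """Delete injected hook files under root, skipping everything under keep."""
    removed = []
    # sorted() materializes the listing first, so deleting while looping is safe.
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if keep in path.parents:
            continue  # files inside the trusted tests directory survive
        if path.name in INJECTED_NAMES or path.suffix == ".pth":
            path.unlink()
            removed.append(path)
    return removed
```

Running this before every verifier invocation means an agent cannot leave behind a `conftest.py` or `.pth` file that pytest or the interpreter would load automatically.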

Fixed

Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
PYTHONHOME="" crashing Py_Initialize — empty value is NOT equivalent to unset; dropped from VERIFIER_ENV.
PYTHONSAFEPATH=1 breaking matplotlib setupext.py imports — dropped from VERIFIER_ENV.
pytest_plugins AttributeError during hardened verify — guarded with getattr(...).
matplotlib LFS EOVERFLOW on qhull build artifacts — replaced rmtree + copytree fallback with shutil.copytree(dirs_exist_ok=True) merge-copy in harden_before_verify so inert overlay stragglers no longer block workspace restore.
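
The merge-copy mentioned in the last fix relies on `shutil.copytree(..., dirs_exist_ok=True)`, available since Python 3.8. A small sketch of the idea, with a hypothetical `restore_workspace` wrapper rather than the actual harden_before_verify code:

```python
import shutil
from pathlib import Path


def restore_workspace(snapshot: Path, workspace: Path) -> None:
    """Merge-copy snapshot over workspace without deleting workspace first."""
    # dirs_exist_ok=True lets copytree write into an existing tree:
    # snapshot files overwrite same-named files, extra files are left alone.
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)
```

Since nothing is removed up front, a leftover file that cannot be deleted no longer aborts the restore; it simply stays in place while the snapshot contents are written over the rest.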

Results

BenchJack-shaped exploit success rate on 666 real tasks:

| benchmark | tasks | 0.2.0 exploited | 0.2.2 exploited | Δ |
|---|---|---|---|---|
| `skillsbench` | 77 | 16 (20.8%) | 0 (0%) | −20.8 pp |
| `swebench-verified` | 500 | 119 (23.8%) | 1 (0.2%)¹ | −23.6 pp |
| `terminal-bench-2` | 89 | 82 (92.1%) | 0 (0%) | −92.1 pp |
| total | 666 | 217 (32.6%) | 1 (0.15%) | −32.4 pp |

¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.
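
The totals row can be re-derived from the per-benchmark counts. The short check below copies the numbers from the table above and recomputes the aggregate rates and the percentage-point change.

```python
# (tasks, exploited on 0.2.0, exploited on 0.2.2), copied from the table above.
results = {
    "skillsbench": (77, 16, 0),
    "swebench-verified": (500, 119, 1),
    "terminal-bench-2": (89, 82, 0),
}


def rate(hits: int, tasks: int) -> float:
    """Exploit success rate as a percentage."""
    return 100.0 * hits / tasks


total_tasks = sum(t for t, _, _ in results.values())
total_old = sum(o for _, o, _ in results.values())
total_new = sum(n for _, _, n in results.values())

old_rate = rate(total_old, total_tasks)
new_rate = rate(total_new, total_tasks)
delta_pp = new_rate - old_rate  # difference in percentage points
print(f"{old_rate:.1f}% -> {new_rate:.2f}% ({delta_pp:+.1f} pp)")
# 32.6% -> 0.15% (-32.4 pp)
```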

Install: pip install benchflow==0.2.2

Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md


14 Apr 18:49

xdotli

v0.2.1

27b5139

v0.2.1 — Sandbox hardening on by default

Added

Sandbox hardening on by default — sandbox_user now defaults to "agent" (was None/root). Blocks conftest-hook and answer-lookup exploit patterns.
Path lockdown — new sandbox_locked_paths parameter makes /solution and /tests read-only before the verifier runs, blocking .pth-injection and similar pre-verify tampering.
Verifier failure isolation — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
labs/benchjack-sandbox-hardening — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 .pth-injection) and their defenses.

Fixed

Oracle runs as sandbox_user — oracle agent now respects path lockdown instead of running as root and bypassing it.
Multi-endpoint provider routing — providers with multiple endpoints now route by the agent's native API protocol.
Stale API key shadowing subscription auth — emits a warning when ANTHROPIC_API_KEY env var is present alongside claude login credentials.
pytest ini-injection bypass — closed a verifier hardening edge case.

Changed

Version is now single-sourced via importlib.metadata; no more duplicate version string in __init__.py.
User-facing docs — new docs/ directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to docs/.

Install: pip install benchflow==0.2.1


09 Apr 22:27

xdotli

v0.2.0

01ee396

v0.2.0 — First Public Release

First public release 🎉

Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.

Install

pip install benchflow

Highlights

Multi-agent, multi-provider, multi-auth matrix — 12 end-to-end tested agent × model × provider × auth combinations (docs/tested-agents.md)
Subscription auth — use claude login, codex --login, gemini OAuth directly, no API keys required
Vertex AI support — ADC auth for google-vertex/, anthropic-vertex/, vertex-zai/ prefixed models
Provider registry — data-driven custom LLM endpoints via ProviderConfig. Adding a new provider = single registry entry
vLLM support — benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...
SDK refactor — SDK.run() decomposed into focused private methods; core modules extracted (_models.py, _trajectory.py, _env_setup.py, _scoring.py)
Harbor switched to PyPI — harbor==0.3.0 pin, no more git URL dependency
benchmarks/ directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
benchflow tasks init / tasks check commands for scaffolding and validating new tasks
Oracle agent support — run solution/solve.sh directly for task validation
232 unit tests (up from 66 in 0.1.x)

Benchmark Results

| Benchmark | Agent | Model | Score |
|---|---|---|---|
| TB2 single-turn | `codex-acp` | GPT-5.4 | 69.7% (62/89) |
| TB2 single-turn | `claude-agent-acp` | Sonnet 4.6 | 58.4% (52/89) |
| TB2 multi-turn | `codex-acp` | GPT-5.4 | 62.9% (56/89) |
| TB2 multi-turn | `claude-agent-acp` | Haiku 4.5 | 37.1% (33/89) |
| SkillsBench | `codex-acp` | GPT-5.4 | 37.2% (32/86) |

Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).

Security fixes

API keys no longer leak in ps aux
ADC credentials fixed for sandbox_user setups (#111)
Daytona sandbox orphan cleanup with --max-age filter (#102)
litellm upgraded to 1.83.0 for CVE-2026-35030
cryptography upgraded to 46.0.7 for CVE-2026-39892
13 transitive Dependabot alerts resolved

Contributors

@kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
@xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth

Full changelog

https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md

Migration from 0.1.x

0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.


07 Mar 18:04

kirk-xuhj

v0.1.12

c407dfd

v0.1.12

What's Changed

add crag benchmark by @danielfang001 in #12
fix: remove binary code by @kk-xuhj in #13
Fix/crag by @xdotli in #14
Feat/hot load by @xdotli in #16
Docs/readme by @kk-xuhj in #17
docs: udpate readme by @kk-xuhj in #18
Fix/demo agents by @kk-xuhj in #19

New Contributors

@danielfang001 made their first contribution in #12

Full Changelog: v0.1.8...v0.1.12

Contributors

danielfang001, xdotli, and kirk-xuhj


27 Feb 02:36

kirk-xuhj

v0.1.8

a177063

v0.1.8

What's Changed

Feat/v0.1.5 by @kk-xuhj in #1
Feat/new_interface_for_benchmark by @kk-xuhj in #2
add swebench by @kk-xuhj in #3
Readme suggestion by @tom-doerr in #4
Feat/bff integrate by @kk-xuhj in #6
Docs/better docs by @kk-xuhj in #7
feat: add MMLU-PRO by @kk-xuhj in #9
Feat/type_checking by @kk-xuhj in #10

New Contributors

@kk-xuhj made their first contribution in #1
@tom-doerr made their first contribution in #4

Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8

Contributors

tom-doerr and kirk-xuhj
