Releases: benchflow-ai/benchflow

v0.4.0 — Rollout/Sandbox architecture release

20 May 06:55
748301f

What changed

BenchFlow v0.4.0 is the architecture release. It is much larger than the hosted-environment adapter alone: the release moves BenchFlow onto the Rollout/Sandbox core, removes the old Harbor-centered framing, adds composable rewards and external adapters, refreshes CLI/docs, and closes a set of dogfood regressions found while preparing the v0.4 boundary.

Compared with v0.3.3, this release changes 210 files with roughly 18k lines added and 6k removed.

Install

pip install benchflow==0.4.0

PyPI is published and verified as benchflow==0.4.0.

Core architecture

  • Makes Rollout the canonical execution unit and Evaluation the canonical batch/eval accounting path. Legacy names such as Trial, TrialConfig, Job, and JobConfig remain as compatibility aliases, but docs and examples are now Rollout-first.
  • Adds a unified Scene/Role/Turn type layer with per-role configuration, parallel_group, and cleaner role/scene modeling.
  • Introduces the BenchFlow-native Sandbox and ImageBuilder protocols, with Docker, Daytona, and Modal implementations. Harbor is no longer a core dependency or the conceptual center of the runtime.
  • Consolidates the public API and module layout around benchflow.rollout, benchflow.evaluation, benchflow.sandbox, benchflow.task, benchflow.traces, and provider/agent utility modules.
  • Adds reusable task path/config/verifier helpers and trace import/generation infrastructure.

Public PRs: #274, #294. Refactor sub-PRs: #261, #262, #268.

Rewards and verifiers

  • Adds a composable rewards package with Rubric, RewardFunc, RewardEvent, built-in reward functions, rubric config loading, and file-reader helpers.
  • Adds first-class LLM-as-judge verifier support, including dense reward events and rubric-style judging flows.
  • Fixes reward/metrics edge cases such as None rewards and non-finite ORS JSON values.

Public PRs: #274, #277, #294. Refactor sub-PR: #266.
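
The None-reward and non-finite JSON edge cases can be illustrated with a minimal sketch. This is not BenchFlow's implementation: sanitize_reward and dump_strict are hypothetical helper names, and mapping bad values to 0.0 is an assumed policy chosen only for the example.

```python
import json
import math
from typing import Optional


def sanitize_reward(value: Optional[float], default: float = 0.0) -> float:
    """Map None and non-finite rewards (NaN, +/-inf) to a finite default.

    Hypothetical helper; the default of 0.0 is an assumption for this sketch.
    """
    if value is None:
        return default
    number = float(value)
    return number if math.isfinite(number) else default


def dump_strict(payload: dict) -> str:
    """Serialize to standards-compliant JSON; raises ValueError on NaN/inf."""
    return json.dumps(payload, allow_nan=False)


rewards = [1.0, None, float("nan"), float("inf"), 0.25]
clean = [sanitize_reward(r) for r in rewards]
encoded = dump_strict({"rewards": clean})
```

Passing allow_nan=False makes the standard json module reject NaN and infinity instead of emitting tokens that strict JSON parsers refuse, so a missed value fails loudly at write time.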

External adapters and hosted environments

  • Adds external framework adapters for Inspect AI and OpenAI Reinforcement Store / ORS-style JSON.
  • Adds first-class hosted environment support for PrimeIntellect / Verifiers:
    • bench eval create --source-env ...
    • bench environment list --hub primeintellect
    • bench environment show
    • bench environment inspect
  • Hosted environments keep native identity (env_uid, hub_url) and provider-owned harness/sandbox behavior instead of being treated as BenchFlow task directories or compatibility shims.
  • Hosted-env runs install the versioned provider package into an isolated venv, run through vf-eval, and record logs, command, reward, tool calls, provider errors, and metadata.

Public PRs: #274, #290, #294. Refactor sub-PR: #271.

CLI and docs

  • Promotes bench eval create as the first-class execution command and deprecates old bench run style flows.
  • Standardizes on --sandbox terminology and removes short one-letter flags in favor of explicit long flags.
  • Refreshes README, concepts, getting-started, running-benchmarks, Python API, CLI reference, task-authoring, skill-eval, and integration-test docs for the v0.4 surface.
  • Removes stale Trial/Harbor migration language from the docs and examples.
  • Adds the skills/citation-management eval asset and a v0.4 skill-eval report.

Public PRs: #274, #281, #282, #284, #294.

Agents, auth, and sandbox hardening

  • Adds Codex subscription/access-token auth support in Daytona and native codex-acp flows:
    • auto-inherits CODEX_ACCESS_TOKEN / CODEX_API_KEY
    • maps CODEX_API_KEY to OPENAI_API_KEY for native Codex auth writing
    • keeps subscription/auth-token paths out of custom OpenAI-compatible endpoints
  • Auto-loads .env for CLI execution so users do not need set -a && source .env && set +a for common provider variables.
  • Improves provider/base-url forwarding for agent containers and custom endpoints.
  • Fixes skill double-deploy behavior for Dockerfile-injected tasks.
  • Hardens shell usage in scene/snapshot helpers with quoted paths and path traversal validation.
  • Fixes Modal/Daytona/runtime edge cases found in dogfood, including Modal optional dependencies, sandbox-user normalization, remote repo helper imports, and eval-list/skill-eval summary handling.

Public PRs: #285, #286, #294, #296. Original hardening/fix PRs: #230, #242.
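
As a rough illustration of the shell hardening described above (quoted paths plus path traversal validation), the following sketch uses only the Python standard library. The function names resolve_inside and copy_command are hypothetical and are not taken from the BenchFlow source.

```python
import shlex
from pathlib import Path


def resolve_inside(root: str, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting paths that escape it."""
    base = Path(root).resolve()
    candidate = (base / relative).resolve()
    # Accept the root itself or anything whose parents include the root.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {relative!r}")
    return candidate


def copy_command(root: str, relative: str, dest: str) -> str:
    """Build a shell command string with every path quoted."""
    source = resolve_inside(root, relative)
    return f"cp -r {shlex.quote(str(source))} {shlex.quote(dest)}"
```

Validation runs before quoting so that a "../" segment or an absolute path is rejected outright, while quoting protects against spaces and shell metacharacters in otherwise valid names.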

Benchmarks and integrations

  • Carries forward the Harvey LAB adapter/converter work and aligns benchmark runners with the v0.4 Rollout/Sandbox API.
  • Updates ProgramBench integration to the new source layout and v0.4 runner conventions.
  • Adds integration-suite release readiness checks and adapter evidence tooling.
  • Updates conformance scripts and example scripts for the new auth, sandbox, and CLI conventions.

Public PRs: #239, #237, #274, #294, #296.

Validation

Release artifact validation from the clean v0.4.0 tag worktree:

uv build
uv run --extra dev python -m pytest tests/test_hosted_env.py tests/test_reexport.py tests/test_yaml_config.py
uv run ruff check .
uv run ty check src/

Result: build succeeded, focused release tests passed (31 passed), ruff passed, and ty passed.

GitHub Actions on main for the merge commit passed.

Hosted PrimeIntellect / Verifiers smoke passed with explicit provider sampling args:

uv run bench eval create \
  --source-env primeintellect/general-agent \
  --source-env-version 0.1.1 \
  --source-env-arg task=calendar_scheduling_t0 \
  --agent gemini \
  --model google/gemini-2.5-flash-lite \
  --source-env-sampling-arg reasoning_effort=minimal

Result: reward 1.0, tool calls 2, no Verifiers error.

A smoke without reasoning_effort=minimal still completed the harness path but scored 0.0; provider/model-specific sampling options are now explicit and should be passed when needed.

Merged PRs

  • #274 — v0.4 architecture consolidation: unified types, Rollout execution path, Sandbox protocol, rewards, agent capabilities, Inspect/ORS adapters, docs cleanup. Consolidates #261, #262, #265, #266, #268, and #271.
  • #285 — backports skills double-deploy and shell-injection hardening from #230 and #242.
  • #294 — v0.4 refactor integration on main: module consolidation, package extras, .env loading, trace import, task helpers, docs, skill eval assets, sandbox backends.
  • #296 — Codex subscription/access-token auth support in Daytona and native Codex flows.
  • #290 — hosted PrimeIntellect / Verifiers environment source adapter and bench environment hosted-hub commands.

Compatibility notes

  • The old Trial/Job names are still available as aliases, but new code should use Rollout/Evaluation terminology.
  • --sandbox is the canonical sandbox flag.
  • Short CLI flags were removed; use explicit long flags.
  • Hosted environments are provider-owned harnesses. BenchFlow records and orchestrates them, but --sandbox remains for native BenchFlow task sources.
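
For readers unfamiliar with compatibility aliases, here is a generic sketch of the pattern. The Rollout class below is a hypothetical stand-in with invented fields; the real BenchFlow classes and the way their aliases are declared may differ.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical stand-in for the canonical execution unit."""
    task: str
    agent: str


# A compatibility alias is a second name bound to the same class, so code
# written against the old name keeps working unchanged.
Trial = Rollout

legacy = Trial(task="demo-task", agent="demo-agent")
```

Because both names refer to one class object, isinstance checks and equality behave identically whichever name the caller imported.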

v0.3.2 — BaseUser, verifier hardening, DinD compose

25 Apr 10:55
23b4de4

Highlights

  • BaseUser progressive-disclosure abstraction (#194): a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See docs/progressive-disclosure.md.
  • Per-task [verifier.hardening] opt-outs in task.toml (#194): tasks with legitimate conftest.py setups (qutebrowser-style) opt out of specific cleanup steps. Achieves 5/5 SWE-bench Pro oracle on hardened verifier.
  • DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
  • --rootdir=/app in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
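
The --rootdir change can be sketched as a small environment helper. PYTEST_ADDOPTS and --rootdir are real pytest features; the with_rootdir function and its merge behavior are assumptions made for illustration, not the code shipped in #194.

```python
import shlex


def with_rootdir(environ: dict, rootdir: str = "/app") -> dict:
    """Return a copy of `environ` whose PYTEST_ADDOPTS pins --rootdir.

    Existing options are preserved, and an explicit --rootdir is left alone.
    """
    env = dict(environ)
    options = shlex.split(env.get("PYTEST_ADDOPTS", ""))
    if not any(opt.startswith("--rootdir") for opt in options):
        options.append(f"--rootdir={rootdir}")
    env["PYTEST_ADDOPTS"] = shlex.join(options)
    return env
```

Pinning the root directory keeps test node IDs relative to the repository root, which matters when the verifier matches expected test names against pytest's reported IDs.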

Fixes

  • cfg.agent_env reaches connect_as() (#191, closes #190): YAML-supplied provider creds now reach the agent.
  • DinD env-file path mismatch (#198): shlex.join() was quoting $$ literally so written/read paths diverged; switched to uuid.uuid4() for unique paths.
  • OpenHands sandbox launch + ACP CLI path (#182).
  • Stop copying root tool installs into sandbox home (#181, closes #178).
  • sandbox_setup_timeout wired through configs (#180).
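
The env-file path mismatch from #198 can be reproduced in a few lines. The /tmp/env prefix and the unique_env_path helper are hypothetical; only the shlex and uuid behavior is taken from the standard library.

```python
import shlex
import uuid

# shlex.join quotes each argument, so "$$" stays a literal string and the
# shell never expands it to a process ID.
quoted = shlex.join(["cat", "/tmp/env.$$"])

# An unquoted command string lets the shell expand "$$", so a writer built
# one way and a reader built the other way end up with different paths.
unquoted = "cat /tmp/env.$$"


def unique_env_path(prefix: str = "/tmp/env") -> str:
    """Pick the unique suffix in Python so both sides share one literal path."""
    return f"{prefix}.{uuid.uuid4().hex}"


path = unique_env_path()
write_cmd = shlex.join(["tee", path])
read_cmd = shlex.join(["cat", path])
```

Generating the suffix once in Python removes any dependence on shell expansion, so quoting can stay strict on both the write and read commands.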

Chores

  • Repo-wide ruff lint debt cleanup (#197): 126 errors → 0.
  • Docs: uv tool install (#176).

SWE-bench Pro validation

Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.

Install

pip install benchflow==0.3.2

v0.2.3 — verifier hardening follow-ups

16 Apr 00:14
0e5c9e9

Added

  • benchmarks/tb2_multiturn-claude-haiku45.yaml — shipped config for the README's TB2 multi-turn Claude result.
  • Daytona resource clamping via BENCHFLOW_DAYTONA_MAX_CPUS / MAX_MEMORY_MB.
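
Resource clamping of this kind usually reduces to taking the minimum of the requested value and a configured cap. The sketch below assumes that behavior; clamp_resource is a hypothetical helper, and only the BENCHFLOW_DAYTONA_MAX_CPUS variable name comes from the release notes.

```python
import os
from typing import Mapping, Optional


def clamp_resource(
    requested: int,
    env_var: str,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Return `requested`, capped by the integer in `env_var` when it is set."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var, "").strip()
    if not raw:
        return requested  # no cap configured, keep the task's request
    return min(requested, int(raw))


# Example with an explicit mapping instead of the real process environment.
caps = {"BENCHFLOW_DAYTONA_MAX_CPUS": "4"}
cpus_large = clamp_resource(8, "BENCHFLOW_DAYTONA_MAX_CPUS", caps)
cpus_small = clamp_resource(2, "BENCHFLOW_DAYTONA_MAX_CPUS", caps)
```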

Changed

  • Renamed skillsbench-claude-glm5.yaml → skillsbench-claude-glm51.yaml to match the model ID.
  • codex --login correction in docs/getting-started.md.
  • Restricted sdist build to src/, tests/, and metadata.

Fixed

  • Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
  • Preserve trusted verifier path entries and workspace answer files.
  • Redirect oracle output to container log.
  • Align YAML path resolution to config file location.

v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks

14 Apr 18:49
540b011

Added

  • Sandbox hardening tiers 1–4 — layered defense blocking F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected conftest.py / .pth / sitecustomize.py files outside /tests plus *.py drops in /tmp and /var/tmp before every verifier run.
  • labs/reward-hack-matrix — end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in sweep_0.2.0_vs_0.2.2.json. Includes per-trial timeout and a long-lived worker pool that keeps local RAM ~1 GB regardless of trial count.

Fixed

  • Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
  • PYTHONHOME="" crashing Py_Initialize — empty value is NOT equivalent to unset; dropped from VERIFIER_ENV.
  • PYTHONSAFEPATH=1 breaking matplotlib setupext.py imports — dropped from VERIFIER_ENV.
  • pytest_plugins AttributeError during hardened verify — guarded with getattr(...).
  • matplotlib LFS EOVERFLOW on qhull build artifacts — replaced rmtree + copytree fallback with shutil.copytree(dirs_exist_ok=True) merge-copy in harden_before_verify so inert overlay stragglers no longer block workspace restore.
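
The merge-copy approach mentioned in the last fix can be demonstrated with shutil.copytree(dirs_exist_ok=True), available since Python 3.8. The restore_workspace name and the directory layout are hypothetical; this is a sketch of the technique, not the body of harden_before_verify.

```python
import shutil
import tempfile
from pathlib import Path


def restore_workspace(snapshot: Path, workspace: Path) -> None:
    """Merge-copy the snapshot over the workspace without deleting it first."""
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot"
    workspace = Path(tmp) / "workspace"
    (snapshot / "src").mkdir(parents=True)
    (snapshot / "src" / "main.py").write_text("print('original')\n")
    (workspace / "src").mkdir(parents=True)
    (workspace / "src" / "main.py").write_text("print('tampered')\n")
    (workspace / "build.log").write_text("straggler\n")

    restore_workspace(snapshot, workspace)

    # Snapshot files overwrite their counterparts; unrelated files remain.
    restored = (workspace / "src" / "main.py").read_text()
    straggler_kept = (workspace / "build.log").exists()
```

Because nothing is deleted up front, a file that cannot be removed no longer aborts the restore; the trade-off is that files absent from the snapshot stay in place.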

Results

BenchJack-shaped exploit success rate on 666 real tasks:

benchmark         | tasks | 0.2.0 EXPLT | 0.2.2 EXPLT | Δ
skillsbench       | 77    | 16 (20.8%)  | 0 (0%)      | −20.8 pp
swebench-verified | 500   | 119 (23.8%) | 1 (0.2%)¹   | −23.6 pp
terminal-bench-2  | 89    | 82 (92.1%)  | 0 (0%)      | −92.1 pp
total             | 666   | 217 (32.6%) | 1 (0.15%)   | −32.4 pp

¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.
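
The totals row can be rechecked from the per-benchmark counts. The numbers below are copied from the results table above; the variable names are only for this sketch.

```python
# (tasks, exploits on 0.2.0, exploits on 0.2.2) per benchmark
results = {
    "skillsbench": (77, 16, 0),
    "swebench-verified": (500, 119, 1),
    "terminal-bench-2": (89, 82, 0),
}


def rate(exploits: int, tasks: int) -> float:
    """Exploit success rate as a percentage."""
    return 100.0 * exploits / tasks


total_tasks = sum(tasks for tasks, _, _ in results.values())
total_before = sum(before for _, before, _ in results.values())
total_after = sum(after for _, _, after in results.values())

rate_before = rate(total_before, total_tasks)
rate_after = rate(total_after, total_tasks)
delta_pp = rate_after - rate_before  # change in percentage points
```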

Install: pip install benchflow==0.2.2

Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md

v0.2.1 — Sandbox hardening on by default

14 Apr 18:49
27b5139

Added

  • Sandbox hardening on by default: sandbox_user now defaults to "agent" (was None/root). Blocks conftest-hook and answer-lookup exploit patterns.
  • Path lockdown — new sandbox_locked_paths parameter makes /solution and /tests read-only before the verifier runs, blocking .pth-injection and similar pre-verify tampering.
  • Verifier failure isolation — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
  • labs/benchjack-sandbox-hardening — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 .pth-injection) and their defenses.
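
Verifier failure isolation can be pictured as two independent error slots on the result record. The RunResult type and run_once function below are hypothetical names for a minimal sketch; BenchFlow's actual result model is not shown in these notes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    reward: Optional[float] = None
    agent_error: Optional[str] = None
    verifier_error: Optional[str] = None


def run_once(agent: Callable[[], None], verifier: Callable[[], float]) -> RunResult:
    """Run the agent, then the verifier, recording each failure separately."""
    result = RunResult()
    try:
        agent()
    except Exception as exc:
        result.agent_error = f"{type(exc).__name__}: {exc}"
    try:
        result.reward = verifier()
    except Exception as exc:
        # A verifier crash is recorded here and never overwrites agent_error.
        result.verifier_error = f"{type(exc).__name__}: {exc}"
    return result
```

Keeping the two fields apart lets a report distinguish "the agent failed" from "the agent finished but scoring crashed", which call for different follow-up.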

Fixed

  • Oracle runs as sandbox_user — oracle agent now respects path lockdown instead of running as root and bypassing it.
  • Multi-endpoint provider routing — providers with multiple endpoints now route by the agent's native API protocol.
  • Stale API key shadowing subscription auth — emits a warning when ANTHROPIC_API_KEY env var is present alongside claude login credentials.
  • pytest ini-injection bypass — closed a verifier hardening edge case.

Changed

  • Version is now single-sourced via importlib.metadata; no more duplicate version string in __init__.py.
  • User-facing docs — new docs/ directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to docs/.

Install: pip install benchflow==0.2.1

v0.2.0 — First Public Release

09 Apr 22:27
01ee396

First public release 🎉

Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.

Install

pip install benchflow

Highlights

  • Multi-agent, multi-provider, multi-auth matrix — 12 end-to-end tested agent × model × provider × auth combinations (docs/tested-agents.md)
  • Subscription auth — use claude login, codex --login, gemini OAuth directly, no API keys required
  • Vertex AI support — ADC auth for google-vertex/, anthropic-vertex/, vertex-zai/ prefixed models
  • Provider registry — data-driven custom LLM endpoints via ProviderConfig. Adding a new provider = single registry entry
  • vLLM support: benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...
  • SDK refactor: SDK.run() decomposed into focused private methods; core modules extracted (_models.py, _trajectory.py, _env_setup.py, _scoring.py)
  • Harbor switched to PyPI: harbor==0.3.0 pin, no more git URL dependency
  • benchmarks/ directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
  • benchflow tasks init / tasks check commands for scaffolding and validating new tasks
  • Oracle agent support — run solution/solve.sh directly for task validation
  • 232 unit tests (up from 66 in 0.1.x)

Benchmark Results

Benchmark       | Agent            | Model      | Score
TB2 single-turn | codex-acp        | GPT-5.4    | 69.7% (62/89)
TB2 single-turn | claude-agent-acp | Sonnet 4.6 | 58.4% (52/89)
TB2 multi-turn  | codex-acp        | GPT-5.4    | 62.9% (56/89)
TB2 multi-turn  | claude-agent-acp | Haiku 4.5  | 37.1% (33/89)
SkillsBench     | codex-acp        | GPT-5.4    | 37.2% (32/86)

Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).

Security fixes

  • API keys no longer leak in ps aux
  • ADC credentials fixed for sandbox_user setups (#111)
  • Daytona sandbox orphan cleanup with --max-age filter (#102)
  • litellm upgraded to 1.83.0 for CVE-2026-35030
  • cryptography upgraded to 46.0.7 for CVE-2026-39892
  • 13 transitive Dependabot alerts resolved

Contributors

  • @kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
  • @xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth

Full changelog

https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md

Migration from 0.1.x

0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.

v0.1.12

07 Mar 18:04
c407dfd

What's Changed

  • add crag benchmark by @danielfang001 in #12
  • fix: remove binary code by @kk-xuhj in #13
  • Fix/crag by @xdotli in #14
  • Feat/hot load by @xdotli in #16
  • Docs/readme by @kk-xuhj in #17
  • docs: udpate readme by @kk-xuhj in #18
  • Fix/demo agents by @kk-xuhj in #19

Full Changelog: v0.1.8...v0.1.12

v0.1.8

27 Feb 02:36
a177063

What's Changed

  • Feat/v0.1.5 by @kk-xuhj in #1
  • Feat/new_interface_for_benchmark by @kk-xuhj in #2
  • add swebench by @kk-xuhj in #3
  • Readme suggestion by @tom-doerr in #4
  • Feat/bff integrate by @kk-xuhj in #6
  • Docs/better docs by @kk-xuhj in #7
  • feat: add MMLU-PRO by @kk-xuhj in #9
  • Feat/type_checking by @kk-xuhj in #10

New Contributors

  • @kk-xuhj made their first contribution in #1
  • @tom-doerr made their first contribution in #4

Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8