Releases: PrimeIntellect-ai/verifiers
v0.1.15.dev8
Verifiers v0.1.15.dev8 Release Notes
Date: 05/22/2026
Highlights since v0.1.15.dev7
- v1 Taskset/Harness loading is stricter and more package-native. v1 configs now keep taskset and harness child config objects, package component loading is explicit, and the BYO Harness docs, generated agent guidance, and create-environments skill describe reusable tasksets and harnesses as the golden path.
- Eval configuration defaults are cleaner. TOML-backed eval runs save results by default, sampling sections forward reasoning kwargs to clients, and stale pre-eval install docs were removed.
- Rollout and client behavior is more robust. SWE rollouts with sandbox errors skip scoring, save deltas reset cleanly on non-monotonic trajectories, renderer overlong-prompt failures normalize to `vf.OverlongPromptError`, and routed expert responses can carry sidecar data.
- Developer tooling is tighter. The eval viewer TUI was modernized, the legacy `vf-tui` entrypoint was retired, renderer client error handling was tightened, and the router replay performance experiment was reverted.
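The error normalization mentioned above can be pictured with a minimal sketch. The two exception classes below are local stand-ins for `renderers.OverlongPromptError` and `vf.OverlongPromptError`, and `render_prompt`, `too_long`, and `outcome` are hypothetical helper names; this is not the library's actual code path.

```python
class RendererOverlongPromptError(Exception):
    """Stand-in for renderers.OverlongPromptError (hypothetical local class)."""


class OverlongPromptError(Exception):
    """Stand-in for vf.OverlongPromptError (hypothetical local class)."""


def render_prompt(render, messages):
    """Call a renderer and translate its overlong-prompt failure.

    `render` is any callable. Only the renderer-specific error is translated,
    so unrelated exceptions propagate unchanged.
    """
    try:
        return render(messages)
    except RendererOverlongPromptError as exc:
        # Chain with `from exc` so the original traceback stays attached.
        raise OverlongPromptError(str(exc)) from exc


def too_long(_messages):
    """Fake renderer that always reports an overlong prompt."""
    raise RendererOverlongPromptError("prompt exceeds context window")


def outcome(render, messages):
    """Return "ok" or the name of the error type a caller would observe."""
    try:
        render_prompt(render, messages)
        return "ok"
    except OverlongPromptError:
        return "OverlongPromptError"
```

Callers then only need to handle one error type, regardless of which layer detected the overflow.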
Changes included in v0.1.15.dev8 (since v0.1.15.dev7)
v1 Taskset/Harness
- Rework v1 harness and taskset config classes (#1392)
- Add strict package loaders for v1 tasksets and harnesses (#1429)
Evals and rollouts
- Modernize eval viewer TUI (#1393)
- Forward reasoning kwargs from eval TOML sampling sections (#1404)
- Skip SWE sandbox scoring for errored rollouts (#1412)
- Support routed experts response sidecar (#1423)
- Remove stale pre-eval `prime env install` docs (#1434)
- Default TOML eval runs to save results (#1435)
Runtime and client fixes
- chore: bump renderers to 0.1.8.dev2 (supersedes #1366) (#1395)
- fix(save_utils): reset delta baseline on non-monotonic trajectories (#1400)
- fix(renderer-client): translate renderers.OverlongPromptError into vf.OverlongPromptError and require renderers>=0.1.8.dev4 (#1408)
- [Router Replay]: Improve performance by removing Pydantic validation (#1394)
- Revert "[Router Replay]: Improve performance by removing Pydantic validation" (#1422)
Tooling and docs
- Retire vf-tui entrypoint (#1398)
Full Changelog: v0.1.15.dev7...v0.1.15.dev8
v0.1.15.dev7
Verifiers v0.1.15.dev7 Release Notes
Date: 05/15/2026
Highlights since v0.1.15.dev6
- PyPI release pipeline moved to Trusted Publishing. The `Tag and Release` workflow now publishes via PyPA OIDC, gated on the `pypi-prod` GitHub environment, with build, publish, and GitHub-release split into separate jobs so build-time code never sees the publishing identity. The long-lived `PYPI_TOKEN` is no longer used to release `verifiers`.
- Per-eval naming for duplicate environment runs. `prime eval run` now accepts per-eval name labels so duplicate environments produce distinct, disambiguated runs.
Changes included in v0.1.15.dev7 (since v0.1.15.dev6)
Release infrastructure
- improvement: switch verifiers PyPI publish to OIDC trusted publishing (#1386)
Evals
- Add per-eval names for duplicate environment runs (#1384)
- Document eval name labels in skill (#1388)
Docs
- Add AGENTS.local guidance to Lab workspace docs (#1383)
Full Changelog: v0.1.15.dev6...v0.1.15.dev7
v0.1.15.dev6
Verifiers v0.1.15.dev6 Release Notes
Date: 05/14/2026
Highlights since v0.1.15.dev5
- Routed-experts token metadata is handled as an opaque payload. OpenAI-compatible chat and text-completions clients now preserve routed-experts data without expanding it through user-facing message types, and token parsing truncates that sidecar alongside prompt/completion tokens when a rollout is clipped to `max_seq_len`.
- v1 `EnvConfig` typing is stricter. The v1 loader envelope, config docs, generated environment templates, examples, and tests now keep environment-level configuration narrow while routing taskset- and harness-owned fields through their concrete config classes.
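As an illustration of the sidecar truncation described above, here is a minimal sketch. It assumes the routed-experts sidecar can be treated as one opaque entry per token (prompt first, then completion); `TokenizedRollout` and `clip` are hypothetical names, not the library's types.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TokenizedRollout:
    """Hypothetical container; field names are illustrative only."""

    prompt_ids: List[int]
    completion_ids: List[int]
    # Assumption: one opaque entry per token (prompt, then completion), or None.
    routed_experts: Optional[List[Any]] = None


def clip(rollout: TokenizedRollout, max_seq_len: int) -> TokenizedRollout:
    """Clip prompt + completion to max_seq_len and keep the sidecar aligned."""
    prompt = rollout.prompt_ids[:max_seq_len]
    remaining = max_seq_len - len(prompt)
    completion = rollout.completion_ids[:remaining]
    sidecar = rollout.routed_experts
    if sidecar is not None:
        # Truncate the sidecar to exactly the number of tokens that survive.
        sidecar = sidecar[: len(prompt) + len(completion)]
    return TokenizedRollout(prompt, completion, sidecar)
```

The point of truncating in the same step is that the sidecar never ends up longer than the token sequence it annotates.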
Changes included in v0.1.15.dev6 (since v0.1.15.dev5)
Tokens and clients
- Consume split routed-experts payloads in OpenAI-compatible clients (#1363)
- Move routed-experts helpers into response utilities (#1373)
v1 config
- Tighten v1 `EnvConfig` typing (#1371)
Full Changelog: v0.1.15.dev5...v0.1.15.dev6
v0.1.15.dev5
Verifiers v0.1.15.dev5 Release Notes
Date: 05/14/2026
Highlights since v0.1.15.dev4
- Terminus2 is now a bundled v1 CLI harness. `vf.Terminus2` runs Harbor's Terminus 2 agent behind the v1 harness boundary, is exported from `verifiers.v1` and the root `verifiers` namespace, uses OpenAI-compatible proxy defaults, and captures optional Terminus2 logs as artifacts.
- Harbor sandbox defaults stay explicit. Harbor tasksets and task freezing now preserve only task-provided sandbox settings, including `network_access` only when task config explicitly sets internet access.
Changes included in v0.1.15.dev5 (since v0.1.15.dev4)
v1 harnesses and tasksets
- Add Terminus2 harness (#1356)
Full Changelog: v0.1.15.dev4...v0.1.15.dev5
v0.1.15.dev4
Verifiers v0.1.15.dev4 Release Notes
Date: 05/14/2026
Highlights since v0.1.15.dev3
- v1 config and binding surfaces are stricter and more explicit. v1 tasksets, harnesses, users, toolsets, env args, prompts, tools, and runtime handles now use named Pydantic-backed types and protocols instead of loose object-shaped config. Direct object binding is rejected so private runtime objects stay behind object loaders and explicit binding boundaries.
- v1 rollout, lifecycle, and taskset utilities are consolidated. Shared config, binding, lifecycle, runtime registry, serialization, task freezing, taskset, usage, sandbox, and program plumbing now live in dedicated utility modules. Agent text extraction and rollout extraction stay explicit, and empty completions are handled cleanly in v1 reward paths.
- Environment examples and docs now follow the current v1 contract. BFCL, RLM, subagent, Harbor/OpenCode, OpenEnv, wiki-search, tau2, and related examples have been updated for the current config and message-access patterns. OpenEnv remains optional at import time, OpenCode Harbor defaults are documented, and reusable v1 authoring guidance now lives in the BYO Harness docs.
- Eval runs can show estimated Prime cost. `prime eval run` fetches Prime Inference pricing, records `metadata.cost` when usage and pricing are available, and renders cost in live eval displays, final summaries, saved-results TUI headers, and non-TUI usage output.
- Repo and agent guidance reflects the current contributor workflow. AGENTS, docs, and environment skills now describe the canonical `prime eval run` path, public config expectations, taskset/harness ownership, downstream-consumer checks, and focused validation defaults.
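The "when usage and pricing are available" condition for the cost estimate can be sketched as follows. The per-million-token price units, the parameter names, and the `estimate_cost` helper are assumptions for illustration; the release notes do not specify the pricing schema.

```python
from typing import Optional


def estimate_cost(
    input_tokens: Optional[int],
    output_tokens: Optional[int],
    input_price_per_mtok: Optional[float],
    output_price_per_mtok: Optional[float],
) -> Optional[float]:
    """Return an estimated cost, or None if usage or pricing is missing.

    Assumption: prices are quoted per one million tokens.
    """
    values = (input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok)
    if any(v is None for v in values):
        # Without both usage and pricing there is nothing to record.
        return None
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000
```

Returning `None` rather than `0.0` keeps "unknown cost" distinguishable from "free" in downstream displays.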
Changes included in v0.1.15.dev4 (since v0.1.15.dev3)
v1 environment model
- Tighten v1 config, bindings, and handler typing (#1362)
Evaluation UX
- [codex] Display eval cost from Prime pricing (#1368)
Docs, guidance, and maintenance
- Add repo development best practices to AGENTS, docs, and skills (#1361)
Full Changelog: v0.1.15.dev3...v0.1.15.dev4
v0.1.15.dev3
Verifiers v0.1.15.dev3 Release Notes
Date: 05/13/2026
Highlights since v0.1.15.dev2
- OverlongPromptError no longer surfaces as a spurious interception error. `CliAgentEnv`/`ComposableEnv` rollouts that overflow the model context now report a clean `stop=prompt_too_long` without an accompanying `InterceptionError` in the eval `errors` summary. The interception server's first-error-wins guard now also short-circuits once the rollout loop has finalized via `state["prompt_too_long"]`, so tail-end failures (e.g. `write_eof` to an agent that already closed its transport) can no longer re-attach a spurious `StreamInterrupted` on top of the real stop signal.
- v1 sandbox commands run as background jobs. v1 sandbox program commands and Harbor verifier tests now run through `run_background_job` instead of foreground `execute_command`, preserving configured command/test timeouts as the background-job budget.
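The first-error-wins guard with the added short-circuit can be sketched roughly like this. `RolloutErrorGuard` and its `record` method are hypothetical names; only the `state["prompt_too_long"]` key comes from the notes above, and the real interception server may structure this differently.

```python
class RolloutErrorGuard:
    """Keeps the first error for a rollout and ignores later ones."""

    def __init__(self, state: dict):
        self.state = state
        self.error = None

    def record(self, exc: Exception) -> bool:
        """Try to attach `exc` to the rollout; return True if it was kept."""
        if self.state.get("prompt_too_long"):
            # The rollout already finalized with a clean stop signal, so
            # tail-end failures must not be attached on top of it.
            return False
        if self.error is not None:
            # First error wins: later errors are dropped.
            return False
        self.error = exc
        return True
```

With this shape, a late transport failure after a `prompt_too_long` stop is simply ignored instead of being reported as the rollout's error.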
Changes included in v0.1.15.dev3 (since v0.1.15.dev2)
Fixes and maintenance
- fix: treat OverlongPromptError as stop condition in interception proxy (#1365)
- Use background jobs for v1 sandbox commands (#1364)
Full Changelog: v0.1.15.dev2...v0.1.15.dev3
v0.1.15.dev2
Verifiers v0.1.15.dev2 Release Notes
Date: 05/13/2026
Highlights since v0.1.15.dev1
- OpenCode title-generation bypass. Disables the hidden OpenCode title agent in the legacy, composable, and v1 OpenCode harness configs while keeping the `small_model` pin to avoid falling back to the default small model and hitting rate limits.
Changes included in v0.1.15.dev2 (since v0.1.15.dev1)
Fixes and maintenance
- Disable OpenCode title generation in harness configs (#1359)
Full Changelog: v0.1.15.dev1...v0.1.15.dev2
v0.1.15.dev1
Verifiers v0.1.15.dev1 Release Notes
Date: 05/12/2026
Highlights since v0.1.15.dev0
- Renderer-native multimodal transport. Threads renderer-emitted `multi_modal_data` through rollout generation, trajectory tokens, msgpack transport, and save output, with per-step delta encoding to avoid duplicating image payloads across rollout windows.
- Provider-specific reasoning controls. Maps `reasoning_effort` correctly for Anthropic models across native Anthropic calls and OpenAI-compatible Anthropic routes, including adaptive thinking defaults for newer Claude models.
- Runtime and release hardening. Removes the fixed ZMQ env request timeout, tightens v1 lifecycle handler discovery, and adds BugBot rules that keep release metadata and dependency sources publishable.
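Per-step delta encoding, as mentioned for multimodal payloads, can be illustrated with a small sketch. It assumes each step carries a cumulative list of payloads that only ever extends the previous step's list; `encode_deltas` and `decode_deltas` are hypothetical helpers, not the transport's actual format.

```python
from typing import List


def encode_deltas(cumulative: List[List[bytes]]) -> List[List[bytes]]:
    """Keep only the payloads that are new at each step.

    Assumption: every step's list extends the previous step's list.
    """
    deltas = []
    seen = 0
    for items in cumulative:
        deltas.append(items[seen:])
        seen = len(items)
    return deltas


def decode_deltas(deltas: List[List[bytes]]) -> List[List[bytes]]:
    """Rebuild the cumulative per-step lists from the deltas."""
    out: List[List[bytes]] = []
    acc: List[bytes] = []
    for delta in deltas:
        acc = acc + delta  # new list per step, so earlier steps stay unchanged
        out.append(acc)
    return out
```

Each image payload is then serialized once, at the step where it first appears, rather than once per rollout window.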
Changes included in v0.1.15.dev1 (since v0.1.15.dev0)
Features and enhancements
- feat(renderer-client): thread multimodal sidecar through rollout + transport (#1346)
- Map Anthropic reasoning effort by provider (#1338)
Fixes and maintenance
- Fix ZMQ env timeouts (#1355)
- Fix v1 lifecycle handler discovery (#1347)
- Strengthen BugBot release and dependency rules (#1354)
- Tighten BugBot releasability and dependency rules (#1353)
Full Changelog: v0.1.15.dev0...v0.1.15.dev1
v0.1.15.dev0
Verifiers v0.1.15.dev0 Release Notes
Date: 05/12/2026
Highlights since v0.1.14
- v1 taskset/harness follow-through. Refreshes the docs and examples around the v1 taskset/harness path, completes the `opencode_harbor` migration, adds setup-state compatibility for environments, and makes sandbox worker caps configurable.
- Agent and SWE environment updates. Adds the experimental SWE debug environment, moves OpenCode configuration into reusable harness code, refactors the LangChain Deep Agents example to a v1 Wikispeedia taskset, and removes the direct `prime-sandboxes` dependency from `rlm_swe_v1`.
- Interception, sandbox, and renderer hardening. Requires auth for interception endpoints, surfaces swallowed interception response errors, fixes keepalive configuration, avoids holding sandbox workers while background jobs run, retries transient file-read timeouts, and aligns renderer tool envelopes with the training distribution.
Changes included in v0.1.15.dev0 (since v0.1.14)
Features and enhancements
- feat(composable): render pre-compaction branches via harness hook (#1291)
- Make sandbox worker caps configurable (#1305)
- Support `setup_state` return-state compatibility (#1308)
- Update v1 Taskset/Harness docs and `opencode_harbor` migration (#1309)
- Refactor deep-agents envs to v1 taskset-harness and add Wikispeedia example (#1317)
- Add SWE debug environment (#1306)
- Move OpenCode config into reusable harness package (#1318)
- Remove direct `prime-sandboxes` dependency from `rlm_swe_v1` (#1316)
- Install sandbox fn program packages (#1319)
- opencode: write config under `XDG_CONFIG_HOME` (#1320)
- feat: surface swallowed `response_future` errors in interception proxy (#1196)
- opencode: pin `small_model` to the intercepted provider (#1323)
- Avoid holding sandbox workers with `run_background_job()` (#1328)
- Skip OpenCode install when binary is pre-baked (#1176)
Fixes and maintenance
- Require auth for interception endpoints (#1304)
- fix: mute httpcore/httpx DEBUG when env-worker `json_logging` sets root level (#1234)
- bump interception client size to handle trajectories with images (#1310)
- Fix eval summary for heterogeneous metrics (#1295)
- apt: harden sandbox bootstrap against transient `archive.ubuntu.com` flakes (#1284)
- fix(renderer-client): wrap tools in OpenAI envelope to match training distribution (#1307)
- fixes rlm compaction prompt so we still get tito hit (#1322)
- Fix interception keepalive env override (#1321)
- fix: narrow `send_cancel` `BaseException` catch to `Exception` (#1198)
- Fix Harbor Hub task root handling (#1345)
- Pin OpenCode small model in legacy configs (#1327)
- Remove local training config scaffolding from Verifiers (#1348)
- retry Read file timed out (#1350)
Full Changelog: v0.1.14...v0.1.15.dev0
v0.1.14
Verifiers v0.1.14 Release Notes
Date: 05/07/2026
Highlights since v0.1.13.dev8
- Composable v1 Taskset/Harness API. Adds the `verifiers.v1` authoring surface around serializable `Task`/`State` data, composable `Taskset` and `Harness` objects, and the `vf.Env(taskset, harness)` adapter for existing eval and training workers. The release includes lifecycle decorators, typed config objects, endpoint routing, toolsets, MCP tools, sandbox/program utilities, nested harness support, v1 docs, migration notes, and several v1 example environments, including new OpenAI Agents, LangChain Deep Agents, and DSPy RLM harness examples.
- Consistent v1 environment configuration. Eval and RL/Hosted Training TOML now share the same public projection shape through `[*.args]`, `[*.taskset]`, and `[*.harness]` sections. v1 loaders accept both mapping and model-backed config objects through a common access helper, and strict child config parsing strips loader-local routing keys at the boundary.
- Model-family starter configs. Restructures bundled eval, RL, and GEPA starter configs around model families such as Qwen 3.5, Qwen 3.5 MoE, Nemotron 3, and Llama 3, with setup mirroring the new config set into Lab workspaces.
- New client and rendering paths. Adds an OpenAI Responses API client and a renderer-backed client path for exact token rendering and multi-turn bridge metrics. The renderer implementations now live in the external `renderers` package, exposed from Verifiers through an optional `renderers` extra and client integration. Renderer clients also forward `preserve_all_thinking` and `preserve_thinking_between_tool_calls` flags into the underlying renderer.
- More rollout observability and artifacts. Adds per-turn timing through eval outputs and TUI display, token-id preservation for Nemotron client responses, GEPA system-prompt artifact export plus path-based prompt loading, and Lean guard markers with tamper-aware `LeanRubric` scoring.
- Release and infrastructure hardening. Adds universal locks and a 7-day PyPI freshness cooldown, scopes Hub install freshness filtering to registry packages, skips secret-backed environment tests on fork PRs, points the composable RLM harness at `rlm-harness`, and routes opencode `AGENT_WORKDIR` per rollout.
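The idea of strict child config parsing that strips loader-local routing keys at the boundary can be sketched with standard-library dataclasses. The routing key name `loader`, the `TasksetConfig` fields, and `parse_child` are all hypothetical; the actual v1 config classes are Pydantic-backed and may differ.

```python
from dataclasses import dataclass

# Hypothetical routing key used only to pick a loader, never part of the config.
ROUTING_KEYS = frozenset({"loader"})


@dataclass(frozen=True)
class TasksetConfig:
    """Illustrative child config; field names are assumptions."""

    split: str = "train"
    num_examples: int = -1


def parse_child(raw: dict, config_cls, routing_keys=ROUTING_KEYS):
    """Drop routing keys, then build the strict config object.

    The dataclass constructor rejects unknown keys with TypeError, which
    stands in here for strict model validation.
    """
    payload = {k: v for k, v in raw.items() if k not in routing_keys}
    return config_cls(**payload)
```

Stripping the routing keys before validation lets the child config stay strict about unknown fields without having to know how it was located.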
Changes included in v0.1.14 (since v0.1.13.dev8)
Features and enhancements
- Taskset Harness (v1) (#1277)
- ApiEnv examples for OpenAI Agents, LangChain Deep Agents, and DSPy RLM (#1121)
- Refactor tau2 bench into a taskset-owned v1 environment (#1293)
- Restructure example configs around model families (#1297)
- add openai responses client (#1261)
- Renderer-backed client integration via the external `renderers` package (#1068, #1279, #1282)
- feat(renderer-client): forward preserve_*_thinking config flags (#1298)
- feat: per-turn timing (#1182)
- Add GEPA system prompt export and path-based prompt loading (#1268)
- feat(lean): lean-guard markers + tamper-aware LeanRubric (#1271)
- token id support for Nemotron client responses (#1231)
Fixes and maintenance
- Fix v1 env config projection and typed child loader boundaries (#1294)
- Skip secret-backed environment tests for fork PRs (#1292)
- Scope uv freshness filtering to PyPI for Hub installs (#1286)
- opencode harness: route AGENT_WORKDIR per-rollout instead of baked-in (#1280)
- chore: add 7-day supply chain cooldown via uv exclude-newer (#1274)
- chore: point DEFAULT_RLM_REPO_URL to rlm-harness (#1267)
- Update Lab workspace setup guidance (#1299)
Full Changelog: v0.1.13.dev8...v0.1.14