Releases: PrimeIntellect-ai/verifiers

v0.1.15.dev8

22 May 08:17
a608e00

Verifiers v0.1.15.dev8 Release Notes

Date: 05/22/2026

Highlights since v0.1.15.dev7

  • v1 Taskset/Harness loading is stricter and more package-native. v1 configs now keep taskset and harness child config objects, package component loading is explicit, and the BYO Harness docs, generated agent guidance, and create-environments skill describe reusable tasksets and harnesses as the golden path.
  • Eval configuration defaults are cleaner. TOML-backed eval runs save results by default, sampling sections forward reasoning kwargs to clients, and stale pre-eval install docs were removed.
  • Rollout and client behavior is more robust. SWE rollouts with sandbox errors skip scoring, save deltas reset cleanly on non-monotonic trajectories, renderer overlong-prompt failures normalize to vf.OverlongPromptError, and routed expert responses can carry sidecar data.
  • Developer tooling is tighter. The eval viewer TUI was modernized, the legacy vf-tui entrypoint was retired, renderer client error handling was tightened, and the router replay performance experiment was reverted.
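
The overlong-prompt normalization above follows a common pattern: catch a lower-level error at the client boundary and re-raise it as the error type callers already handle. A minimal sketch of that pattern follows. Both exception classes and both functions are hypothetical stand-ins written for this example, not the actual renderers or verifiers code.

```python
class RendererOverlongPromptError(Exception):
    """Hypothetical stand-in for the error raised by the rendering layer."""


class OverlongPromptError(Exception):
    """Hypothetical stand-in for the framework-level error callers handle."""


def render_prompt(tokens: list[int], max_len: int) -> list[int]:
    # Toy renderer: refuses prompts longer than max_len.
    if len(tokens) > max_len:
        raise RendererOverlongPromptError(
            f"prompt has {len(tokens)} tokens, limit is {max_len}"
        )
    return tokens


def render_normalized(tokens: list[int], max_len: int) -> list[int]:
    # Translate the renderer-specific error into the framework-level one,
    # chaining the original so the root cause stays visible in tracebacks.
    try:
        return render_prompt(tokens, max_len)
    except RendererOverlongPromptError as exc:
        raise OverlongPromptError(str(exc)) from exc
```

Chaining with `from exc` keeps the original exception reachable through `__cause__`, so callers can match on a single error type without losing diagnostic detail.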

Changes included in v0.1.15.dev8 (since v0.1.15.dev7)

v1 Taskset/Harness

  • Rework v1 harness and taskset config classes (#1392)
  • Add strict package loaders for v1 tasksets and harnesses (#1429)

Evals and rollouts

  • Modernize eval viewer TUI (#1393)
  • Forward reasoning kwargs from eval TOML sampling sections (#1404)
  • Skip SWE sandbox scoring for errored rollouts (#1412)
  • Support routed experts response sidecar (#1423)
  • Remove stale pre-eval prime env install docs (#1434)
  • Default TOML eval runs to save results (#1435)

Runtime and client fixes

  • chore: bump renderers to 0.1.8.dev2 (supersedes #1366) (#1395)
  • fix(save_utils): reset delta baseline on non-monotonic trajectories (#1400)
  • fix(renderer-client): translate renderers.OverlongPromptError into vf.OverlongPromptError and require renderers>=0.1.8.dev4 (#1408)
  • [Router Replay]: Improve performance by removing Pydantic validation (#1394)
  • Revert "[Router Replay]: Improve performance by removing Pydantic validation" (#1422)

Tooling and docs

  • Retire vf-tui entrypoint (#1398)

Full Changelog: v0.1.15.dev7...v0.1.15.dev8

v0.1.15.dev7

15 May 05:48
ffec6bc

Verifiers v0.1.15.dev7 Release Notes

Date: 05/15/2026

Highlights since v0.1.15.dev6

  • PyPI release pipeline moved to Trusted Publishing. The Tag and Release workflow now publishes via PyPA OIDC, gated on the pypi-prod GitHub environment, with build, publish, and GitHub-release split into separate jobs so build-time code never sees the publishing identity. The long-lived PYPI_TOKEN is no longer used to release verifiers.
  • Per-eval naming for duplicate environment runs. prime eval run now accepts per-eval name labels so duplicate environments produce distinct, disambiguated runs.

Changes included in v0.1.15.dev7 (since v0.1.15.dev6)

Release infrastructure

  • improvement: switch verifiers PyPI publish to OIDC trusted publishing (#1386)

Evals

  • Add per-eval names for duplicate environment runs (#1384)
  • Document eval name labels in skill (#1388)

Docs

  • Add AGENTS.local guidance to Lab workspace docs (#1383)

Full Changelog: v0.1.15.dev6...v0.1.15.dev7

v0.1.15.dev6

14 May 23:55
c7095fe

Verifiers v0.1.15.dev6 Release Notes

Date: 05/14/2026

Highlights since v0.1.15.dev5

  • Routed-experts token metadata is handled as an opaque payload. OpenAI-compatible chat and text-completions clients now preserve routed-experts data without expanding it through user-facing message types, and token parsing truncates that sidecar alongside prompt/completion tokens when a rollout is clipped to max_seq_len.
  • v1 EnvConfig typing is stricter. The v1 loader envelope, config docs, generated environment templates, examples, and tests now keep environment-level configuration narrow while routing taskset- and harness-owned fields through their concrete config classes.
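
The clipping behavior described for routed-experts data can be pictured as truncating a per-token sidecar in lockstep with the token ids. The sketch below assumes the sidecar holds exactly one opaque entry per token across prompt and completion. That layout, the function name, and the argument names are assumptions made for illustration, not the library's actual representation.

```python
def clip_to_max_seq_len(prompt_ids, completion_ids, sidecar, max_seq_len):
    """Clip tokens to max_seq_len and truncate the per-token sidecar to match.

    Assumes `sidecar` has one opaque entry per token, prompt entries first.
    """
    if len(sidecar) != len(prompt_ids) + len(completion_ids):
        raise ValueError("sidecar must have one entry per token")
    # The prompt is kept first; the completion gets whatever budget remains.
    prompt = prompt_ids[:max_seq_len]
    remaining = max_seq_len - len(prompt)
    completion = completion_ids[:remaining]
    # Entries are never inspected, only sliced, so the payload stays opaque.
    kept = len(prompt) + len(completion)
    return prompt, completion, sidecar[:kept]
```

Because the sidecar is only sliced, the same helper works whatever the entries contain.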

Changes included in v0.1.15.dev6 (since v0.1.15.dev5)

Tokens and clients

  • Consume split routed-experts payloads in OpenAI-compatible clients (#1363)
  • Move routed-experts helpers into response utilities (#1373)

v1 config

  • Tighten v1 EnvConfig typing (#1371)

Full Changelog: v0.1.15.dev5...v0.1.15.dev6

v0.1.15.dev5

14 May 15:00
17022c2

Verifiers v0.1.15.dev5 Release Notes

Date: 05/14/2026

Highlights since v0.1.15.dev4

  • Terminus2 is now a bundled v1 CLI harness. vf.Terminus2 runs Harbor's Terminus 2 agent behind the v1 harness boundary, is exported from verifiers.v1 and the root verifiers namespace, uses OpenAI-compatible proxy defaults, and captures optional Terminus2 logs as artifacts.
  • Harbor sandbox defaults stay explicit. Harbor tasksets and task freezing now preserve only task-provided sandbox settings, including network_access only when task config explicitly sets internet access.

Changes included in v0.1.15.dev5 (since v0.1.15.dev4)

v1 harnesses and tasksets

  • Add Terminus2 harness (#1356)

Full Changelog: v0.1.15.dev4...v0.1.15.dev5

v0.1.15.dev4

14 May 09:28
51432d6

Verifiers v0.1.15.dev4 Release Notes

Date: 05/14/2026

Highlights since v0.1.15.dev3

  • v1 config and binding surfaces are stricter and more explicit. v1 tasksets, harnesses, users, toolsets, env args, prompts, tools, and runtime handles now use named Pydantic-backed types and protocols instead of loose object-shaped config. Direct object binding is rejected so private runtime objects stay behind object loaders and explicit binding boundaries.
  • v1 rollout, lifecycle, and taskset utilities are consolidated. Shared config, binding, lifecycle, runtime registry, serialization, task freezing, taskset, usage, sandbox, and program plumbing now live in dedicated utility modules. Agent text extraction and rollout extraction stay explicit, and empty completions are handled cleanly in v1 reward paths.
  • Environment examples and docs now follow the current v1 contract. BFCL, RLM, subagent, Harbor/OpenCode, OpenEnv, wiki-search, tau2, and related examples have been updated for the current config and message-access patterns. OpenEnv remains optional at import time, OpenCode Harbor defaults are documented, and reusable v1 authoring guidance now lives in the BYO Harness docs.
  • Eval runs can show estimated Prime cost. prime eval run fetches Prime Inference pricing, records metadata.cost when usage and pricing are available, and renders cost in live eval displays, final summaries, saved-results TUI headers, and non-TUI usage output.
  • Repo and agent guidance reflects the current contributor workflow. AGENTS, docs, and environment skills now describe the canonical prime eval run path, public config expectations, taskset/harness ownership, downstream-consumer checks, and focused validation defaults.
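
As a rough illustration of the cost estimate mentioned above, the arithmetic is token usage multiplied by a per-token price, with no cost recorded when either input is missing. The field names (`input_tokens`, `output_tokens`, `input_per_million`, `output_per_million`) and the per-million-token pricing unit are assumptions for this sketch. The actual Prime Inference pricing schema is not described in these notes.

```python
from typing import Optional


def estimate_cost(usage: Optional[dict], pricing: Optional[dict]) -> Optional[float]:
    """Return an estimated cost, or None when usage or pricing is unavailable."""
    if not usage or not pricing:
        return None
    # Hypothetical schema: prices are quoted per one million tokens.
    input_cost = usage["input_tokens"] * pricing["input_per_million"] / 1_000_000
    output_cost = usage["output_tokens"] * pricing["output_per_million"] / 1_000_000
    return input_cost + output_cost
```

Returning `None` rather than `0.0` keeps "no estimate available" distinct from "this run was free".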

Changes included in v0.1.15.dev4 (since v0.1.15.dev3)

v1 environment model

  • Tighten v1 config, bindings, and handler typing (#1362)

Evaluation UX

  • [codex] Display eval cost from Prime pricing (#1368)

Docs, guidance, and maintenance

  • Add repo development best practices to AGENTS, docs, and skills (#1361)

Full Changelog: v0.1.15.dev3...v0.1.15.dev4

v0.1.15.dev3

13 May 19:30
c84fcd2

Verifiers v0.1.15.dev3 Release Notes

Date: 05/13/2026

Highlights since v0.1.15.dev2

  • OverlongPromptError no longer surfaces as a spurious interception error. CliAgentEnv / ComposableEnv rollouts that overflow the model context now report a clean stop=prompt_too_long without an accompanying InterceptionError in the eval errors summary. The interception server's first-error-wins guard now also short-circuits once the rollout loop has finalized via state["prompt_too_long"], so tail-end failures (e.g. write_eof to an agent that already closed its transport) can no longer re-attach a spurious StreamInterrupted on top of the real stop signal.
  • v1 sandbox commands run as background jobs. v1 sandbox program commands and Harbor verifier tests now run through run_background_job instead of foreground execute_command, preserving configured command/test timeouts as the background-job budget.
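
The first-error-wins guard described above can be sketched as a small recorder that keeps only the first error and ignores anything reported after the rollout has finalized. The `ErrorGuard` class and its `record` method are hypothetical. Only the `state["prompt_too_long"]` key comes from the notes.

```python
class ErrorGuard:
    """Keeps the first recorded error and ignores errors after finalization."""

    def __init__(self):
        self.state = {}
        self.error = None

    def record(self, exc: Exception) -> bool:
        # Short-circuit: the rollout already stopped cleanly, so late
        # failures (for example, writing to a closed transport) are dropped.
        if self.state.get("prompt_too_long"):
            return False
        # First error wins: later errors never overwrite it.
        if self.error is not None:
            return False
        self.error = exc
        return True
```

The boolean return value lets a caller log dropped errors at debug level without attaching them to the rollout.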

Changes included in v0.1.15.dev3 (since v0.1.15.dev2)

Fixes and maintenance

  • fix: treat OverlongPromptError as stop condition in interception proxy (#1365)
  • Use background jobs for v1 sandbox commands (#1364)

Full Changelog: v0.1.15.dev2...v0.1.15.dev3

v0.1.15.dev2

13 May 00:08
e0575dc

Verifiers v0.1.15.dev2 Release Notes

Date: 05/13/2026

Highlights since v0.1.15.dev1

  • OpenCode title-generation bypass. Disables the hidden OpenCode title agent in the legacy, composable, and v1 OpenCode harness configs. The small_model pin stays in place so that OpenCode does not fall back to its default small model and hit rate limits.

Changes included in v0.1.15.dev2 (since v0.1.15.dev1)

Fixes and maintenance

  • Disable OpenCode title generation in harness configs (#1359)

Full Changelog: v0.1.15.dev1...v0.1.15.dev2

v0.1.15.dev1

12 May 20:19
493fa16

Verifiers v0.1.15.dev1 Release Notes

Date: 05/12/2026

Highlights since v0.1.15.dev0

  • Renderer-native multimodal transport. Threads renderer-emitted multi_modal_data through rollout generation, trajectory tokens, msgpack transport, and save output, with per-step delta encoding to avoid duplicating image payloads across rollout windows.
  • Provider-specific reasoning controls. Maps reasoning_effort correctly for Anthropic models across native Anthropic calls and OpenAI-compatible Anthropic routes, including adaptive thinking defaults for newer Claude models.
  • Runtime and release hardening. Removes the fixed ZMQ env request timeout, tightens v1 lifecycle handler discovery, and adds BugBot rules that keep release metadata and dependency sources publishable.
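
Per-step delta encoding, as mentioned for multimodal payloads above, stores at each step only the items added since the previous step instead of the full cumulative list. The sketch below is a generic illustration that assumes each step's payload list only ever grows by appending. The function names and the list-of-strings representation are hypothetical and do not reflect the library's wire format.

```python
def encode_deltas(cumulative_steps):
    """Turn cumulative per-step payload lists into per-step deltas."""
    deltas, seen = [], 0
    for items in cumulative_steps:
        if len(items) < seen:
            # This sketch only handles append-only growth.
            raise ValueError("payload list shrank; expected append-only growth")
        deltas.append(items[seen:])
        seen = len(items)
    return deltas


def decode_deltas(deltas):
    """Rebuild the cumulative per-step payload lists from deltas."""
    steps, acc = [], []
    for delta in deltas:
        acc = acc + delta  # new list each step, so earlier steps stay intact
        steps.append(acc)
    return steps
```

With three steps that carry one image, then two, then the same two, the encoded form holds each image once rather than five times in total.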

Changes included in v0.1.15.dev1 (since v0.1.15.dev0)

Features and enhancements

  • feat(renderer-client): thread multimodal sidecar through rollout + transport (#1346)
  • Map Anthropic reasoning effort by provider (#1338)

Fixes and maintenance

  • Fix ZMQ env timeouts (#1355)
  • Fix v1 lifecycle handler discovery (#1347)
  • Strengthen BugBot release and dependency rules (#1354)
  • Tighten BugBot releasability and dependency rules (#1353)

Full Changelog: v0.1.15.dev0...v0.1.15.dev1

v0.1.15.dev0

12 May 01:29
a950819

Verifiers v0.1.15.dev0 Release Notes

Date: 05/12/2026

Highlights since v0.1.14

  • v1 taskset/harness follow-through. Refreshes the docs and examples around the v1 taskset/harness path, completes the opencode_harbor migration, adds setup-state compatibility for environments, and makes sandbox worker caps configurable.
  • Agent and SWE environment updates. Adds the experimental SWE debug environment, moves OpenCode configuration into reusable harness code, refactors the LangChain Deep Agents example to a v1 Wikispeedia taskset, and removes the direct prime-sandboxes dependency from rlm_swe_v1.
  • Interception, sandbox, and renderer hardening. Requires auth for interception endpoints, surfaces swallowed interception response errors, fixes keepalive configuration, avoids holding sandbox workers while background jobs run, retries transient file-read timeouts, and aligns renderer tool envelopes with the training distribution.
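
Retrying transient timeouts, as in the file-read fix above, usually amounts to a bounded retry loop with a growing delay. The helper below is a generic sketch of that idea. Its name, the use of the built-in `TimeoutError`, and the backoff schedule are assumptions, and it does not reproduce how the sandbox client actually retries.

```python
import time


def retry_on_timeout(fn, attempts: int = 3, base_delay: float = 0.0):
    """Call fn(), retrying on TimeoutError up to `attempts` total tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            # Out of retries: let the last timeout propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** attempt)
```

Only `TimeoutError` is retried, so other failures surface immediately instead of being masked by the loop.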

Changes included in v0.1.15.dev0 (since v0.1.14)

Features and enhancements

  • feat(composable): render pre-compaction branches via harness hook (#1291)
  • Make sandbox worker caps configurable (#1305)
  • Support setup_state return-state compatibility (#1308)
  • Update v1 Taskset/Harness docs and opencode_harbor migration (#1309)
  • Refactor deep-agents envs to v1 taskset-harness and add Wikispeedia example (#1317)
  • Add SWE debug environment (#1306)
  • Move OpenCode config into reusable harness package (#1318)
  • Remove direct prime-sandboxes dependency from rlm_swe_v1 (#1316)
  • Install sandbox fn program packages (#1319)
  • opencode: write config under XDG_CONFIG_HOME (#1320)
  • feat: surface swallowed response_future errors in interception proxy (#1196)
  • opencode: pin small_model to the intercepted provider (#1323)
  • Avoid holding sandbox workers with run_background_job() (#1328)
  • Skip OpenCode install when binary is pre-baked (#1176)

Fixes and maintenance

  • Require auth for interception endpoints (#1304)
  • fix: mute httpcore/httpx DEBUG when env-worker json_logging sets root level (#1234)
  • bump interception client size to handle trajectories with images (#1310)
  • Fix eval summary for heterogeneous metrics (#1295)
  • apt: harden sandbox bootstrap against transient archive.ubuntu.com flakes (#1284)
  • fix(renderer-client): wrap tools in OpenAI envelope to match training distribution (#1307)
  • fixes rlm compaction prompt so we still get tito hit (#1322)
  • Fix interception keepalive env override (#1321)
  • fix: narrow send_cancel BaseException catch to Exception (#1198)
  • Fix Harbor Hub task root handling (#1345)
  • Pin OpenCode small model in legacy configs (#1327)
  • Remove local training config scaffolding from Verifiers (#1348)
  • retry Read file timed out (#1350)

Full Changelog: v0.1.14...v0.1.15.dev0

v0.1.14

07 May 06:12
2fbd2b7

Verifiers v0.1.14 Release Notes

Date: 05/07/2026

Highlights since v0.1.13.dev8

  • Composable v1 Taskset/Harness API. Adds the verifiers.v1 authoring surface around serializable Task/State data, composable Taskset and Harness objects, and the vf.Env(taskset, harness) adapter for existing eval and training workers. The release includes lifecycle decorators, typed config objects, endpoint routing, toolsets, MCP tools, sandbox/program utilities, nested harness support, v1 docs, migration notes, and several v1 example environments, including new OpenAI Agents, LangChain Deep Agents, and DSPy RLM harness examples.
  • Consistent v1 environment configuration. Eval and RL/Hosted Training TOML now share the same public projection shape through [*.args], [*.taskset], and [*.harness] sections. v1 loaders accept both mapping and model-backed config objects through a common access helper, and strict child config parsing strips loader-local routing keys at the boundary.
  • Model-family starter configs. Restructures bundled eval, RL, and GEPA starter configs around model families such as Qwen 3.5, Qwen 3.5 MoE, Nemotron 3, and Llama 3, with setup mirroring the new config set into Lab workspaces.
  • New client and rendering paths. Adds an OpenAI Responses API client and a renderer-backed client path for exact token rendering and multi-turn bridge metrics. The renderer implementations now live in the external renderers package, exposed from Verifiers through an optional renderers extra and client integration. Renderer clients also forward preserve_all_thinking and preserve_thinking_between_tool_calls flags into the underlying renderer.
  • More rollout observability and artifacts. Adds per-turn timing through eval outputs and TUI display, token-id preservation for Nemotron client responses, GEPA system-prompt artifact export plus path-based prompt loading, and Lean guard markers with tamper-aware LeanRubric scoring.
  • Release and infrastructure hardening. Adds universal locks and a 7-day PyPI freshness cooldown, scopes Hub install freshness filtering to registry packages, skips secret-backed environment tests on fork PRs, points the composable RLM harness at rlm-harness, and routes opencode AGENT_WORKDIR per rollout.
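
The 7-day PyPI freshness cooldown above boils down to ignoring any package version uploaded more recently than a cutoff. The sketch below illustrates that filter in plain Python. The function name and the mapping of version strings to upload times are hypothetical, and the release itself configures this through uv's exclude-newer setting rather than code like this.

```python
from datetime import datetime, timedelta, timezone


def eligible_versions(uploads, now, cooldown_days=7):
    """Return versions whose upload time is at least cooldown_days before now.

    `uploads` maps a version string to its timezone-aware upload datetime.
    """
    cutoff = now - timedelta(days=cooldown_days)
    # Sorted only to make the result deterministic (plain string order).
    return sorted(v for v, uploaded in uploads.items() if uploaded <= cutoff)


now = datetime(2026, 5, 7, tzinfo=timezone.utc)
uploads = {
    "1.0.0": datetime(2026, 4, 1, tzinfo=timezone.utc),
    "1.1.0": datetime(2026, 5, 5, tzinfo=timezone.utc),  # too fresh
}
fresh_enough = eligible_versions(uploads, now)
```

A cooldown like this trades slightly older dependencies for time in which a compromised upload can be noticed and pulled.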

Changes included in v0.1.14 (since v0.1.13.dev8)

Features and enhancements

  • Taskset Harness (v1) (#1277)
  • ApiEnv examples for OpenAI Agents, LangChain Deep Agents, and DSPy RLM (#1121)
  • Refactor tau2 bench into a taskset-owned v1 environment (#1293)
  • Restructure example configs around model families (#1297)
  • add openai responses client (#1261)
  • Renderer-backed client integration via the external renderers package (#1068, #1279, #1282)
  • feat(renderer-client): forward preserve_*_thinking config flags (#1298)
  • feat: per-turn timing (#1182)
  • Add GEPA system prompt export and path-based prompt loading (#1268)
  • feat(lean): lean-guard markers + tamper-aware LeanRubric (#1271)
  • token id support for Nemotron client responses (#1231)

Fixes and maintenance

  • Fix v1 env config projection and typed child loader boundaries (#1294)
  • Skip secret-backed environment tests for fork PRs (#1292)
  • Scope uv freshness filtering to PyPI for Hub installs (#1286)
  • opencode harness: route AGENT_WORKDIR per-rollout instead of baked-in (#1280)
  • chore: add 7-day supply chain cooldown via uv exclude-newer (#1274)
  • chore: point DEFAULT_RLM_REPO_URL to rlm-harness (#1267)
  • Update Lab workspace setup guidance (#1299)

Full Changelog: v0.1.13.dev8...v0.1.14