4 changes: 2 additions & 2 deletions assets/lab/environments/AGENTS.md
@@ -712,8 +712,8 @@ by the framework; do not accept `None` or write `config = config or MyEnvConfig(
 knobs on `TasksetConfig` or `HarnessConfig`, not on `EnvConfig` itself.
 Environment packages should not subclass `Env`.
 Reusable taskset environments can type `harness` as `vf.HarnessConfig`; TOML can
-then select a registered harness with names like `type = "terminus2"` or
-`type = "pi"` inside the harness table.
+then select a registered harness with names like `type = "codex"`,
+`type = "claude-code"`, or `type = "pi"` inside the harness table.

 The taskset-only shape is:

10 changes: 5 additions & 5 deletions docs/byo-harness.md
@@ -429,9 +429,9 @@ config surface; do not subclass `Env` just to bypass inference.

 Packaged CLI harnesses should use the same boundary. These implementations live
 under `verifiers.v1.packages` while the v1 surface stabilizes, and are
-re-exported through `verifiers.v1`. `OpenCode`, `Pi`, `MiniSWEAgent`,
-`Terminus2`, and `RLM` are bundled `Harness` leaf wrappers for common
-command-line agents:
+re-exported through `verifiers.v1`. `OpenCode`, `ClaudeCode`, `Codex`, `Pi`,
+`MiniSWEAgent`, `Terminus2`, and `RLM` are bundled `Harness` leaf wrappers for
+common command-line agents:

 ```python
 class HarborEnvConfig(vf.EnvConfig):
@@ -543,8 +543,8 @@ and harness config types for the loader.

 Reusable taskset environments can keep `harness` typed as `vf.HarnessConfig`.
 Then TOML may select a registered harness config with `type`, for example
-`type = "terminus2"` or `type = "pi"`, and pass that config's ordinary fields
-beside it. Use `harness = "pi"` when the selected harness needs no field
+`type = "codex"` or `type = "claude-code"`, and pass that config's ordinary
+fields beside it. Use `harness = "pi"` when the selected harness needs no field
 overrides.

 ```python
4 changes: 2 additions & 2 deletions docs/environments.md
@@ -705,8 +705,8 @@ by the framework; do not accept `None` or write `config = config or MyEnvConfig(
 knobs on `TasksetConfig` or `HarnessConfig`, not on `EnvConfig` itself.
 Environment packages should not subclass `Env`.
 Reusable taskset environments can type `harness` as `vf.HarnessConfig`; TOML can
-then select a registered harness with names like `type = "terminus2"` or
-`type = "pi"` inside the harness table.
+then select a registered harness with names like `type = "codex"`,
+`type = "claude-code"`, or `type = "pi"` inside the harness table.

 The taskset-only shape is:

6 changes: 3 additions & 3 deletions docs/evaluation.md
@@ -383,15 +383,15 @@ optional:
 | `endpoint_id` | string | Endpoint registry id (requires TOML `endpoints_path`) |

 Use `harness.type` to choose a registered v1 harness config for reusable taskset
-environments. Bundled names include `opencode`, `mini-swe-agent`, `pi`, `rlm`,
-and `terminus2`:
+environments. Bundled names include `opencode`, `claude-code`, `codex`,
+`mini-swe-agent`, `pi`, `rlm`, and `terminus2`:

 ```toml
 [[eval]]
 id = "openthoughts-tblite"

 [eval.harness]
-type = "terminus2"
+type = "codex"
 max_turns = 4
 ```

6 changes: 3 additions & 3 deletions docs/reference.md
@@ -1008,9 +1008,9 @@ Nested config defaults should be explicit config objects, e.g.
 `taskset: MyTasksetConfig = MyTasksetConfig()`.

 When `harness` is typed as `HarnessConfig`, TOML can select a registered
-harness config with `type`, such as `type = "terminus2"` or `type = "pi"`, then
-pass the normal fields for that config in the same table. The shorthand form
-`harness = "pi"` is also accepted when no fields need to be overridden.
+harness config with `type`, such as `type = "codex"` or `type = "claude-code"`,
+then pass the normal fields for that config in the same table. The shorthand
+form `harness = "pi"` is also accepted when no fields need to be overridden.

 `Config` subclasses are strict Pydantic config models. Validate raw mappings
 with `MyConfig.model_validate(...)` or use the typed object directly.
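The selection rule described in this hunk (a `type` alias picks a registered config, remaining keys become its fields, and a bare string is shorthand for a table with only `type`) can be sketched roughly as follows. This is a minimal illustration, not the `verifiers` implementation: every class name ending in `Sketch`, the `REGISTRY` dict, and `select_harness` are hypothetical, and plain dataclasses stand in for the strict Pydantic models. The aliases and the Codex error message are taken from the tests in this change.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class HarnessConfigSketch:
    """Hypothetical stand-in for a harness config base class."""
    max_turns: Optional[int] = None


@dataclass
class ClaudeCodeConfigSketch(HarnessConfigSketch):
    pass


@dataclass
class PiConfigSketch(HarnessConfigSketch):
    pass


@dataclass
class CodexConfigSketch(HarnessConfigSketch):
    def __post_init__(self) -> None:
        # Mirrors the behavior the tests expect: Codex has no turn limit knob.
        if self.max_turns is not None:
            raise ValueError("CodexConfig.max_turns is not supported")


# Several aliases may point at the same config class.
REGISTRY = {
    "claude": ClaudeCodeConfigSketch,
    "claude-code": ClaudeCodeConfigSketch,
    "codex": CodexConfigSketch,
    "pi": PiConfigSketch,
}


def select_harness(raw: Union[str, dict]) -> HarnessConfigSketch:
    """Resolve `harness = "pi"` or `{type = "...", ...}` to a config object."""
    fields = {"type": raw} if isinstance(raw, str) else dict(raw)
    alias = fields.pop("type")
    if alias not in REGISTRY:
        raise ValueError(f"unknown harness type: {alias!r}")
    return REGISTRY[alias](**fields)


print(select_harness("pi"))
print(select_harness({"type": "claude-code", "max_turns": 4}))
```

Keeping the alias lookup separate from field validation means each config class stays responsible for its own fields, which is why `codex` can reject `max_turns` while the other aliases accept it.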
4 changes: 2 additions & 2 deletions environments/AGENTS.md
@@ -711,8 +711,8 @@ by the framework; do not accept `None` or write `config = config or MyEnvConfig(
 knobs on `TasksetConfig` or `HarnessConfig`, not on `EnvConfig` itself.
 Environment packages should not subclass `Env`.
 Reusable taskset environments can type `harness` as `vf.HarnessConfig`; TOML can
-then select a registered harness with names like `type = "terminus2"` or
-`type = "pi"` inside the harness table.
+then select a registered harness with names like `type = "codex"`,
+`type = "claude-code"`, or `type = "pi"` inside the harness table.

 The taskset-only shape is:

26 changes: 15 additions & 11 deletions tests/test_v1_config_extension.py
@@ -1565,32 +1565,36 @@ class LocalEnvConfig(EnvConfig):


 @pytest.mark.parametrize(
-    ("alias", "config_cls", "harness_cls"),
+    ("alias", "config_cls", "harness_cls", "config_fields"),
     [
-        ("opencode", vf.OpenCodeConfig, vf.OpenCode),
-        ("open-code", vf.OpenCodeConfig, vf.OpenCode),
-        ("mini-swe-agent", vf.MiniSWEAgentConfig, vf.MiniSWEAgent),
-        ("pi", vf.PiConfig, vf.Pi),
-        ("rlm", vf.RLMConfig, vf.RLM),
-        ("terminus2", vf.Terminus2Config, vf.Terminus2),
-        ("terminus-2", vf.Terminus2Config, vf.Terminus2),
+        ("opencode", vf.OpenCodeConfig, vf.OpenCode, {"max_turns": 4}),
+        ("open-code", vf.OpenCodeConfig, vf.OpenCode, {"max_turns": 4}),
+        ("claude", vf.ClaudeCodeConfig, vf.ClaudeCode, {"max_turns": 4}),
+        ("claude-code", vf.ClaudeCodeConfig, vf.ClaudeCode, {"max_turns": 4}),
+        ("codex", vf.CodexConfig, vf.Codex, {}),
+        ("mini-swe-agent", vf.MiniSWEAgentConfig, vf.MiniSWEAgent, {"max_turns": 4}),
+        ("pi", vf.PiConfig, vf.Pi, {"max_turns": 4}),
+        ("rlm", vf.RLMConfig, vf.RLM, {"max_turns": 4}),
+        ("terminus2", vf.Terminus2Config, vf.Terminus2, {"max_turns": 4}),
+        ("terminus-2", vf.Terminus2Config, vf.Terminus2, {"max_turns": 4}),
     ],
 )
 def test_env_config_harness_type_selects_packaged_harness_config(
-    alias, config_cls, harness_cls
+    alias, config_cls, harness_cls, config_fields
 ) -> None:
     class GenericEnvConfig(EnvConfig):
         taskset: TasksetConfig = TasksetConfig(source=[])
         harness: HarnessConfig = vf.OpenCodeConfig()

     config = coerce_config(
         GenericEnvConfig,
-        {"harness": {"type": alias, "max_turns": 4}},
+        {"harness": {"type": alias, **config_fields}},
     )
     env = Env(config=config)

     assert isinstance(config.harness, config_cls)
-    assert config.harness.max_turns == 4
+    if "max_turns" in config_fields:
+        assert config.harness.max_turns == 4
     assert isinstance(env.harness, harness_cls)


83 changes: 82 additions & 1 deletion tests/test_v1_harbor_cli.py
@@ -12,6 +12,8 @@

 import verifiers as root_vf
 import verifiers.v1 as vf
+from verifiers.v1.packages.harnesses.claude_code import claude_code_mcp_json
+from verifiers.v1.packages.harnesses.codex import codex_mcp_toml
 from verifiers.v1.packages.harnesses.configs import (
     TERMINUS_2_DEFAULT_API_BASE_URL,
     TERMINUS_2_DEFAULT_HARBOR_PACKAGE,
@@ -212,9 +214,19 @@ async def test_harbor_reward_uses_background_job_for_tests(


 def test_packaged_harbor_and_opencode_imports_are_reexported() -> None:
-    from verifiers.v1.packages.harnesses import OpenCode, OpenCodeConfig, Pi
+    from verifiers.v1.packages.harnesses import (
+        ClaudeCode,
+        Codex,
+        OpenCode,
+        OpenCodeConfig,
+        Pi,
+    )
     from verifiers.v1.packages.tasksets import HarborTaskset

+    assert vf.ClaudeCode is ClaudeCode
+    assert root_vf.ClaudeCode is ClaudeCode
+    assert vf.Codex is Codex
+    assert root_vf.Codex is Codex
     assert vf.OpenCode is OpenCode
     assert vf.OpenCodeConfig is OpenCodeConfig
     assert vf.Pi is Pi
@@ -255,6 +267,8 @@ def test_opencode_config_owns_opencode_harness_fields() -> None:
     ("harness_cls", "config_cls"),
     [
         (vf.OpenCode, vf.OpenCodeConfig),
+        (vf.ClaudeCode, vf.ClaudeCodeConfig),
+        (vf.Codex, vf.CodexConfig),
         (vf.MiniSWEAgent, vf.MiniSWEAgentConfig),
         (vf.Pi, vf.PiConfig),
         (vf.RLM, vf.RLMConfig),
@@ -312,6 +326,73 @@ def test_pi_harness_writes_intercepted_model_and_mcp_config() -> None:
     assert mcp["mcpServers"]["verifiers-tools"]["command"] == "python3"


+def test_claude_code_harness_builds_sandbox_program() -> None:
+    harness = vf.ClaudeCode(
+        config=vf.ClaudeCodeConfig(
+            system_prompt="extra system prompt",
+            agent_workdir="/workspace",
+            max_turns=7,
+        )
+    )
+    program = cast(dict[str, object], harness.program)
+    command = cast(list[object], program["command"])
+    setup = cast(str, program["setup"])
+    files = cast(dict[str, object], program["files"])
+    env = cast(dict[str, object], program["env"])
+    mcp = json.loads(claude_code_mcp_json())
+
+    assert "npm install -g @anthropic-ai/claude-code" in setup
+    assert "/claude-code/instruction.txt" in files
+    assert "/claude-code/system.txt" in files
+    assert program["channels"] == "mcp"
+    assert env["ANTHROPIC_MODEL"] == "runtime.model"
+    assert env["CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY"] == "1"
+    assert "cat /claude-code/instruction.txt | claude -p" in cast(str, command[2])
+    assert '"$(cat /claude-code/instruction.txt)"' not in cast(str, command[2])
+    assert "--max-turns 7" in cast(str, command[2])
+    assert "--permission-mode bypassPermissions" in cast(str, command[2])
+    assert "--mcp-config /tmp/claude-code-mcp.json" in cast(str, command[2])
+    assert mcp["mcpServers"]["verifiers-tools"]["command"] == "python3"
+
+
+def test_codex_harness_builds_sandbox_program() -> None:
+    harness = vf.Codex(
+        config=vf.CodexConfig(
+            system_prompt="extra system prompt",
+            agent_workdir="/workspace",
+            codex_sandbox="workspace-write",
+            model_reasoning_effort="high",
+        )
+    )
+    program = cast(dict[str, object], harness.program)
+    command = cast(list[object], program["command"])
+    setup = cast(str, program["setup"])
+    files = cast(dict[str, object], program["files"])
+    env = cast(dict[str, object], program["env"])
+    mcp_toml = codex_mcp_toml()
+
+    assert "npm install -g @openai/codex" in setup
+    assert "/codex/instruction.txt" in files
+    assert "/codex/system.txt" in files
+    assert program["channels"] == "mcp"
+    assert env["OPENAI_MODEL"] == "runtime.model"
+    assert callable(env["CODEX_API_KEY"])
+    assert 'model_provider = "verifiers"' in cast(str, command[2])
+    assert 'approval_policy = "never"' in cast(str, command[2])
+    assert 'sandbox_mode = "workspace-write"' in cast(str, command[2])
+    assert 'model_reasoning_effort = "high"' in cast(str, command[2])
+    assert "--sandbox workspace-write" in cast(str, command[2])
+    assert "--output-last-message /logs/agent/codex.txt.final" in cast(str, command[2])
+    assert "- < /logs/agent/codex.txt.prompt" in cast(str, command[2])
+    assert '"$(cat /logs/agent/codex.txt.prompt)"' not in cast(str, command[2])
+    assert 'command = "python3"' in mcp_toml
+
+
+def test_codex_config_rejects_max_turns() -> None:
+    with pytest.raises(ValueError, match="CodexConfig.max_turns is not supported"):
+        vf.CodexConfig(max_turns=7)
+
+
 def test_terminus_2_harness_builds_sandbox_program() -> None:
     harness = vf.Terminus2(
         config=vf.Terminus2Config(
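The Claude Code test above asserts that the instruction file is piped to `claude -p` on stdin and that the `"$(cat ...)"` command-substitution form is absent. A minimal sketch of a command builder that would satisfy those assertions is shown below. `build_claude_command` is a hypothetical helper, not the harness's actual code, and the motivation (keeping prompt text out of shell argument expansion) is an assumption; the flags themselves are copied from the test.

```python
import shlex


def build_claude_command(
    instruction_path: str, mcp_config_path: str, max_turns: int
) -> str:
    """Build a shell pipeline that feeds the prompt file to `claude -p` via stdin.

    Hypothetical helper: the real harness may assemble its command differently.
    """
    cli_parts = [
        "claude",
        "-p",
        "--max-turns",
        str(max_turns),
        "--permission-mode",
        "bypassPermissions",
        "--mcp-config",
        mcp_config_path,
    ]
    cli = " ".join(shlex.quote(part) for part in cli_parts)
    # The prompt arrives on stdin, so its contents are never re-parsed by the shell.
    return f"cat {shlex.quote(instruction_path)} | {cli}"


command = build_claude_command(
    "/claude-code/instruction.txt", "/tmp/claude-code-mcp.json", 7
)
print(command)
```

Quoting each argument with `shlex.quote` keeps paths with unusual characters safe, while paths like the ones in the test pass through unchanged.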
12 changes: 12 additions & 0 deletions verifiers/__init__.py
@@ -88,6 +88,10 @@
     "UserConfig",
     "HarborTaskset",
     "HarborTasksetConfig",
+    "ClaudeCode",
+    "ClaudeCodeConfig",
+    "Codex",
+    "CodexConfig",
     "MiniSWEAgent",
     "MiniSWEAgentConfig",
     "OpenCode",
@@ -219,6 +223,10 @@
     "UserConfig": "verifiers.v1:UserConfig",
     "HarborTaskset": "verifiers.v1:HarborTaskset",
     "HarborTasksetConfig": "verifiers.v1:HarborTasksetConfig",
+    "ClaudeCode": "verifiers.v1:ClaudeCode",
+    "ClaudeCodeConfig": "verifiers.v1:ClaudeCodeConfig",
+    "Codex": "verifiers.v1:Codex",
+    "CodexConfig": "verifiers.v1:CodexConfig",
     "MiniSWEAgent": "verifiers.v1:MiniSWEAgent",
     "MiniSWEAgentConfig": "verifiers.v1:MiniSWEAgentConfig",
     "OpenCode": "verifiers.v1:OpenCode",
@@ -315,6 +323,10 @@ def __getattr__(name: str):
         HarborTasksetConfig,
         MCPTool,
         MCPToolConfig,
+        ClaudeCode,
+        ClaudeCodeConfig,
+        Codex,
+        CodexConfig,
         MiniSWEAgent,
         MiniSWEAgentConfig,
         MutableConfigMap,
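The hunks above add entries to a mapping of public names to `"module:attr"` strings next to a module-level `__getattr__`, which suggests lazy re-exports in the style of PEP 562. The diff does not show how the real `__getattr__` resolves those strings, so the sketch below is an assumption: `_LAZY_IMPORTS` and `lazy_getattr` are hypothetical names, and standard-library targets stand in for `verifiers.v1` so the example runs anywhere.

```python
import importlib

# Hypothetical table in the same "module:attr" format as the diff,
# pointing at standard-library targets so the sketch is self-contained.
_LAZY_IMPORTS = {
    "loads": "json:loads",
    "OrderedDict": "collections:OrderedDict",
}


def lazy_getattr(name: str):
    """Resolve a public name to its target object on first access."""
    try:
        target = _LAZY_IMPORTS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    module_name, _, attr = target.partition(":")
    # The import only happens when the name is actually requested.
    return getattr(importlib.import_module(module_name), attr)


print(lazy_getattr("loads")("[1, 2]"))
```

Assigning a function like this to a package's module-level `__getattr__` defers importing the heavier submodule until a caller first touches one of the listed names.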
4 changes: 2 additions & 2 deletions verifiers/v1/ENVIRONMENT_BEST_PRACTICES.md
@@ -42,8 +42,8 @@ your loader runs. The type annotation is not cosmetic.
    your loader can convert or forward them.
 3. If your environment has custom harness fields, the same rule applies to the
    `harness` annotation unless TOML selects a registered harness config with
-   `[env.harness] type = "terminus2"`, `[env.harness] type = "pi"`, or another
-   owner/config alias.
+   `[env.harness] type = "codex"`, `[env.harness] type = "claude-code"`, or
+   another owner/config alias.
 4. The config object that reaches `load_environment` is already validated and
    typed. Do not reconstruct child config objects just to recover their type.
 5. `vf.Env(taskset=MyTaskset(config=config.taskset), harness=MyHarness(config=config.harness))`
12 changes: 7 additions & 5 deletions verifiers/v1/README.md
@@ -574,9 +574,9 @@ signature.

 Reusable CLI programs should be packaged as `Harness` subclasses. Package
 implementations live under `verifiers.v1.packages` while the v1 API stabilizes,
-and are re-exported from `verifiers.v1` for normal use. `OpenCode`, `Pi`,
-`MiniSWEAgent`, `Terminus2`, and `RLM` are bundled `Harness` leaf wrappers for
-common coding-agent CLIs.
+and are re-exported from `verifiers.v1` for normal use. `OpenCode`,
+`ClaudeCode`, `Codex`, `Pi`, `MiniSWEAgent`, `Terminus2`, and `RLM` are
+bundled `Harness` leaf wrappers for common coding-agent CLIs.

 ```python
 import verifiers as vf
@@ -603,6 +603,8 @@ endpoint and, when tools are enabled, installs the Pi MCP adapter and writes a
 project `.mcp.json`. Neither side needs to know the other's private fields.
 `MiniSWEAgent` owns mini-swe-agent installation, config layering, endpoint env,
 and log/trajectory artifacts.
+`ClaudeCode` and `Codex` package the Claude Code and Codex CLI non-interactive
+modes with endpoint, MCP proxy, and log artifact wiring.
 `Terminus2` owns Harbor Terminus agent installation, endpoint env, and log
 artifacts.
 `RLM` follows the same boundary for recursive LLM runs: `HarborTaskset` owns
@@ -1243,8 +1245,8 @@ and harness config types for the loader.

 Reusable taskset environments can keep `harness` typed as `vf.HarnessConfig`.
 Then TOML may select a registered harness config with `type`, for example
-`type = "terminus2"` or `type = "pi"`, and pass that config's ordinary fields
-beside it. Use `harness = "pi"` when the selected harness needs no field
+`type = "codex"` or `type = "claude-code"`, and pass that config's ordinary
+fields beside it. Use `harness = "pi"` when the selected harness needs no field
 overrides.

 ```python
8 changes: 8 additions & 0 deletions verifiers/v1/__init__.py
@@ -37,6 +37,10 @@
 from .env import Env
 from .harness import Harness
 from .packages.harnesses import (
+    ClaudeCode,
+    ClaudeCodeConfig,
+    Codex,
+    CodexConfig,
     MiniSWEAgent,
     MiniSWEAgentConfig,
     OpenCode,
@@ -80,6 +84,10 @@
 __all__ = [
     "ConfigData",
     "CallableConfig",
+    "ClaudeCode",
+    "ClaudeCodeConfig",
+    "Codex",
+    "CodexConfig",
     "Config",
     "ConfigMap",
     "Env",
8 changes: 8 additions & 0 deletions verifiers/v1/packages/harnesses/__init__.py
@@ -1,17 +1,25 @@
 from .configs import (
+    ClaudeCodeConfig,
+    CodexConfig,
     MiniSWEAgentConfig,
     OpenCodeConfig,
     PiConfig,
     RLMConfig,
     Terminus2Config,
 )
+from .claude_code import ClaudeCode
+from .codex import Codex
 from .mini_swe_agent import MiniSWEAgent
 from .opencode import OpenCode
 from .pi import Pi
 from .rlm import RLM
 from .terminus_2 import Terminus2

 __all__ = [
+    "ClaudeCode",
+    "ClaudeCodeConfig",
+    "Codex",
+    "CodexConfig",
     "MiniSWEAgent",
     "MiniSWEAgentConfig",
     "OpenCode",