feat(eval): let env_args.model override --model in vf-eval #1417

Open
mvanhorn wants to merge 1 commit into PrimeIntellect-ai:main from mvanhorn:fix/297-env-args-model-precedence
@mvanhorn mvanhorn commented May 19, 2026

Summary

verifiers/scripts/eval.py

In build_eval_config() immediately before the extra_env_kwargs["timeout_seconds"] = raw["timeout"] block (line ~796), insert:

# Plumb the resolved -m / registry / provider model id into extra_env_kwargs so
# custom envs can opt in (e.g. MyEnv.__init__(self, *, model: str, ...))
# without the user having to repeat it in -a / env_args. Users who want the env to
# use a different model than the inference client still set it via -a:
#     vf-eval -m google/gemma-3-27b-it -a '{"model": "qwen/qwen3-14b"}'
extra_env_kwargs.setdefault("model", model)

setdefault keeps any caller-injected value (from [[eval]] TOML configs that already set extra_env_kwargs.model) authoritative.
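The setdefault behavior can be checked in isolation. This is a minimal sketch, assuming extra_env_kwargs is a plain dict; the helper name inject_model is hypothetical and not part of verifiers.

```python
def inject_model(extra_env_kwargs: dict, model: str) -> dict:
    """Return a copy of extra_env_kwargs with "model" filled in only if absent."""
    merged = dict(extra_env_kwargs)  # copy so the caller's dict is not mutated
    merged.setdefault("model", model)  # no-op when "model" is already set
    return merged


# No preset value: the resolved -m model is injected.
print(inject_model({}, "google/gemma-3-27b-it"))
# A value preset by a TOML config stays authoritative.
print(inject_model({"model": "qwen/qwen3-14b"}, "google/gemma-3-27b-it"))
```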

verifiers/utils/eval_utils.py::run_evaluations

If env_args contains "model", it should remain the source of truth at env-construction time. The current code already merges env_args | extra_env_kwargs when instantiating the env class. Verify the merge order:

# Pseudocode at instantiation site
env = env_class(**env_args, **extra_env_kwargs)

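If the site unpacks both dicts as in the pseudocode, Python raises TypeError (got multiple values for keyword argument 'model') whenever both contain the key, regardless of which dict comes last. If it instead merges with env_args | extra_env_kwargs, nothing raises, but the right-hand operand wins and the injected value silently replaces the user's env_args.model. Either way, fix: drop model from extra_env_kwargs when present in env_args: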

if "model" in env_args and "model" in extra_env_kwargs:
    extra_env_kwargs = {k: v for k, v in extra_env_kwargs.items() if k != "model"}
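The collision and the drop-the-key fix can be reproduced without verifiers. This is a sketch, assuming Python 3.9+ for the dict union operator; FakeEnv is a hypothetical stand-in for the real env class, not something from the codebase.

```python
class FakeEnv:
    """Hypothetical stand-in for an env class that accepts a model kwarg."""

    def __init__(self, *, model: str, **kwargs):
        self.model = model


env_args = {"model": "qwen/qwen3-14b"}  # user override via -a
extra_env_kwargs = {"model": "google/gemma-3-27b-it"}  # injected from -m

# Double unpacking with a shared key raises, whichever dict comes last.
try:
    FakeEnv(**env_args, **extra_env_kwargs)
    collided = False
except TypeError:
    collided = True

# A dict union does not raise: the right-hand operand silently wins.
union_model = FakeEnv(**(env_args | extra_env_kwargs)).model

# Dropping the injected key first keeps the user's value in both styles.
if "model" in env_args and "model" in extra_env_kwargs:
    extra_env_kwargs = {k: v for k, v in extra_env_kwargs.items() if k != "model"}
safe_model = FakeEnv(**env_args, **extra_env_kwargs).model
```

The union case is the one worth watching for during review, since it fails silently rather than with an exception.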

Before editing, grep for the actual instantiation site:

grep -nR "env_args" verifiers/utils/eval_utils.py | grep -i "extra_env_kwargs\|env_class\|load_env"

and adjust the snippet to match the local idiom.

Docs

Append a short subsection to README.md (or the existing docs/evaluation.md) explaining precedence:

Model precedence in vf-eval

-m / --model sets the inference client's model. Custom envs that need to know
the model (e.g. for prefix-cache routing, judge models, user-sim models) can
read it from the model kwarg in their __init__. To use a different model
inside the env than the one driving inference, pass it explicitly via
-a '{"model": "..."}' — this overrides only the env's view; the inference
client still uses -m.
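As an illustration of the env side, here is a minimal sketch of an entrypoint that reads the model kwarg. JudgeEnv and its judge_model field are hypothetical, and the load_environment signature shown is an assumption about how a custom env might opt in, not the verifiers API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeEnv:
    """Hypothetical env that needs a model id for its judge calls."""

    judge_model: Optional[str]


def load_environment(model: Optional[str] = None, **kwargs) -> JudgeEnv:
    # `model` arrives either from the resolved -m value or from -a '{"model": ...}'.
    return JudgeEnv(judge_model=model)


# Only -m given: the env sees the inference model.
env = load_environment(model="google/gemma-3-27b-it")
# -a '{"model": "qwen/qwen3-14b"}' given: the env sees the override.
env_override = load_environment(**{"model": "qwen/qwen3-14b"})
```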

Why this matters

Issue #297 (filed by @damoonsh) reports that vf-eval requires the model name to be passed twice when a custom MultiTurn-extending env uses the model inside env_response: once with -m for the inference client, and again via -a '{"model": "..."}' so the env can read it. The user asks: "Could we give precedence to model arg instead of -m inside eval.py file?"

The clean resolution is the inverse — keep -m as the canonical CLI input for the inference client (its current role) but pass the resolved model into the env instantiation as a well-known kwarg so user-defined envs do not need to pull model out of env_args at all. As a fallback for envs that want a different model than the inference client (e.g. a separate user-sim model), keep env_args.model working and have it override only inside the env, leaving the inference client unaffected.

Acceptance:

  • vf-eval -m foo/bar makes the resolved model id available to custom envs via extra_env_kwargs["model"] so MyEnv.__init__(self, *, model: str, ...) "just works" without duplicating it in -a.
  • If env_args already contains a model key, that value wins in the env (the user explicitly overrode it), but the inference client still uses -m (no behavior change there).
  • A new short README section under "vf-eval" documents the precedence rules.
  • Existing tests pass; new test confirms the resolved -m is plumbed into extra_env_kwargs.

Testing

tests/scripts/test_eval_model_kwarg.py (new)

from verifiers.scripts.eval import build_eval_config


def test_resolved_model_lands_in_extra_env_kwargs(monkeypatch):
    # Build a minimal raw config dict that exercises the resolver
    raw = {
        "env_id": "math-python",
        "model": "openai/gpt-4.1-mini",
        "api_base_url": "https://example.test/v1",
        "api_key_var": "OPENAI_API_KEY",
    }
    monkeypatch.setenv("OPENAI_API_KEY", "sk-test")
    cfg = build_eval_config(raw)  # signature: dict -> EvalConfig
    assert cfg.model == "openai/gpt-4.1-mini"
    assert cfg.extra_env_kwargs.get("model") == "openai/gpt-4.1-mini"


def test_env_args_model_overrides_for_env_but_not_client(monkeypatch):
    raw = {
        "env_id": "math-python",
        "model": "openai/gpt-4.1-mini",
        "env_args": {"model": "qwen/qwen3-14b"},
        "api_base_url": "https://example.test/v1",
        "api_key_var": "OPENAI_API_KEY",
    }
    monkeypatch.setenv("OPENAI_API_KEY", "sk-test")
    cfg = build_eval_config(raw)
    # inference client model = -m
    assert cfg.model == "openai/gpt-4.1-mini"
    # env-side override survives in env_args
    assert cfg.env_args.get("model") == "qwen/qwen3-14b"
    # extra_env_kwargs.model is dropped (or absent) so kwargs do not collide
    assert cfg.extra_env_kwargs.get("model") is None

The build_eval_config function is currently a closure inside main() (per the surrounding code shape). If it remains a closure, refactor it to a module-level helper (or expose a thin wrapper) so it is unit-testable. This refactor is itself worth doing — it is a small, well-bounded improvement that also makes the rest of build_eval_config testable.

If extracting the closure is rejected by the maintainer, fall back to a CLI integration test that invokes python -m verifiers.scripts.eval --dry-run ... (if such a mode exists) or asserts via a smoke-test fixture env that records the kwargs it was constructed with.

Run with uv run pytest tests/scripts/test_eval_model_kwarg.py -v.

Fixes #297

AI was used for assistance.


Note

Medium Risk
Touches evaluation config construction and environment instantiation plumbing, which can subtly change how envs receive kwargs (especially around model) and could break custom env load_environment() signatures if assumptions differ.

Overview
Model/kwarg precedence for eval environments is formalized. build_eval_config() is extracted to module scope and now injects the resolved inference model into extra_env_kwargs by default, but drops it when the user explicitly sets env_args.model.

Environment construction avoids model kwarg collisions. run_evaluation() detects whether load_environment() can accept model and, when appropriate, passes the injected model via env_args while keeping it out of set_kwargs.

Docs add a short Model precedence section, and new unit tests assert that (1) resolved -m lands in extra_env_kwargs, and (2) env_args.model overrides only the environment view, not the inference client model.

Reviewed by Cursor Bugbot for commit cf5f00a.

Note

Let env_args.model override --model for the environment in vf-eval

  • By default, build_eval_config now sets extra_env_kwargs['model'] to the resolved client model so environments receive the inference model without extra flags.
  • If --env-args '{"model": "<value>"}' is provided, that value is used as the environment's model instead, while the inference client continues using --model.
  • run_evaluation uses a new _load_environment_accepts_arg helper to introspect whether load_environment accepts a model kwarg before passing it, avoiding errors on environments that don't support it.
  • Behavioral Change: environments that accept a model kwarg will now receive it automatically from the resolved --model value unless overridden via --env-args.

Macroscope summarized cf5f00a.
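The summaries name a _load_environment_accepts_arg helper but do not show its body. One way such a check might be written with the standard library is sketched below; accepts_arg and the three sample functions are hypothetical, and the PR's actual implementation may differ.

```python
import inspect
from typing import Callable


def accepts_arg(fn: Callable, name: str) -> bool:
    """Return True if fn can receive `name` as a keyword argument."""
    params = inspect.signature(fn).parameters
    # A **kwargs catch-all accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    param = params.get(name)
    # Positional-only parameters cannot be passed by keyword.
    return param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )


def with_model(model: str): ...
def with_kwargs(**kwargs): ...
def without_model(split: str = "train"): ...


for fn in (with_model, with_kwargs, without_model):
    print(fn.__name__, accepts_arg(fn, "model"))
```

Checking the signature up front lets the caller skip the injected model for envs that would otherwise fail with an unexpected-keyword TypeError.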

Development

Successfully merging this pull request may close these issues.

-m take precedence over model arg being passed