feat(eval): let env_args.model override --model in vf-eval#1417
Open
mvanhorn wants to merge 1 commit into
Open
feat(eval): let env_args.model override --model in vf-eval#1417mvanhorn wants to merge 1 commit into
mvanhorn wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
verifiers/scripts/eval.pyIn
build_eval_config()immediately before theextra_env_kwargs["timeout_seconds"] = raw["timeout"]block (line ~796), insert:setdefaultkeeps any caller-injected value (from[[eval]]TOML configs that already setextra_env_kwargs.model) authoritative.verifiers/utils/eval_utils.py::run_evaluationsIf
env_argscontains"model", it should remain the source of truth at env-construction time. The current code already mergesenv_args | extra_env_kwargswhen instantiating the env class. Verify the merge order:If the existing merge order is
extra_env_kwargslast, Python will raiseTypeError: multiple values for keyword argument 'model'when both are set. Fix: dropmodelfromextra_env_kwargswhen present inenv_args:Before editing, grep for the actual instantiation site:
and adjust the snippet to match the local idiom.
Docs
Append a short subsection to
README.md(or the existingdocs/evaluation.md) explaining precedence:Why this matters
Issue #297 (filed by @damoonsh, NONE) reports that
vf-evalrequires the model name to be passed twice when a customMultiTurn-extending env uses the model insideenv_response: once with-mfor the inference client, and again via-a '{"model": "..."}'so the env can read it. The user asks: "Could we give precedence to model arg instead of -m inside eval.py file?"The clean resolution is the inverse — keep
-mas the canonical CLI input for the inference client (its current role) but pass the resolved model into the env instantiation as a well-known kwarg so user-defined envs do not need to pullmodelout ofenv_argsat all. As a fallback for envs that want a different model than the inference client (e.g. a separate user-sim model), keepenv_args.modelworking and have it override only inside the env, leaving the inference client unaffected.Acceptance:
vf-eval -m foo/barmakes the resolved model id available to custom envs viaextra_env_kwargs["model"]soMyEnv(__init__).__init__(self, *, model: str, ...)"just works" without duplicating in-a.env_argsalready contains amodelkey, that value wins in the env (the user explicitly overrode it), but the inference client still uses-m(no behavior change there).-mis plumbed intoextra_env_kwargs.Testing
tests/scripts/test_eval_model_kwarg.py(new)The
build_eval_configfunction is currently a closure insidemain()(per the surrounding code shape). If it remains a closure, refactor it to a module-level helper (or expose a thin wrapper) so it is unit-testable. This refactor is itself worth doing — it is a small, well-bounded improvement that also makes the rest ofbuild_eval_configtestable.If extracting the closure is rejected by the maintainer, fall back to a CLI integration test that invokes
python -m verifiers.scripts.eval --dry-run ...(if such a mode exists) or asserts via a smoke-test fixture env that records the kwargs it was constructed with.Run with
uv run pytest tests/scripts/test_eval_model_kwarg.py -v.Fixes #297
AI was used for assistance.
Note
Medium Risk
Touches evaluation config construction and environment instantiation plumbing, which can subtly change how envs receive kwargs (especially around
model) and could break custom envload_environment()signatures if assumptions differ.Overview
Model/kwarg precedence for eval environments is formalized.
build_eval_config()is extracted to module scope and now injects the resolved inferencemodelintoextra_env_kwargsby default, but drops it when the user explicitly setsenv_args.model.Environment construction avoids
modelkwarg collisions.run_evaluation()detects whetherload_environment()can acceptmodeland, when appropriate, passes the injectedmodelviaenv_argswhile keeping it out ofset_kwargs.Docs add a short Model precedence section, and new unit tests assert that (1) resolved
-mlands inextra_env_kwargs, and (2)env_args.modeloverrides only the environment view, not the inference client model.Reviewed by Cursor Bugbot for commit cf5f00a. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Let
env_args.modeloverride--modelfor the environment invf-evalbuild_eval_confignow setsextra_env_kwargs['model']to the resolved client model so environments receive the inference model without extra flags.--env-args model=<value>is provided, that value is used as the environment's model instead, while the inference client continues using--model.run_evaluationuses a new_load_environment_accepts_arghelper to introspect whetherload_environmentaccepts amodelkwarg before passing it, avoiding errors on environments that don't support it.modelkwarg will now receive it automatically from the resolved--modelvalue unless overridden via--env-args.Macroscope summarized cf5f00a.