Verifier follow-ups batch (PR #320-#324) #330
Merged
The passed/failed/errored/verifier_errored bucketing was duplicated with textually-divergent predicates in Evaluation.run() and TaskMetrics. Extract a single classify_result/classify_result_dict in _utils/scoring.py so both sites consume one definition and the passed+failed+errored+verifier_errored == total invariant holds structurally. Rewrite the tautological test_error_and_verifier_error_counted_once to call the real classifier instead of re-implementing the logic inline. From PR #320 verification.
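A minimal sketch of what a single shared classifier could look like. The `Result` type, its field names, the precedence of `verifier_error` over `error`, and the pass threshold are all assumptions made for illustration; only the four bucket names and the "no reward, no error counts as errored" rule come from this PR.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for a run result; field names are assumptions."""
    reward: Optional[float] = None
    error: Optional[str] = None
    verifier_error: Optional[str] = None


def classify_result(result: Result) -> str:
    """Return exactly one bucket name for every possible result."""
    if result.verifier_error:
        return "verifier_errored"  # assumed to take precedence over error
    if result.error:
        return "errored"
    if result.reward is None:
        return "errored"  # no reward and no error: the formerly uncounted case
    # Assumption: a reward of 1.0 or more counts as a pass.
    return "passed" if result.reward >= 1.0 else "failed"


def bucket_counts(results: list[Result]) -> Counter:
    """Count results per bucket; the counts always sum to len(results)."""
    return Counter(classify_result(r) for r in results)
```

Because every code path returns exactly one bucket name, the sum of the bucket counts equals the number of results by construction, which is the sense in which the invariant holds structurally rather than by convention.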
VerifierConfig.service is always a field and is_mounted is a property on every concrete sandbox, so the defensive getattr() calls in verifier.py are dead. Declare is_mounted on the BaseSandbox ABC (default False) and use direct attribute access for both service and is_mounted. Add a verifier test covering is_mounted=True + service="target": the is_mounted fast path is gated on service == "main", and that guard was previously untested because the test stub always had is_mounted=False. The new test fails if the guard is removed. From PR #321 verification.
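The guard described above can be sketched roughly as follows. Only `BaseSandbox`, `is_mounted`, and the `service == "main"` condition come from this PR; `download`, `StubMountedSandbox`, and `fetch_artifacts` are hypothetical names, and the real verifier logic is likely more involved.

```python
from abc import ABC, abstractmethod


class BaseSandbox(ABC):
    @property
    def is_mounted(self) -> bool:
        # Default declared on the ABC, so callers need no getattr() fallback.
        return False

    @abstractmethod
    def download(self, path: str) -> str:
        """Hypothetical method standing in for an artifact download."""


class StubMountedSandbox(BaseSandbox):
    """Hypothetical test stub whose filesystem is mounted locally."""

    @property
    def is_mounted(self) -> bool:
        return True

    def download(self, path: str) -> str:
        return f"downloaded:{path}"


def fetch_artifacts(sandbox: BaseSandbox, service: str, path: str) -> str:
    # The fast path applies only when the verifier runs in the main service;
    # a mounted sandbox says nothing about files inside another service.
    if service == "main" and sandbox.is_mounted:
        return f"local:{path}"
    return sandbox.download(path)
```

With a stub like this, a test for `is_mounted=True` plus `service="target"` expects the download path to be taken; removing the `service == "main"` condition would make that test fail.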
The acpx: key convention is load-bearing in two coupled sites: _acpx_wrap mints the prefixed name and resolve_agent round-trips it. Make resolve_agent_key the documented single owner of the namespace via an ACPX_KEY_PREFIX constant and acpx_runtime_key() helper, with a module-level contract comment so the coupling is explicit rather than implied by an easily-missed comment. Behavior unchanged. From PR #322 verification.
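A rough illustration of the single-owner idea. `ACPX_KEY_PREFIX`, `acpx_runtime_key`, and `resolve_agent_key` are named in this PR, but the signatures and the tuple return shape below are assumptions, not the repository's actual API.

```python
# Contract: the "acpx:" namespace is minted and parsed only in this module.
ACPX_KEY_PREFIX = "acpx:"


def acpx_runtime_key(agent_name: str) -> str:
    """Mint the prefixed runtime key for an agent name."""
    return f"{ACPX_KEY_PREFIX}{agent_name}"


def resolve_agent_key(key: str) -> tuple[str, bool]:
    """Return (agent_name, is_acpx); the return shape is hypothetical."""
    if key.startswith(ACPX_KEY_PREFIX):
        return key[len(ACPX_KEY_PREFIX):], True
    return key, False
```

Keeping both the minting and the parsing next to one constant means a change to the prefix cannot silently break the round trip.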
_pick_split_file matched basenames starting with "{split}-", which could
pick a sibling subset like test-small-00000-of-00001.parquet for
split="test". Anchor the sharded match on the HF {split}-NNNNN-of-NNNNN
convention so only the genuine shard resolves.
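The anchored match can be sketched like this. The function name, parameters, and the default `.parquet` suffix are assumptions for illustration; the pattern follows the `^{split}-\d+-of-\d+` form given in the PR description.

```python
import re
from typing import Iterable, Optional


def pick_split_file(
    basenames: Iterable[str], split: str, suffix: str = ".parquet"
) -> Optional[str]:
    """Return the first basename that is the exact or sharded file for split."""
    exact = f"{split}{suffix}"
    # Digits must follow "{split}-" directly, so "test-small-..." cannot match.
    sharded = re.compile(rf"^{re.escape(split)}-\d+-of-\d+")
    for name in basenames:
        if name == exact or sharded.match(name):
            return name
    return None
```

A plain `startswith(f"{split}-")` check would accept `test-small-00000-of-00001.parquet` for `split="test"`; the anchored pattern rejects it because `small` is not a shard number.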
_wrap_command_with_env_file chained rm -f with &&, so a failed sourcing
step short-circuited the cleanup and leaked the temp env file. Use
trap 'rm -f ...' EXIT so the env file is removed unconditionally.
From PR #323 verification.
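To illustrate the cleanup change, here is a sketch that builds a wrapped shell command and runs it against an env file whose sourcing fails. The function signature and the demo helper are hypothetical; only the `trap 'rm -f ...' EXIT` technique is taken from the commit message.

```python
import os
import shlex
import subprocess
import tempfile


def wrap_command_with_env_file(command: str, env_path: str) -> str:
    """Source env_path, run command, and always delete env_path on exit."""
    quoted_path = shlex.quote(env_path)
    # Quote the whole trap action so a path with special characters stays safe.
    cleanup = shlex.quote(f"rm -f {quoted_path}")
    return f"trap {cleanup} EXIT; . {quoted_path} && {command}"


def demo_cleanup_on_failed_source() -> tuple[int, str, bool]:
    """Run the wrapper with an env file that fails when sourced."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as fh:
        fh.write("false\n")  # sourcing returns non-zero, so && short-circuits
    proc = subprocess.run(
        ["sh", "-c", wrap_command_with_env_file("echo ran", path)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, os.path.exists(path)
```

With the older `. file && cmd && rm -f file` shape, the failed sourcing step would skip the `rm`. The EXIT trap runs when the shell exits regardless of which step failed, so the file is removed either way.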
test_api_failure_surfaces_original_error referenced last_error, which was deleted in PR #324 — fix the prose. Convert the ~18 sync tests still using asyncio.run() to async def + await, removing the latent order-dependence class that PR #325 fixed elsewhere (the repo runs asyncio_mode = "auto"). Mechanical conversion only; assertions unchanged. Full suite verified in default, reverse, and shuffled file orderings. From PR #324 / #325 verification.
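The mechanical conversion looks roughly like the following. The `judge` coroutine and both test names are hypothetical; the sketch assumes pytest-asyncio with `asyncio_mode = "auto"`, under which `async def` tests are collected and awaited without a marker.

```python
import asyncio


async def judge(prompt: str) -> str:
    """Hypothetical coroutine standing in for the code under test."""
    await asyncio.sleep(0)
    return prompt.upper()


# Before: a sync test that creates and tears down its own event loop.
def test_judge_sync() -> None:
    assert asyncio.run(judge("ok")) == "OK"


# After: the test runs on the loop managed by the pytest plugin.
async def test_judge_async() -> None:
    assert await judge("ok") == "OK"
```

The assertion is identical in both forms; only the way the coroutine is driven changes, which is why the conversion leaves test behavior unchanged.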
Batch of minor, non-blocking follow-ups surfaced by verification teams across PRs #320–#324 (and #325). Each item is verified-real; changes are kept minimal and behavior-preserving.
Items
1. Added a single classifier (classify_result/classify_result_dict in src/benchflow/_utils/scoring.py). Evaluation.run() and TaskMetrics now both consume it, so passed + failed + errored + verifier_errored == total holds structurally. The classifier closes a latent gap (no-reward/no-error → errored) so the buckets are exhaustive, not just disjoint.
2. Rewrote test_error_and_verifier_error_counted_once to call the real classify_result_dict instead of re-implementing the bucketing inline.
3. Dropped the dead getattr (#321). verifier.py now uses direct self._task.config.verifier.service access (service is always a VerifierConfig field).
4. is_mounted on the ABC (#321). Declared is_mounted on BaseSandbox (default False); dropped the getattr(self._sandbox, "is_mounted", False) in favor of direct access.
5. Added a verifier test covering is_mounted=True + service="target". It fails if the service == "main" guard on the is_mounted fast path is removed (verified by manual mutation).
6. acpx: round-trip contract (#322). resolve_agent_key is now the documented single owner of the acpx: namespace, via an ACPX_KEY_PREFIX constant, an acpx_runtime_key() helper, and a module-level contract comment. Behavior unchanged.
7. _pick_split_file regex (#323). Tightened the HF split match to the sharded ^{split}-\d+-of-\d+ convention so a sibling subset like test-small-00000-of-00001.parquet is not mistaken for split="test". Exact {split}{suffix} match kept.
8. _wrap_command_with_env_file now uses trap 'rm -f ...' EXIT so the temp env file is deleted unconditionally, even when a failed sourcing step short-circuits the && chain.
9. Fixed the test_api_failure_surfaces_original_error docstring prose that referenced the deleted last_error.
10. Converted the asyncio.run() tests in test_llm_judge.py to async def + await (repo runs asyncio_mode = "auto"). Mechanical conversion; assertions unchanged.
Verification
ruff format --check, ruff check ., ty check src/: all clean.
pytest tests/: 1310 passed, 27 skipped, run in 3 file orderings (default, reverse, shuffled) to confirm item 10 introduced no order-dependence.
Note
Medium Risk
Medium risk because it changes how Evaluation.run() and metrics count pass/fail/error buckets and tweaks verifier download behavior; regressions would skew reported scores or miss verifier artifacts.
Overview
Introduces a shared terminal result classifier (classify_result/classify_result_dict) and switches both Evaluation.run() and TaskMetrics to use it, making the passed/failed/errored/verifier-errored buckets disjoint and exhaustive (including treating rewardless, errorless results as errored).
Tightens a few runtime contracts and edge cases: documents and centralizes the acpx: runtime-key namespace (ACPX_KEY_PREFIX/acpx_runtime_key), makes BaseSandbox.is_mounted an explicit API and uses it directly in Verifier, ensures Docker exec env temp files are always cleaned up via trap ... EXIT, and fixes HF trace split selection to avoid picking subset shards (e.g. test-small-*).
Tests are updated/added to assert the shared classifier behavior, cover mounted-vs-target verifier downloads, and convert test_llm_judge.py to async/await (no more asyncio.run()).
Reviewed by Cursor Bugbot for commit 1393c63.