Skip to content

Verifier follow-ups batch (PR #320-#324)#330

Merged
xdotli merged 6 commits into
mainfrom
fix/verifier-followups-batch
May 21, 2026
Merged

Verifier follow-ups batch (PR #320-#324)#330
xdotli merged 6 commits into
mainfrom
fix/verifier-followups-batch

Conversation

@xdotli
Copy link
Copy Markdown
Member

@xdotli xdotli commented May 21, 2026

Batch of minor, non-blocking follow-ups surfaced by verification teams across PRs #320#324 (and #325). Each item is verified-real; changes are kept minimal and behavior-preserving.

Items

  1. Disjointness helper (fix: disjoint error/verifier_error result buckets + skill_eval glob collision #320). Extracted one shared result-bucket classifier (classify_result / classify_result_dict in src/benchflow/_utils/scoring.py). Evaluation.run() and TaskMetrics now both consume it, so passed+failed+errored+verifier_errored == total holds structurally. The classifier closes a latent gap (no-reward/no-error → errored) so the buckets are exhaustive, not just disjoint.
  2. Tautological test (fix: disjoint error/verifier_error result buckets + skill_eval glob collision #320). Rewrote test_error_and_verifier_error_counted_once to call the real classify_result_dict instead of re-implementing the bucketing inline.
  3. Dead getattr (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). verifier.py now uses direct self._task.config.verifier.service access (service is always a VerifierConfig field).
  4. is_mounted on the ABC (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). Declared is_mounted on BaseSandbox (default False); dropped the getattr(self._sandbox, "is_mounted", False) for direct access.
  5. Missing test (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). Added a verifier test covering is_mounted=True + service="target". It fails if the service == "main" and guard on the is_mounted fast path is removed (verified by manual mutation).
  6. acpx: round-trip contract (fix(agents): repair acpx agent resolution; complete register_agent fields #322). resolve_agent_key is now the documented single owner of the acpx: namespace, via an ACPX_KEY_PREFIX constant, an acpx_runtime_key() helper, and a module-level contract comment. Behavior unchanged.
  7. _pick_split_file regex (fix: traces split handling, COPY staging, exec env secrecy (audit findings) #323). Tightened the HF split match to the sharded ^{split}-\d+-of-\d+ convention so a sibling subset like test-small-00000-of-00001.parquet is not mistaken for split="test". Exact {split}{suffix} match kept.
  8. env-file cleanup (fix: traces split handling, COPY staging, exec env secrecy (audit findings) #323). _wrap_command_with_env_file now uses trap 'rm -f ...' EXIT so the temp env file is deleted unconditionally even when the sourcing step fails the && chain.
  9. Stale docstring (fix(rewards): _call_google returns '' instead of None on empty text #324). Fixed test_api_failure_surfaces_original_error docstring prose that referenced the deleted last_error.
  10. asyncio-hardening continuation (test: harden async tests against event-loop order-dependence #325). Converted the ~18 sync asyncio.run() tests in test_llm_judge.py to async def + await (repo runs asyncio_mode = "auto"). Mechanical conversion; assertions unchanged.

Verification

  • ruff format --check, ruff check ., ty check src/ — all clean.
  • pytest tests/ — 1310 passed, 27 skipped, run in 3 file orderings (default, reverse, shuffled) to confirm item 10 introduced no order-dependence.

Note

Medium Risk
Medium risk because it changes how Evaluation.run() and metrics count pass/fail/error buckets and tweaks verifier download behavior; regressions would skew reported scores or miss verifier artifacts.

Overview
Introduces a shared terminal result classifier (classify_result/classify_result_dict) and switches both Evaluation.run() and TaskMetrics to use it, making the passed/failed/errored/verifier-errored buckets disjoint and exhaustive (including treating rewardless, errorless results as errored).

Tightens a few runtime contracts and edge cases: documents and centralizes the acpx: runtime-key namespace (ACPX_KEY_PREFIX/acpx_runtime_key), makes BaseSandbox.is_mounted an explicit API and uses it directly in Verifier, ensures Docker exec env temp files are always cleaned up via trap ... EXIT, and fixes HF trace split selection to avoid picking subset shards (e.g. test-small-*).

Tests are updated/added to assert the shared classifier behavior, cover mounted-vs-target verifier downloads, and convert test_llm_judge.py to async/await (no more asyncio.run()).

Reviewed by Cursor Bugbot for commit 1393c63. Bugbot is set up for automated code reviews on this repo. Configure here.

xdotli added 6 commits May 21, 2026 17:50
The passed/failed/errored/verifier_errored bucketing was duplicated with
textually-divergent predicates in Evaluation.run() and TaskMetrics. Extract
a single classify_result/classify_result_dict in _utils/scoring.py so both
sites consume one definition and the
passed+failed+errored+verifier_errored == total invariant holds
structurally. Rewrite the tautological test_error_and_verifier_error_counted_once
to call the real classifier instead of re-implementing the logic inline.

From PR #320 verification.
VerifierConfig.service is always a field and is_mounted is a property on
every concrete sandbox, so the defensive getattr() calls in verifier.py
are dead. Declare is_mounted on the BaseSandbox ABC (default False) and
use direct attribute access for both service and is_mounted.

Add a verifier test covering is_mounted=True + service="target": the
is_mounted fast path is gated on service == "main", and that guard was
previously untested because the test stub always had is_mounted=False.
The new test fails if the guard is removed.

From PR #321 verification.
The acpx: key convention is load-bearing in two coupled sites: _acpx_wrap
mints the prefixed name and resolve_agent round-trips it. Make
resolve_agent_key the documented single owner of the namespace via an
ACPX_KEY_PREFIX constant and acpx_runtime_key() helper, with a
module-level contract comment so the coupling is explicit rather than
implied by an easily-missed comment. Behavior unchanged.

From PR #322 verification.
_pick_split_file matched basenames starting with "{split}-", which could
pick a sibling subset like test-small-00000-of-00001.parquet for
split="test". Anchor the sharded match on the HF {split}-NNNNN-of-NNNNN
convention so only the genuine shard resolves.

_wrap_command_with_env_file chained rm -f with &&, so a failed sourcing
step short-circuited the cleanup and leaked the temp env file. Use
trap 'rm -f ...' EXIT so the env file is removed unconditionally.

From PR #323 verification.
test_api_failure_surfaces_original_error referenced last_error, which was
deleted in PR #324 — fix the prose.

Convert the ~18 sync tests still using asyncio.run() to async def +
await, removing the latent order-dependence class that PR #325 fixed
elsewhere (the repo runs asyncio_mode = "auto"). Mechanical conversion
only; assertions unchanged. Full suite verified in default, reverse, and
shuffled file orderings.

From PR #324 / #325 verification.
@xdotli xdotli merged commit b8d8973 into main May 21, 2026
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

Open in Devin Review

@xdotli xdotli deleted the fix/verifier-followups-batch branch May 22, 2026 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant