Verifier follow-ups batch (PR #320-#324) by xdotli · Pull Request #330 · benchflow-ai/benchflow

xdotli · 2026-05-21T21:51:34Z

Batch of minor, non-blocking follow-ups surfaced by verification teams across PRs #320–#324 (and #325). Each item is verified-real; changes are kept minimal and behavior-preserving.

Items

Disjointness helper (fix: disjoint error/verifier_error result buckets + skill_eval glob collision #320). Extracted one shared result-bucket classifier (classify_result / classify_result_dict in src/benchflow/_utils/scoring.py). Evaluation.run() and TaskMetrics now both consume it, so passed+failed+errored+verifier_errored == total holds structurally. The classifier closes a latent gap (no-reward/no-error → errored) so the buckets are exhaustive, not just disjoint.
Tautological test (fix: disjoint error/verifier_error result buckets + skill_eval glob collision #320). Rewrote test_error_and_verifier_error_counted_once to call the real classify_result_dict instead of re-implementing the bucketing inline.
Dead getattr (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). verifier.py now uses direct self._task.config.verifier.service access (service is always a VerifierConfig field).
is_mounted on the ABC (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). Declared is_mounted on BaseSandbox (default False); dropped the getattr(self._sandbox, "is_mounted", False) for direct access.
Missing test (feat(verifier): target-side test.sh verification for multi-container tasks (#248) #321). Added a verifier test covering is_mounted=True + service="target". It fails if the service == "main" and guard on the is_mounted fast path is removed (verified by manual mutation).
acpx: round-trip contract (fix(agents): repair acpx agent resolution; complete register_agent fields #322). resolve_agent_key is now the documented single owner of the acpx: namespace, via an ACPX_KEY_PREFIX constant, an acpx_runtime_key() helper, and a module-level contract comment. Behavior unchanged.
_pick_split_file regex (fix: traces split handling, COPY staging, exec env secrecy (audit findings) #323). Tightened the HF split match to the sharded ^{split}-\d+-of-\d+ convention so a sibling subset like test-small-00000-of-00001.parquet is not mistaken for split="test". Exact {split}{suffix} match kept.
env-file cleanup (fix: traces split handling, COPY staging, exec env secrecy (audit findings) #323). _wrap_command_with_env_file now uses trap 'rm -f ...' EXIT so the temp env file is deleted unconditionally even when the sourcing step fails the && chain.
Stale docstring (fix(rewards): _call_google returns '' instead of None on empty text #324). Fixed test_api_failure_surfaces_original_error docstring prose that referenced the deleted last_error.
asyncio-hardening continuation (test: harden async tests against event-loop order-dependence #325). Converted the ~18 sync asyncio.run() tests in test_llm_judge.py to async def + await (repo runs asyncio_mode = "auto"). Mechanical conversion; assertions unchanged.

Verification

ruff format --check, ruff check ., ty check src/ — all clean.
pytest tests/ — 1310 passed, 27 skipped, run in 3 file orderings (default, reverse, shuffled) to confirm item 10 introduced no order-dependence.

Note

Medium Risk
Medium risk because it changes how Evaluation.run() and metrics count pass/fail/error buckets and tweaks verifier download behavior; regressions would skew reported scores or miss verifier artifacts.

Overview
Introduces a shared terminal result classifier (classify_result/classify_result_dict) and switches both Evaluation.run() and TaskMetrics to use it, making the passed/failed/errored/verifier-errored buckets disjoint and exhaustive (including treating rewardless, errorless results as errored).

Tightens a few runtime contracts and edge cases: documents and centralizes the acpx: runtime-key namespace (ACPX_KEY_PREFIX/acpx_runtime_key), makes BaseSandbox.is_mounted an explicit API and uses it directly in Verifier, ensures Docker exec env temp files are always cleaned up via trap ... EXIT, and fixes HF trace split selection to avoid picking subset shards (e.g. test-small-*).

Tests are updated/added to assert the shared classifier behavior, cover mounted-vs-target verifier downloads, and convert test_llm_judge.py to async/await (no more asyncio.run()).

^{Reviewed by Cursor Bugbot for commit 1393c63. Bugbot is set up for automated code reviews on this repo. Configure here.}

The passed/failed/errored/verifier_errored bucketing was duplicated with textually-divergent predicates in Evaluation.run() and TaskMetrics. Extract a single classify_result/classify_result_dict in _utils/scoring.py so both sites consume one definition and the passed+failed+errored+verifier_errored == total invariant holds structurally. Rewrite the tautological test_error_and_verifier_error_counted_once to call the real classifier instead of re-implementing the logic inline. From PR #320 verification.

VerifierConfig.service is always a field and is_mounted is a property on every concrete sandbox, so the defensive getattr() calls in verifier.py are dead. Declare is_mounted on the BaseSandbox ABC (default False) and use direct attribute access for both service and is_mounted. Add a verifier test covering is_mounted=True + service="target": the is_mounted fast path is gated on service == "main", and that guard was previously untested because the test stub always had is_mounted=False. The new test fails if the guard is removed. From PR #321 verification.

The acpx: key convention is load-bearing in two coupled sites: _acpx_wrap mints the prefixed name and resolve_agent round-trips it. Make resolve_agent_key the documented single owner of the namespace via an ACPX_KEY_PREFIX constant and acpx_runtime_key() helper, with a module-level contract comment so the coupling is explicit rather than implied by an easily-missed comment. Behavior unchanged. From PR #322 verification.

_pick_split_file matched basenames starting with "{split}-", which could pick a sibling subset like test-small-00000-of-00001.parquet for split="test". Anchor the sharded match on the HF {split}-NNNNN-of-NNNNN convention so only the genuine shard resolves. _wrap_command_with_env_file chained rm -f with &&, so a failed sourcing step short-circuited the cleanup and leaked the temp env file. Use trap 'rm -f ...' EXIT so the env file is removed unconditionally. From PR #323 verification.

test_api_failure_surfaces_original_error referenced last_error, which was deleted in PR #324 — fix the prose. Convert the ~18 sync tests still using asyncio.run() to async def + await, removing the latent order-dependence class that PR #325 fixed elsewhere (the repo runs asyncio_mode = "auto"). Mechanical conversion only; assertions unchanged. Full suite verified in default, reverse, and shuffled file orderings. From PR #324 / #325 verification.

…s-batch

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

xdotli added 6 commits May 21, 2026 17:50

Merge remote-tracking branch 'origin/main' into fix/verifier-followup…

1393c63

…s-batch

xdotli merged commit b8d8973 into main May 21, 2026

devin-ai-integration Bot reviewed May 21, 2026

View reviewed changes

xdotli deleted the fix/verifier-followups-batch branch May 22, 2026 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verifier follow-ups batch (PR #320-#324)#330

Verifier follow-ups batch (PR #320-#324)#330
xdotli merged 6 commits into
mainfrom
fix/verifier-followups-batch

xdotli commented May 21, 2026 •

edited by cursor Bot

Loading

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xdotli commented May 21, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Items

Verification

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xdotli commented May 21, 2026 •

edited by cursor Bot

Loading