fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572
fix: surface test-stdout.txt in verifier exception for dep_install classification (#540)#572ElegantLin wants to merge 2 commits into
Conversation
…l classifier fires (#540) The verifier_error_category=dep_install classification could never fire in production because RewardFileNotFoundError only contained file paths, not pip's actual error output. Now the exception tails test-stdout.txt (last 30 lines) into the message, letting classify_verifier_error see markers like "no solution found" from real pip failures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| f"{self._rollout_paths.reward_json_path}" | ||
| ) | ||
| if stdout_tail: | ||
| msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}" |
There was a problem hiding this comment.
🟡 Hardcoded "30" in error message drifts from _TAIL_LINES constant
The error message at src/benchflow/task/verifier.py:317 uses a hardcoded string "last 30 lines" instead of referencing the _TAIL_LINES constant defined at src/benchflow/task/verifier.py:63. If _TAIL_LINES is ever changed (e.g. to 50), _tail_file will return a different number of lines but the error message will still claim "last 30 lines", misleading anyone triaging verifier failures.
| msg += f"\n--- test-stdout.txt (last 30 lines) ---\n{stdout_tail}" | |
| msg += f"\n--- test-stdout.txt (last {_TAIL_LINES} lines) ---\n{stdout_tail}" |
Was this helpful? React with 👍 or 👎 to provide feedback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bingran-you
left a comment
There was a problem hiding this comment.
Thermo-nuclear review verdict: request changes.
The dependency-install classification improvement is useful, and the real Claude ACP smoke passed, but the current implementation surfaces raw verifier stdout through RewardFileNotFoundError / verifier_error. That widens the secret-leak blast radius because test-stdout.txt can contain env dumps, install URLs, tokens, or other untrusted subprocess output that later lands in summaries/dashboards.
Please redact before appending stdout tails, or keep raw stdout out of result metadata and classify from a private diagnostic path instead.
Additional maintainability issues:
_tail_file()reads/splits the whole file to return 30 lines. Please stream-tail it, e.g. withcollections.deque(f, maxlen=n).tests/test_verify.pygrows past 1k lines. Please move these regression tests to a focused verifier-output module.- Regression test docstrings should name the reviewed PR/fix explicitly, e.g. PR #572 / #540, per repo convention.
Verification notes:
- Branch-head checks passed:
uv sync, fullpytest,ty,ruff. - Current-merge
pytestandtypassed. - Current-merge
rufffailed only on existing main baselineUP041insrc/benchflow/sandbox/docker.py:408, not this PR. - Real
claude-agent-acpDocker smoke passed: reward1.0,trajectory_source=acp, 5 ACP events, 1 tool call, usageprovider_response, total tokens64349,llm_trajectory.jsonlpresent.
Summary
RewardFileNotFoundErrorinverifier.pyonly included file paths in its message, never pip's actual error output. Theclassify_verifier_errorfunction haddep_installmarkers ("no solution found", etc.) but could never match them because they lived intest-stdout.txt, not the exception.RewardFileNotFoundError, tail the last 30 lines oftest-stdout.txtinto the exception message. This letsclassify_verifier_errorsee pip's output and correctly returnverifier_dep_install.scoring.py— the classifier logic and markers were already correct (added in prior PRs); only the wiring was broken.Test plan
test_classify_verifier_errorcovering all dep_install markers in exception messagestest_dep_install_takes_precedence_over_generic_crash— precedence testTestRewardFileNotFoundSurfacesStdout— 4 end-to-end wiring tests: stdout tail present, classifier fires, missing file graceful, non-dep-install staysverifier_failureCloses #540
🤖 Generated with Claude Code