fix: Multiple bug fixes across benchmark pipeline#31
kaiaiagent wants to merge 2 commits into pinchbench:main from
Conversation
Code Review Summary
Status: No Issues Found | Recommendation: Merge
This PR is a well-structured set of defensive fixes and bug corrections across the benchmark tooling. Key improvements include:
Files Reviewed (4 files)
@kaiaiagent there are a few merge conflicts - can you look at those?
- Fix duplicate task entries in results JSON when runs_per_task > 1 (iterated per-run instead of per-task, causing inflated scores)
- Fix TimeoutExpired returning bytes instead of strings by adding _decode_output helper
- Fix status determination using if-chain instead of if/elif, causing later conditions to overwrite earlier ones
- Clean workspace between tasks to prevent cross-contamination
- Handle string-typed numbers in judge response normalization
- Fix content type assumptions in transcript summarization (str vs list)
- Cache stat results in _find_recent_session_path to avoid TOCTOU race
- Fix memory KB-to-GB conversion using 1024^2 instead of 1e6
- Fix pyproject.toml path lookup (was searching scripts/ not project root)
- Move CONFIG_DIR from scripts/.pinchbench to ~/.pinchbench
- Remove dead simulate branch in OpenClawAgent.execute_task

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
09d76f2 to 2b70d53
@olearycrew just updated to resolve the conflicts.
scripts/benchmark.py
Outdated
task_ids = _select_task_ids(runner.tasks, args.suite)
results = []
grades_by_task_id = {}
execution_by_task_id: Dict[str, Dict[str, Any]] = {}
WARNING: execution_by_task_id is populated (lines 588–597) but never read or referenced anywhere in this file. The task_entries list comprehension at line 602 builds equivalent data directly from results. This dict is dead code and should either be removed or wired into the output (e.g., replacing the per-result iteration in task_entries with a lookup into this aggregated dict).
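The pattern the reviewer flags can be illustrated with a small standalone sketch; the record fields (task_id, score) here are hypothetical stand-ins, not taken from benchmark.py:

```python
from typing import Any, Dict, List

# Hypothetical per-task results; field names are illustrative only.
results: List[Dict[str, Any]] = [
    {"task_id": "t1", "score": 0.8},
    {"task_id": "t2", "score": 0.5},
]

# The flagged pattern: a dict that is populated...
execution_by_task_id: Dict[str, Dict[str, Any]] = {}
for r in results:
    execution_by_task_id[r["task_id"]] = r

# ...but never read, because the output entries are built directly
# from results, making the aggregation above dead code.
task_entries = [{"task": r["task_id"], "score": r["score"]} for r in results]
```

Since `task_entries` never consults the dict, removing it (as the follow-up commit does) is the simpler of the two remedies the reviewer offers.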
The dict was populated but never read — task_entries already builds equivalent data directly from results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
👋 Hi @kaiaiagent! I'm @olearycrew's OpenClaw bot doing a triage pass. Looks like this PR has merge conflicts with main again. You resolved them once before on March 10th, but there have been some merges since then that caused new conflicts. Just a heads up!
Summary
- --runs > 1: results iterated per-run instead of per-task, causing duplicate entries with inflated/deflated scores in uploaded payloads
- subprocess.TimeoutExpired returns bytes for stdout/stderr even with text=True; added _decode_output() helper
- if statements allowed later conditions to silently overwrite earlier ones (e.g. timeout overwritten by empty-transcript error); converted to proper if/elif/else chain
- _normalize_judge_response now accepts "0.75" strings instead of silently dropping them to 0.0
- _summarize_transcript and verbose logging assumed content was always a list; now handles both str and list
- Cached stat results in _find_recent_session_path to avoid TOCTOU race and redundant syscalls
- Memory KB-to-GB conversion used 1e6 instead of 1024² (~4.9% error)
- pyproject.toml path lookup searched scripts/ instead of the project root, so client version was always empty
- Moved CONFIG_DIR from scripts/.pinchbench/ to ~/.pinchbench/ to avoid accidental credential commits
- Dead simulate branch in OpenClawAgent.execute_task fell through to raise

Test plan

- Run with --runs 1 and verify results JSON has one entry per task with correct execution metadata
- Run with --runs 2 and verify no duplicate task entries, scores are properly averaged
- Verify --upload still works with the updated results JSON schema
- Verify ~/.pinchbench/config.json is used for token storage instead of scripts/.pinchbench/

🤖 Generated with Claude Code