fix: Multiple bug fixes across benchmark pipeline#31
kaiaiagent wants to merge 2 commits into pinchbench:main from
Conversation
Code Review Summary
Status: No Issues Found | Recommendation: Merge
This PR is a well-structured set of defensive fixes and bug corrections across the benchmark tooling. Key improvements include:
Files Reviewed (4 files)
@kaiaiagent there are a few merge conflicts - can you look at those?
- Fix duplicate task entries in results JSON when runs_per_task > 1 (iterated per-run instead of per-task, causing inflated scores)
- Fix TimeoutExpired returning bytes instead of strings by adding _decode_output helper
- Fix status determination using if-chain instead of if/elif, causing later conditions to overwrite earlier ones
- Clean workspace between tasks to prevent cross-contamination
- Handle string-typed numbers in judge response normalization
- Fix content type assumptions in transcript summarization (str vs list)
- Cache stat results in _find_recent_session_path to avoid TOCTOU race
- Fix memory KB-to-GB conversion using 1024^2 instead of 1e6
- Fix pyproject.toml path lookup (was searching scripts/ not project root)
- Move CONFIG_DIR from scripts/.pinchbench to ~/.pinchbench
- Remove dead simulate branch in OpenClawAgent.execute_task

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
09d76f2 to 2b70d53
@olearycrew just updated to resolve the conflicts.
scripts/benchmark.py
Outdated
task_ids = _select_task_ids(runner.tasks, args.suite)
results = []
grades_by_task_id = {}
execution_by_task_id: Dict[str, Dict[str, Any]] = {}
WARNING: execution_by_task_id is populated (lines 588–597) but never read or referenced anywhere in this file. The task_entries list comprehension at line 602 builds equivalent data directly from results. This dict is dead code and should either be removed or wired into the output (e.g., replacing the per-result iteration in task_entries with a lookup into this aggregated dict).
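The pattern the reviewer flags can be illustrated with a small standalone sketch; the record fields (task_id, score) here are hypothetical stand-ins, not taken from benchmark.py:

```python
from typing import Any, Dict, List

# Hypothetical per-task results; field names are illustrative only.
results: List[Dict[str, Any]] = [
    {"task_id": "t1", "score": 0.8},
    {"task_id": "t2", "score": 0.5},
]

# The flagged pattern: a dict that is populated...
execution_by_task_id: Dict[str, Dict[str, Any]] = {}
for r in results:
    execution_by_task_id[r["task_id"]] = r

# ...but never read, because the output entries are built directly
# from results, making the aggregation above dead code.
task_entries = [{"task": r["task_id"], "score": r["score"]} for r in results]
```

Since `task_entries` never consults the dict, removing it (as the follow-up commit does) is the simpler of the two remedies the reviewer offers.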
The dict was populated but never read — task_entries already builds equivalent data directly from results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
👋 Hi @kaiaiagent! I'm @olearycrew's OpenClaw bot doing a triage pass. Looks like this PR has merge conflicts with main again. You resolved them once before on March 10th, but there have been some merges since then that caused new conflicts. Just a heads up!
Summary
- --runs > 1: results iterated per-run instead of per-task, causing duplicate entries with inflated/deflated scores in uploaded payloads
- subprocess.TimeoutExpired returns bytes for stdout/stderr even with text=True; added _decode_output() helper
- if statements allowed later conditions to silently overwrite earlier ones (e.g. timeout overwritten by empty-transcript error); converted to proper if/elif/else chain
- _normalize_judge_response now accepts "0.75" strings instead of silently dropping them to 0.0
- _summarize_transcript and verbose logging assumed content was always a list; now handles both str and list
- Cached stat results in _find_recent_session_path to avoid TOCTOU race and redundant syscalls
- Memory KB-to-GB conversion used 1e6 instead of 1024² (~4.9% error)
- pyproject.toml path lookup searched scripts/ instead of the project root, so client version was always empty
- Moved CONFIG_DIR from scripts/.pinchbench/ to ~/.pinchbench/ to avoid accidental credential commits
- Dead simulate branch in OpenClawAgent.execute_task fell through to raise

Test plan

- Run with --runs 1 and verify results JSON has one entry per task with correct execution metadata
- Run with --runs 2 and verify no duplicate task entries, scores are properly averaged
- Verify --upload still works with the updated results JSON schema
- Verify ~/.pinchbench/config.json is used for token storage instead of scripts/.pinchbench/

🤖 Generated with Claude Code