fix: clean OpenClaw template files from judge workspace to prevent grading failures #59
Open
JackyCSer wants to merge 1 commit into pinchbench:main from
When `openclaw agents add` creates the judge agent, it scaffolds `AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md` and other template files into the workspace. These files instruct the agent to perform a bootstrap/personality flow (read `SOUL.md`, introduce itself, etc.) before processing any prompt. This causes the judge agent to ignore the grading prompt entirely and output bootstrap text like "I'm Claw, bootstrap complete..." instead of the expected JSON scores.

The JSON parser then fails with `Failed to parse judge JSON response`, giving 0 scores to all `llm_judge` and `hybrid` tasks — regardless of how well the tested model actually performed. In our testing with `doubao-seed-2.0-pro`, 13 out of 23 tasks were affected (7 pure `llm_judge` + 6 `hybrid`), severely underestimating the model's actual score (reported 40% vs estimated 70%+ actual).

Fix: add `_clean_judge_workspace()`, which removes scaffolded template files after agent creation, ensuring the judge operates as a pure grading function.
## Problem

When `openclaw agents add` creates the judge agent, it scaffolds `AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md` and other template files into the workspace. These files instruct the agent to perform a bootstrap/personality flow (read `SOUL.md`, introduce itself, etc.) before processing any prompt.

This causes the judge agent to ignore the grading prompt entirely and output bootstrap text like `I'm Claw 🦾, bootstrap complete...` instead of the expected JSON scores. The JSON parser then fails with `Failed to parse judge JSON response`, giving 0 scores to all `llm_judge` and `hybrid` tasks — regardless of how well the tested model actually performed.

## Impact
In our testing with `doubao-seed-2.0-pro`:

- 13 out of 23 tasks were affected (7 pure `llm_judge` + 6 `hybrid`)
- Affected tasks scored 0 with `breakdown: {}` and `notes: ""` — the judge never evaluated them
- The model's overall score was severely underestimated (reported 40% vs estimated 70%+ actual)

## Root Cause
Note: `prepare_task_workspace` (for the tested agent) does `shutil.rmtree` before each task, so the tested agent is unaffected. Only the judge workspace persists these template files.

## Fix
Add `_clean_judge_workspace()`, which removes OpenClaw-scaffolded template files (`AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md`, etc.) after agent creation, ensuring the judge operates as a pure grading function.

The cleanup runs on every `_ensure_judge_agent` call (idempotent), so it also handles cases where the agent already exists but the workspace was previously polluted.

## Testing
- `llm_judge` tasks received proper JSON scores