
fix: clean OpenClaw template files from judge workspace to prevent grading failures#59

Open
JackyCSer wants to merge 1 commit into pinchbench:main from JackyCSer:fix/judge-workspace-template-pollution

Conversation

@JackyCSer

Problem

When `openclaw agents add` creates the judge agent, it scaffolds `AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md`, and other template files into the workspace. These files instruct the agent to perform a bootstrap/personality flow (read `SOUL.md`, introduce itself, etc.) before processing any prompt.

This causes the judge agent to ignore the grading prompt entirely and output bootstrap text like "I'm Claw 🦾, bootstrap complete..." instead of the expected JSON scores. The JSON parser then fails with `Failed to parse judge JSON response`, giving 0 scores to all `llm_judge` and hybrid tasks — regardless of how well the tested model actually performed.

Impact

In our testing with doubao-seed-2.0-pro:

  • 13 out of 23 tasks were affected (7 pure llm_judge + 6 hybrid)
  • Reported score: 40% (9.19/23)
  • Estimated actual score: 70%+ (after excluding judge failures)
  • All affected tasks had `breakdown: {}` and `notes: ""` — the judge never evaluated them

Root Cause

`openclaw agents add judge-agent`
→ scaffolds AGENTS.md, SOUL.md, BOOTSTRAP.md into workspace
→ AGENTS.md says: "Before doing anything, read SOUL.md, USER.md..."
→ Judge receives grading prompt but executes bootstrap flow instead
→ Outputs personality text, not JSON
→ `_parse_judge_response` finds no JSON → returns `{}`
→ score = 0.0 for all judge-graded tasks

Note: `prepare_task_workspace` (for the tested agent) does `shutil.rmtree` before each task, so the tested agent is unaffected. Only the judge workspace persists these template files.

Fix

Add `_clean_judge_workspace()` that removes OpenClaw-scaffolded template files (`AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md`, etc.) after agent creation, ensuring the judge operates as a pure grading function.

The cleanup runs on every `_ensure_judge_agent` call (idempotent), so it also handles cases where the agent already exists but the workspace was previously polluted.

Testing

  • Verified that after cleaning template files mid-run, subsequent `llm_judge` tasks received proper JSON scores
  • The fix is backward-compatible — if the workspace is already clean, it's a no-op

