
fix: clean OpenClaw template files from judge workspace to prevent grading failures#59

Open
JackyCSer wants to merge 1 commit into pinchbench:main from JackyCSer:fix/judge-workspace-template-pollution

Conversation

@JackyCSer

Problem

When `openclaw agents add` creates the judge agent, it scaffolds `AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md`, and other template files into the workspace. These files instruct the agent to perform a bootstrap/personality flow (read `SOUL.md`, introduce itself, etc.) before processing any prompt.

This causes the judge agent to ignore the grading prompt entirely and output bootstrap text like "I'm Claw 🦾, bootstrap complete..." instead of the expected JSON scores. The JSON parser then fails with `Failed to parse judge JSON response`, giving 0 scores to all `llm_judge` and hybrid tasks — regardless of how well the tested model actually performed.

Impact

In our testing with doubao-seed-2.0-pro:

  • 13 out of 23 tasks were affected (7 pure llm_judge + 6 hybrid)
  • Reported score: 40% (9.19/23)
  • Estimated actual score: 70%+ (after excluding judge failures)
  • All affected tasks had `breakdown: {}` and `notes: ""` — the judge never evaluated them

Root Cause

`openclaw agents add judge-agent`
→ scaffolds AGENTS.md, SOUL.md, BOOTSTRAP.md into workspace
→ AGENTS.md says: "Before doing anything, read SOUL.md, USER.md..."
→ Judge receives grading prompt but executes bootstrap flow instead
→ Outputs personality text, not JSON
→ `_parse_judge_response` finds no JSON → returns `{}`
→ score = 0.0 for all judge-graded tasks

Note: `prepare_task_workspace` (for the tested agent) does `shutil.rmtree` before each task, so the tested agent is unaffected. Only the judge workspace persists these template files.

Fix

Add `_clean_judge_workspace()` that removes OpenClaw-scaffolded template files (`AGENTS.md`, `SOUL.md`, `BOOTSTRAP.md`, etc.) after agent creation, ensuring the judge operates as a pure grading function.

The cleanup runs on every `_ensure_judge_agent` call (idempotent), so it also handles cases where the agent already exists but the workspace was previously polluted.

Testing

  • Verified that after cleaning template files mid-run, subsequent `llm_judge` tasks received proper JSON scores
  • The fix is backward-compatible — if the workspace is already clean, it's a no-op

