Description
The current benchmark flow appears to treat the return of:
subprocess.run(["openclaw", "agent", ...])
as the signal that a task has finished executing.
However, this is not always a reliable completion signal. A transcript file may already exist, or the CLI process may return, while the agent output is still being flushed/written. As a result, benchmarking/grading can start before the task is actually complete.
Problem
Right now, task completion is effectively inferred from the openclaw agent subprocess returning, and _load_transcript() is only used to discover and parse the transcript afterward.
This can lead to a race condition:
- openclaw agent returns
- transcript exists but is not yet complete
- benchmark reads partial transcript
- grading starts too early
Expected behavior
A task should only be considered complete after the transcript itself indicates that the final assistant message has finished.
Proposed solution
When loading/polling the transcript, explicitly check whether the end of the transcript contains an assistant message with:
"stopReason": "stop"
For example:
{"type":"message","id":"5d72205e","parentId":"87b29e33","timestamp":"2026-03-18T00:48:53.813Z","message":{"role":"assistant","content":[{"type":"text","text":"期待您的 V2.0!🤝"}],"api":"openai-completions","usage":{"input":10539,"output":218,"cacheRead":0,"cacheWrite":0,"totalTokens":10757,"cost":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"total":0}},"stopReason":"stop","timestamp":1773794928579}}
Suggested implementation direction
Instead of only retrying until a transcript file is found, keep polling until one of the following is true:
- the transcript tail contains a final assistant message with "stopReason":"stop"
- the task timeout is reached
- the process clearly failed
This would make task completion detection depend on the transcript's terminal state, which is much closer to the real end of execution than subprocess return alone.
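The polling conditions above could be sketched roughly as follows. This is only an illustration, not the benchmark's actual code: it assumes the transcript is a JSON Lines file (one record per line, as the example record suggests) and that the terminal record is the last non-empty line. The names transcript_is_complete, wait_for_transcript, and the process_failed callback are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable


def transcript_is_complete(path: Path) -> bool:
    """True if the last non-empty line is an assistant message with stopReason "stop"."""
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except FileNotFoundError:
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    try:
        record = json.loads(lines[-1])
    except json.JSONDecodeError:
        # A half-written final line means the writer is still flushing.
        return False
    if not isinstance(record, dict):
        return False
    message = record.get("message")
    if not isinstance(message, dict):
        return False
    return (
        record.get("type") == "message"
        and message.get("role") == "assistant"
        and message.get("stopReason") == "stop"
    )


def wait_for_transcript(
    path: Path,
    timeout_s: float,
    process_failed: Callable[[], bool] = lambda: False,  # hypothetical failure probe
    poll_interval_s: float = 0.5,
) -> str:
    """Poll until the transcript is terminal, the process failed, or the timeout hits."""
    deadline = time.monotonic() + timeout_s
    while True:
        if transcript_is_complete(path):
            return "complete"
        if process_failed():
            return "failed"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_interval_s)
```

In this sketch, grading would start only when wait_for_transcript returns "complete". Whether other stopReason values (for example an error or length limit) should also count as terminal is left open here and would need to be decided against the real transcript format.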
Why this is better
- avoids grading on partial transcripts
- avoids false "finished" states when output is still being written
- makes benchmark execution more robust against asynchronous or delayed transcript persistence