Skip to content

benchmark: wait for transcript completion instead of assuming subprocess exit means task finished #65

@jiaxin576

Description

@jiaxin576

The current benchmark flow appears to treat the return of:

subprocess.run(["openclaw", "agent", ...])

as the signal that a task has finished executing.

However, this is not always a reliable completion signal. A transcript file may already exist, or the CLI process may return, while the agent output is still being flushed/written. As a result, benchmarking/grading can start before the task is actually complete.

Problem

Right now, task completion is effectively inferred from the openclaw agent subprocess returning, and _load_transcript() is only used to discover and parse the transcript afterward.

This can lead to a race condition:

  1. openclaw agent returns
  2. transcript exists but is not yet complete
  3. benchmark reads partial transcript
  4. grading starts too early

Expected behavior

A task should only be considered complete after the transcript itself indicates that the final assistant message has finished.

Proposed solution

When loading/polling the transcript, explicitly check whether the end of the transcript contains an assistant message with:

"stopReason": "stop"

For example:

{"type":"message","id":"5d72205e","parentId":"87b29e33","timestamp":"2026-03-18T00:48:53.813Z","message":{"role":"assistant","content":[{"type":"text","text":"期待您的 V2.0!🤝"}],"api":"openai-completions","usage":{"input":10539,"output":218,"cacheRead":0,"cacheWrite":0,"totalTokens":10757,"cost":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"total":0}},"stopReason":"stop","timestamp":1773794928579}}

Suggested implementation direction

Instead of only retrying until a transcript file is found, keep polling until one of the following is true:

  1. the transcript tail contains a final assistant message with "stopReason":"stop"
  2. the task timeout is reached
  3. the process clearly failed

This would make task completion detection depend on the transcript's terminal state, which is much closer to the real end of execution than subprocess return alone.

Why this is better

  • avoids grading on partial transcripts
  • avoids false "finished" states when output is still being written
  • makes benchmark execution more robust against asynchronous or delayed transcript persistence

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions