benchmark: wait for transcript completion instead of assuming subprocess exit means task finished

The current benchmark flow appears to treat the return of:

```python
subprocess.run(["openclaw", "agent", ...])
```

as the signal that a task has finished executing.

However, this is not a reliable completion signal. The CLI process may return, and a transcript file may already exist, while the agent output is still being flushed to disk. As a result, grading can start before the task is actually complete.

### Problem

Right now, task completion is effectively inferred from the `openclaw agent` subprocess returning, and `_load_transcript()` is only used to discover and parse the transcript afterward.

This can lead to a race condition:

1. `openclaw agent` returns
2. transcript exists but is not yet complete
3. benchmark reads partial transcript
4. grading starts too early

### Expected behavior

A task should only be considered complete after the transcript itself indicates that the final assistant message has finished.

### Proposed solution

When loading/polling the transcript, explicitly check whether the end of the transcript contains an assistant message with:

```json
"stopReason": "stop"
```

For example:

```json
{"type":"message","id":"5d72205e","parentId":"87b29e33","timestamp":"2026-03-18T00:48:53.813Z","message":{"role":"assistant","content":[{"type":"text","text":"期待您的 V2.0！🤝"}],"api":"openai-completions","usage":{"input":10539,"output":218,"cacheRead":0,"cacheWrite":0,"totalTokens":10757,"cost":{"input":0,"output":0,"cacheRead":0,"cacheWrite":0,"total":0}},"stopReason":"stop","timestamp":1773794928579}}
```
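A minimal sketch of such a check, assuming the transcript is stored as JSONL (one JSON object per line, as in the entry above). The function name `transcript_is_complete` is hypothetical and not part of the current benchmark code:

```python
import json
from pathlib import Path


def transcript_is_complete(path: Path) -> bool:
    """Return True if the last non-empty line is an assistant message
    whose stopReason is "stop". Assumes a JSONL transcript."""
    try:
        lines = path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        # No transcript yet, so the task cannot be complete.
        return False

    for raw in reversed(lines):
        if not raw.strip():
            continue  # skip trailing blank lines
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            # A half-written final line suggests the writer is still flushing.
            return False
        if not isinstance(entry, dict):
            return False
        message = entry.get("message")
        if not isinstance(message, dict):
            return False
        return (
            entry.get("type") == "message"
            and message.get("role") == "assistant"
            and message.get("stopReason") == "stop"
        )

    # File exists but contains no entries.
    return False
```

Only the last non-empty line is inspected, so an earlier assistant turn with `"stopReason":"stop"` followed by further tool calls or messages does not count as completion.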

### Suggested implementation direction

Instead of only retrying until a transcript file is found, keep polling until one of the following is true:

1. the transcript tail contains a final assistant message with `"stopReason":"stop"`
2. the task timeout is reached
3. the process clearly failed (for example, it exited with a non-zero status)

This would make task completion detection depend on the transcript's terminal state, which is much closer to the real end of execution than subprocess return alone.
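The three exit conditions could be sketched roughly as follows. This is an illustration, not the existing implementation: `wait_for_transcript`, `WaitResult`, and the parameters are hypothetical names, the completeness check is passed in as a callable, and treating a non-zero return code as "clearly failed" is an assumption.

```python
import time
from enum import Enum
from pathlib import Path
from typing import Callable, Optional


class WaitResult(Enum):
    COMPLETE = "complete"
    TIMEOUT = "timeout"
    FAILED = "failed"


def wait_for_transcript(
    path: Path,
    is_complete: Callable[[Path], bool],
    timeout_s: float,
    returncode: Optional[int] = None,
    poll_interval_s: float = 0.5,
) -> WaitResult:
    """Poll until the transcript reaches a terminal state, the timeout
    expires, or the subprocess is known to have failed."""
    # Assumption: a non-zero exit code means the run clearly failed.
    if returncode is not None and returncode != 0:
        return WaitResult.FAILED

    deadline = time.monotonic() + timeout_s
    while True:
        # Check first so an already-complete transcript returns immediately.
        if is_complete(path):
            return WaitResult.COMPLETE
        if time.monotonic() >= deadline:
            return WaitResult.TIMEOUT
        time.sleep(poll_interval_s)
```

Using `time.monotonic()` keeps the deadline stable if the wall clock is adjusted during a run. Grading would then start only on `WaitResult.COMPLETE`, with the other two results reported as task failures.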

### Why this is better

- avoids grading on partial transcripts
- avoids false "finished" states when output is still being written
- makes benchmark execution more robust against asynchronous or delayed transcript persistence


