Switch eval harness to SDK streaming input mode to fix AskUserQuestion flakiness

## Summary

`eval/harness.ts` uses single-message input mode (`prompt: string` + `options.resume`), which doesn't support `Query.interrupt()` or mid-stream message injection. The 100ms `abortController.abort()` safety net and the "sessionId may not be set yet" patching are workarounds for a mode that doesn't fit the pause/resume use case.

The SDK's recommended path for interactive agents is **streaming input mode**: pass an `AsyncIterable<SDKUserMessage>` as `prompt`. That unlocks `interrupt()`, `setPermissionMode()`, and real mid-stream sends, with no cold restart per turn.
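A minimal sketch of what the `AsyncIterable` side could look like, assuming a hand-rolled inbox queue. `Inbox` and `UserMessage` are hypothetical names; `UserMessage` is a local stand-in for the SDK's `SDKUserMessage`, and its field names should be checked against the installed SDK version before use.

```typescript
// Local stand-in for the SDK's SDKUserMessage. The field names are an
// assumption taken from the SDK docs; swap in the real type when wiring up.
type UserMessage = {
  type: "user";
  message: { role: "user"; content: string };
  parent_tool_use_id: string | null;
  session_id: string;
};

// Hypothetical inbox: push() enqueues a user message, and the async iterator
// hands messages to the consumer in FIFO order, waiting when the queue is empty.
class Inbox implements AsyncIterable<UserMessage> {
  private queue: UserMessage[] = [];
  private waiters: Array<(r: IteratorResult<UserMessage>) => void> = [];
  private closed = false;

  push(content: string, sessionId = ""): void {
    if (this.closed) throw new Error("inbox is closed");
    const msg: UserMessage = {
      type: "user",
      message: { role: "user", content },
      parent_tool_use_id: null,
      session_id: sessionId,
    };
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: msg, done: false });
    else this.queue.push(msg);
  }

  // Ends the stream once already-queued messages have been drained.
  close(): void {
    this.closed = true;
    for (const w of this.waiters.splice(0)) w({ value: undefined, done: true });
  }

  [Symbol.asyncIterator](): AsyncIterator<UserMessage> {
    return {
      next: async (): Promise<IteratorResult<UserMessage>> => {
        const msg = this.queue.shift();
        if (msg) return { value: msg, done: false };
        if (this.closed) return { value: undefined, done: true };
        return new Promise<IteratorResult<UserMessage>>((resolve) => {
          this.waiters.push(resolve);
        });
      },
    };
  }
}
```

With something like this, the harness would pass the inbox as `prompt` to a single `query()` call and call `inbox.push(answer)` whenever the user replies, instead of starting a new `query()` with `options.resume`.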

## Findings

- **Current code** — `eval/harness.ts:305-399`, `src/providers/claude.ts:111`. SDK version: `@anthropic-ai/claude-agent-sdk ^0.2.50`.
- **Two documented workarounds already in the tree:**
  - `eval/harness.ts:262-264` — sessionId race between `init` message and `canUseTool`
  - `eval/harness.ts:280-282` — 100ms `setTimeout(abort, 100)` safety net because `interrupt: true` from canUseTool is unreliable on its own
- **Per the official docs** (`code.claude.com/docs/en/agent-sdk/typescript`), `interrupt()`, `setPermissionMode()`, and `setModel()` are **only available in streaming input mode**. String-prompt mode gives us none of these.
- **`Query.interrupt(): Promise<void>`** is awaitable and documented to "stop processing and return control to the caller", which is a stronger contract than our current deny + interrupt + abort triple.
- In streaming mode, `init` is guaranteed to arrive on the stream before any tool dispatch, so the sessionId race disappears.
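The sessionId point can be illustrated with a small tracker that reads the id off the stream and treats "tool dispatched before `init`" as a hard error instead of patching it afterwards. This is a sketch only: `SessionTracker` and `StreamMessage` are hypothetical, and `StreamMessage` is a simplified stand-in for the SDK's message union.

```typescript
// Simplified stand-in for the SDK's message union; only the fields this
// sketch needs are modelled, and the shapes are an assumption.
type StreamMessage =
  | { type: "system"; subtype: "init"; session_id: string }
  | { type: "assistant"; text: string }
  | { type: "result"; subtype: string };

// Hypothetical helper: records the session id from the init message so that
// later callbacks (for example canUseTool) can read it without a race.
class SessionTracker {
  private id: string | undefined;

  observe(msg: StreamMessage): void {
    if (msg.type === "system" && msg.subtype === "init") {
      this.id = msg.session_id;
    }
  }

  // Throws if called before init was seen, which makes an ordering
  // violation visible instead of silently patching the id later.
  requireId(): string {
    if (this.id === undefined) {
      throw new Error("tool dispatched before init message");
    }
    return this.id;
  }
}

// Drains a stream, feeding every message to the tracker.
async function drain(
  stream: AsyncIterable<StreamMessage>,
  tracker: SessionTracker,
): Promise<void> {
  for await (const msg of stream) {
    tracker.observe(msg);
  }
}
```

If the ordering guarantee holds, `requireId()` never throws in practice, and the post-hoc patch has nothing left to do.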

## Follow-up

Migration outline (details in research doc §4):

- [ ] Replace string `prompt` with an async generator + inbox queue at `eval/harness.ts:331`
- [ ] Drop `setTimeout(abort, 100)` safety net in favor of `await q.interrupt()`
- [ ] Remove the post-hoc sessionId patch at `eval/harness.ts:387-390` (no longer needed)
- [ ] Keep `options.resume` only for the cross-process CLI boundary (process exit between runs)
- [ ] Update `canUseTool` to the 3-arg signature and log `toolUseID` in the transcript
- [ ] Spike: confirm `canUseTool` still fires for `AskUserQuestion` in streaming mode and that `interrupt()` alone stops the turn cleanly (no abort fallback needed)

**Success criteria:** no safety-net abort, no sessionId patching, single `query()` call handles in-process pause/answer/continue.
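One possible shape for the pause path in the checklist above, sketched under assumptions: `PauseController` is a hypothetical helper, `Interruptible` models only the `interrupt()` method of the SDK's `Query`, and the `options` argument (including `toolUseID`) mirrors the 3-arg `canUseTool` signature as this issue describes it. Whether a plain deny followed by `interrupt()` stops the turn cleanly is exactly what the spike item is meant to confirm.

```typescript
// Only the slice of the SDK's Query that this sketch relies on.
interface Interruptible {
  interrupt(): Promise<void>;
}

// Local stand-in for the SDK's permission result (assumed shape).
type PermissionResult =
  | { behavior: "allow"; updatedInput: Record<string, unknown> }
  | { behavior: "deny"; message: string };

type PendingQuestion = { toolUseID: string; input: Record<string, unknown> };

// Hypothetical controller: records the pending AskUserQuestion call and
// pauses the turn with interrupt() instead of a timed abort.
class PauseController {
  pending: PendingQuestion | undefined;
  readonly transcript: string[] = [];

  // 3-arg callback; the options shape is an assumption, see lead-in.
  canUseTool = async (
    toolName: string,
    input: Record<string, unknown>,
    options: { signal: AbortSignal; toolUseID: string },
  ): Promise<PermissionResult> => {
    this.transcript.push(`canUseTool ${toolName} ${options.toolUseID}`);
    if (toolName !== "AskUserQuestion") {
      return { behavior: "allow", updatedInput: input };
    }
    this.pending = { toolUseID: options.toolUseID, input };
    return { behavior: "deny", message: "Paused: waiting for the user's answer." };
  };

  // Called from the stream-consuming loop. Resolves to true once the turn
  // has been interrupted, false when there is nothing to pause for.
  async pauseIfNeeded(q: Interruptible): Promise<boolean> {
    if (!this.pending) return false;
    await q.interrupt();
    this.transcript.push(`interrupted ${this.pending.toolUseID}`);
    return true;
  }
}
```

After `pauseIfNeeded()` resolves to `true`, the harness would collect the user's answer and push it into the same input stream, so the existing `query()` call continues without a restart.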

Research doc: planning/research/2026-04-23-sdk-mid-stream-messaging.md
