Switch eval harness to SDK streaming input mode to fix AskUserQuestion flakiness #69

@codevibesmatter

Summary

eval/harness.ts uses single-message input mode (prompt: string + options.resume), which doesn't support Query.interrupt() or mid-stream message injection. The 100ms abortController.abort() safety net and the "sessionId may not be set yet" patching are workarounds for a mode that doesn't fit the pause/resume use case.

The SDK's recommended path for interactive agents is streaming input mode: pass an AsyncIterable<SDKUserMessage> as prompt. That unlocks interrupt(), setPermissionMode(), and real mid-stream send — no cold-restart per turn.
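For illustration, the AsyncIterable side of this could be a small push-able queue. This is only a sketch: `Inbox`, `userMessage`, and the local `UserMessage` type are hypothetical names, and the field layout of `UserMessage` is an assumption that should be checked against the SDKUserMessage typings in the installed SDK version.

```typescript
// Local stand-in for the SDK's SDKUserMessage. The field names are an
// assumption; verify them against the installed SDK typings.
type UserMessage = {
  type: "user";
  message: { role: "user"; content: string };
  parent_tool_use_id: string | null;
  session_id: string;
};

// Hypothetical push-able queue exposed as an AsyncIterable (single consumer).
class Inbox<T> implements AsyncIterable<T> {
  private buffered: T[] = [];
  private waiter: ((result: IteratorResult<T, void>) => void) | null = null;
  private closed = false;

  // Hand the item to a waiting consumer, or buffer it until one asks.
  push(item: T): void {
    if (this.closed) throw new Error("inbox is closed");
    if (this.waiter) {
      const resolve = this.waiter;
      this.waiter = null;
      resolve({ value: item, done: false });
    } else {
      this.buffered.push(item);
    }
  }

  // End the stream; items already buffered are still delivered.
  close(): void {
    this.closed = true;
    if (this.waiter) {
      const resolve = this.waiter;
      this.waiter = null;
      resolve({ value: undefined, done: true });
    }
  }

  async *[Symbol.asyncIterator](): AsyncGenerator<T, void, undefined> {
    while (true) {
      if (this.buffered.length > 0) {
        yield this.buffered.shift() as T;
        continue;
      }
      if (this.closed) return;
      // Park until push() or close() wakes us up.
      const next = await new Promise<IteratorResult<T, void>>((resolve) => {
        this.waiter = resolve;
      });
      if (next.done) return;
      yield next.value;
    }
  }
}

// Hypothetical helper that builds a plain-text user turn.
function userMessage(text: string, sessionId = ""): UserMessage {
  return {
    type: "user",
    message: { role: "user", content: text },
    parent_tool_use_id: null,
    session_id: sessionId,
  };
}
```

The idea would be to pass an `Inbox<SDKUserMessage>` instance as `prompt` and call `inbox.push(...)` whenever the harness has an answer to inject, instead of starting a new query with `options.resume`.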

Findings

  • Current code: eval/harness.ts:305-399, src/providers/claude.ts:111. SDK version: @anthropic-ai/claude-agent-sdk ^0.2.50.
  • Two documented workarounds already in the tree:
    • eval/harness.ts:262-264 — sessionId race between init message and canUseTool
    • eval/harness.ts:280-282 — 100ms setTimeout(abort, 100) safety net because interrupt: true from canUseTool is unreliable on its own
  • Per the official docs (code.claude.com/docs/en/agent-sdk/typescript), interrupt(), setPermissionMode(), setModel() are only available in streaming input mode. String-prompt mode gives us none of these.
  • Query.interrupt(): Promise<void> is awaitable and documented to "stop processing and return control to the caller", which is a stronger contract than our current deny + interrupt + abort triple.
  • In streaming mode, init is guaranteed to arrive on the stream before any tool dispatch, so the sessionId race disappears.
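The ordering claim in the last finding is something the spike could check mechanically rather than take on trust. Below is a rough sketch of a pass-through wrapper that records the session id from the init message and fails loudly if a tool_use block shows up first. `assertInitFirst` and the `StreamMessage` type are hypothetical; the message shapes are simplified local approximations, not the SDK's full message union, and a tool_use block in an assistant message is used here as a proxy for tool dispatch.

```typescript
// Simplified local message shapes (assumed); the real SDK union is larger.
type ContentBlock = { type: string; id?: string; name?: string };
type StreamMessage =
  | { type: "system"; subtype: "init"; session_id: string }
  | { type: "assistant"; message: { content: ContentBlock[] } }
  | { type: "result" };

// Hypothetical guard: yields every message unchanged, reports the session id
// once init arrives, and throws if a tool_use block is seen before init.
async function* assertInitFirst(
  stream: AsyncIterable<StreamMessage>,
  onSessionId: (id: string) => void,
): AsyncGenerator<StreamMessage, void, undefined> {
  let sessionId: string | null = null;
  for await (const msg of stream) {
    if (msg.type === "system" && msg.subtype === "init") {
      sessionId = msg.session_id;
      onSessionId(sessionId);
    } else if (msg.type === "assistant" && sessionId === null) {
      const usesTool = msg.message.content.some((b) => b.type === "tool_use");
      if (usesTool) throw new Error("tool_use observed before init message");
    }
    yield msg;
  }
}
```

If the guard never fires across the eval suite, that would support removing the sessionId patch; if it does fire, the patch (or an equivalent) is still needed.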

Follow-up

Migration outline (details in research doc §4):

  • Replace string prompt with an async generator + inbox queue at eval/harness.ts:331
  • Drop setTimeout(abort, 100) safety net in favor of await q.interrupt()
  • Remove the post-hoc sessionId patch at eval/harness.ts:387-390 (no longer needed)
  • Keep options.resume only for cross-process CLI boundary (process exit between runs)
  • Update canUseTool to the 3-arg signature and log toolUseID in the transcript
  • Spike: confirm canUseTool still fires for AskUserQuestion in streaming mode and that interrupt() alone stops the turn cleanly (no abort fallback needed)
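As a starting point for the canUseTool item, the handler might look roughly like the sketch below. `makeCanUseTool`, `TranscriptEntry`, and `onQuestion` are hypothetical names, and the `PermissionResult` and `CanUseToolOptions` types are local approximations of the SDK's permission types (the real options object carries more fields than `toolUseID`). Whether a plain deny is enough, or `interrupt: true` must still be returned alongside a separate `q.interrupt()` call, is exactly what the spike needs to confirm.

```typescript
// Local approximations of the SDK's permission types (assumed, not imported).
type ToolInput = Record<string, unknown>;
type PermissionResult =
  | { behavior: "allow"; updatedInput: ToolInput }
  | { behavior: "deny"; message: string; interrupt?: boolean };
type CanUseToolOptions = { toolUseID: string }; // real options have more fields

// Hypothetical transcript record for each permission request.
type TranscriptEntry = { kind: "tool_request"; toolName: string; toolUseID: string };

// Builds a 3-arg handler that logs every request and pauses on AskUserQuestion.
function makeCanUseTool(
  transcript: TranscriptEntry[],
  onQuestion: (toolUseID: string, input: ToolInput) => void,
) {
  return async (
    toolName: string,
    input: ToolInput,
    options: CanUseToolOptions,
  ): Promise<PermissionResult> => {
    transcript.push({ kind: "tool_request", toolName, toolUseID: options.toolUseID });
    if (toolName === "AskUserQuestion") {
      // Let the harness know a question is pending; the harness loop (not this
      // callback) would then be responsible for awaiting q.interrupt().
      onQuestion(options.toolUseID, input);
      return { behavior: "deny", message: "Paused: waiting for user answer" };
    }
    return { behavior: "allow", updatedInput: input };
  };
}
```

Keeping the interrupt call out of the callback avoids assuming anything about re-entrancy between canUseTool and interrupt(), which the docs quoted above do not spell out.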

Success criteria: no safety-net abort, no sessionId patching, single query() call handles in-process pause/answer/continue.

Research doc: planning/research/2026-04-23-sdk-mid-stream-messaging.md
