Description
Hi all.
I have repeatedly been hitting a frustrating bug in complex, long-running sessions with Claude. I did some digging with Claude's help, consulted the contribution guidelines, and ran through things as deeply as I could before filing this report. Please let me know if there is anything I should change, or any process I may be missing. I have some code to reference here as well.
This relates to issues #842 and #844, but goes further on reproduction and explanation, tries to offer some potential solutions, and adds a real-world example. It also serves as the reference and possible solution for what is outlined in #868, as a possible PR for discussion.
Summary
When using streaming responses (messages.stream() or messages.create({ stream: true })), if the server stops sending SSE events mid-stream (due to network issues, server stalls, or connection problems), the client will wait indefinitely. There is currently no mechanism to detect and abort stalled streams.
Problem Description
Current Behavior: A Real-World Failure
The SDK provides two timeout mechanisms:
- `timeout` option: overall request timeout (default: 10 minutes)
- `AbortController`: manual cancellation via signals
Neither mechanism detects or handles stalled streams. Here's what happened in a real production session:
The Incident
A Claude Code CLI session (f35c7c15-802e-462b-b468-b000a96e40bb) was performing a multi-step task using claude-opus-4-5-20251101. The session had been running successfully, accumulating 12,896 input tokens and 4,059 output tokens across multiple tool calls.
At 20:22:46 UTC, the API began streaming a response. The thinking block completed successfully:
```jsonc
{
  "message": {
    "model": "claude-opus-4-5-20251101",
    "id": "msg_018BDP6sdKEzgJ5GS1rn7NW1",
    "content": [
      {
        "type": "thinking",
        "thinking": "I can see my active trace session `3c62b735`. It shows 12896 input tokens, 4059 output tokens..."
      }
    ],
    "stop_reason": null,        // <-- Stream still in progress
    "usage": {
      "output_tokens": 4,       // <-- Only 4 tokens output, thinking block only
      "cache_read_input_tokens": 73111
    }
  },
  "timestamp": "2025-12-19T20:22:46.441Z"
}
```

The response was actively streaming (`stop_reason: null`), but after the thinking block, no more SSE events arrived. The stream didn't close, didn't error, didn't time out; it simply stopped sending data.
What the User Experienced
The CLI appeared frozen. No output, no error, no indication anything was wrong:
```
$ ps aux | grep claude
PID    %CPU  TIME     COMMAND
968    0.0   0:04.89  claude --dangerously-skip-permissions  # 0% CPU - waiting forever
18881  0.0   0:02.15  claude --dangerously-skip-permissions  # 0% CPU - waiting forever
```
Both processes at 0% CPU, sleeping indefinitely. The user's work was blocked with no way to know the session had stalled versus "still thinking."
Why Existing Timeouts Didn't Help
| Mechanism | Why It Failed |
|---|---|
| `timeout` (10 min default) | The request had already started receiving data, so the "request timeout" was satisfied |
| `AbortController` | Requires the user to manually abort, but how do they know it's stalled vs. slow? |
| TCP keepalive | Connection was technically alive, just not sending data |
| OS-level timeouts | No socket error because the connection wasn't dead |
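To make the `AbortController` row concrete, here is a minimal, self-contained sketch of the only workaround available today: a fixed overall deadline. Everything here (`slowButHealthy`, `consume`) is a hypothetical stand-in, not SDK code. The point is that a deadline short enough to catch stalls quickly also aborts streams that are slow but still making progress.

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Stand-in for an SDK stream that IS making progress: one event every 40 ms.
async function* slowButHealthy(signal: AbortSignal): AsyncGenerator<string> {
  for (const ev of ['message_start', 'content_block_delta', 'content_block_delta', 'message_stop']) {
    if (signal.aborted) throw new Error('aborted');
    await sleep(40);
    yield ev;
  }
}

// Manual watchdog: abort after a fixed wall-clock deadline, healthy or not.
async function consume(deadlineMs: number): Promise<string[]> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), deadlineMs);
  const events: string[] = [];
  try {
    for await (const ev of slowButHealthy(controller.signal)) events.push(ev);
  } finally {
    clearTimeout(timer); // avoid a dangling timer on success or failure
  }
  return events;
}
```

With a generous deadline the stream completes; with a tight one the watchdog kills a stream that was progressing fine, which is exactly why a fixed deadline cannot substitute for an idle timeout.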
Timeline of the Failure
| Time (UTC) | Event | Evidence |
|---|---|---|
| 20:22:43 | Session active, tools executing | Trace shows 12896↓ 4059↑ tokens |
| 20:22:46 | Thinking block completes | content[0].type: "thinking" |
| 20:22:47 | Tool use begins | name: "Bash" in content |
| 20:22:56 | Last recorded SSE event | Final JSONL entry |
| 20:24:25+ | Stream stalls | No more events, stop_reason still null |
| 20:37:00+ | User notices hang | ~15 minutes of lost productivity |
Network State at Time of Hang
```
# Connections to Anthropic API (34.36.57.103:443)
tcp4  SYN_SENT   34.36.57.103:443   # New connection attempts stuck
tcp4  SYN_SENT   34.36.57.103:443   # Handshake never completing
tcp4  TIME_WAIT  34.36.57.103:443   # Recent connections closing normally
tcp4  TIME_WAIT  34.36.57.103:443   # (multiple)
```
The original streaming connection doesn't appear in ESTABLISHED state—suggesting it may have been silently dropped at the network layer while the SDK continued waiting for events.
The Core Problem
```
Timeline:
─────────────────────────────────────────────────────────────────────────────
0s     Connection established, request sent                    ✓
0.5s   First SSE event: message_start                          ✓
2s     SSE event: thinking block                               ✓
5s     SSE event: content_block_delta (partial)                ✓
10s    Last SSE event received                                 ✓
10s+   ... server stops sending / network issue / proxy drop ...

       SDK state: for await (const event of stream) { ... }
                  ↑ Blocked here forever, no timeout fires

∞      Session unrecoverable, user must kill process           ✗
```
Expected Behavior
The SDK should provide an idle timeout that:
- Tracks time since the last SSE event was received
- Aborts the stream if no event arrives within the configured duration
- Throws a descriptive error enabling retry or graceful failure
```ts
// With an idle timeout, the above scenario would instead:
// - Detect no events for 90 seconds
// - Abort the stream
// - Throw StreamIdleTimeoutError
// - Allow the application to retry or fail gracefully
```

Minimal Reproduction
```ts
const stream = await client.messages.stream({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  messages: [{ role: 'user', content: 'Write a long story...' }],
});

// If the server stops sending events mid-stream:
// - This loop blocks forever
// - No error is thrown
// - No timeout fires
// - The process must be killed manually
for await (const event of stream) {
  console.log(event.type);
}
// Never reaches here if the stream stalls
```

Impact
This failure mode is particularly damaging because:
- Silent failure: No error, no indication anything is wrong
- Unrecoverable: Only solution is to kill the process
- Lost work: Any in-progress task state is lost
- Resource leak: Process hangs consuming memory/file handles
- User confusion: Impossible to distinguish "slow" from "stalled"
Proposed Solution
API Design
Add an `idleTimeout` option to both `ClientOptions` (default) and `RequestOptions` (per-request override):

```ts
interface ClientOptions {
  // Existing options...
  timeout?: number;      // Overall request timeout (existing)
  idleTimeout?: number;  // NEW: Max ms between SSE events (default: undefined = no limit)
}

interface RequestOptions {
  // Existing options...
  timeout?: number;      // Per-request timeout override
  idleTimeout?: number;  // NEW: Per-request idle timeout override
}
```

Usage
```ts
// Set default idle timeout for all streams
const client = new Anthropic({
  idleTimeout: 90_000, // 90 seconds between events
});

// Override per-request
const stream = await client.messages.stream(params, {
  idleTimeout: 120_000, // 2 minutes for this specific request
});

// Or via create()
const response = await client.messages.create({
  stream: true,
  // ...
}, {
  idleTimeout: 60_000,
});
```

Error Type
```ts
export class StreamIdleTimeoutError extends AnthropicError {
  constructor(
    public readonly idleTimeoutMs: number,
    public readonly lastEventTime: Date,
    public readonly eventCount: number
  ) {
    super(`Stream idle timeout: no event received for ${idleTimeoutMs}ms`);
  }
}
```

Implementation Approach
The idle timeout should be implemented in `MessageStream._createMessage()` (and similarly in the `Stream` class iterator):
```ts
async *_createMessage(/* ... */): AsyncGenerator<MessageStreamEvent> {
  const idleTimeout = options?.idleTimeout ?? this.#client?.idleTimeout;
  let lastEventTime = Date.now();
  let eventCount = 0;
  let timeoutId: NodeJS.Timeout | undefined;

  const checkIdle = () => {
    const idleTime = Date.now() - lastEventTime;
    if (idleTimeout && idleTime >= idleTimeout) {
      this.controller.abort();
      this.#handleError(new StreamIdleTimeoutError(
        idleTimeout,
        new Date(lastEventTime),
        eventCount
      ));
      return;
    }
    if (idleTimeout) {
      timeoutId = setTimeout(checkIdle, Math.min(1000, idleTimeout - idleTime));
    }
  };

  if (idleTimeout) {
    timeoutId = setTimeout(checkIdle, idleTimeout);
  }

  try {
    for await (const event of stream) {
      lastEventTime = Date.now();
      eventCount++;
      yield event;
    }
  } finally {
    if (timeoutId) clearTimeout(timeoutId);
  }
}
```

Type Contract Compatibility
This proposal:
- Adds optional properties only: fully backward compatible
- Follows existing patterns: mirrors the `timeout` option structure
- No breaking changes: existing code works unchanged
- No `any` types; strongly typed throughout: `idleTimeout: number | undefined`, `StreamIdleTimeoutError extends AnthropicError`
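The reset-on-each-event semantics can also be shown as a standalone wrapper, independent of SDK internals. This is a hedged sketch under the assumption that racing each `next()` against a fresh timer matches the proposed behavior; `withIdleTimeout` is illustrative only, not a proposed export.

```typescript
// Hypothetical stand-in for the proposed error type.
class StreamIdleTimeoutError extends Error {
  constructor(public readonly idleTimeoutMs: number) {
    super(`Stream idle timeout: no event received for ${idleTimeoutMs}ms`);
  }
}

// Each next() is raced against a fresh idle timer, so a stream that keeps
// producing events never times out, while one that goes silent fails after
// `idleMs` of silence.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer!: ReturnType<typeof setTimeout>;
    const idle = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs);
    });
    let result!: IteratorResult<T>;
    try {
      // A new timer is armed per event, so every received event resets the clock.
      result = await Promise.race([it.next(), idle]);
    } finally {
      clearTimeout(timer); // no dangling timer on success, stall, or error
    }
    if (result.done) return;
    yield result.value;
  }
}
```

A healthy-but-slow stream completes normally under this wrapper; a stream that emits one event and then goes silent fails with the idle-timeout error instead of hanging forever.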
Testing Strategy
Unit Tests
```ts
describe('MessageStream idle timeout', () => {
  it('should timeout when no events received', async () => {
    // Mock a stream that sends one event then stalls
    mockFetch().mockResolvedValue(createStalledStream());
    const stream = await client.messages.stream(params, { idleTimeout: 100 });
    await expect((async () => {
      for await (const event of stream) {}
    })()).rejects.toThrow(StreamIdleTimeoutError);
  });

  it('should reset timeout on each event', async () => {
    // Mock a stream with slow but consistent events
    mockFetch().mockResolvedValue(createSlowStream(50)); // 50ms between events
    const stream = await client.messages.stream(params, { idleTimeout: 100 });
    // Should complete successfully (events arrive before timeout)
    const events = [];
    for await (const event of stream) {
      events.push(event);
    }
    expect(events.length).toBeGreaterThan(0);
  });

  it('should clean up timeout on normal completion', async () => {
    const clearTimeoutSpy = jest.spyOn(global, 'clearTimeout');
    const stream = await client.messages.stream(params, { idleTimeout: 1000 });
    for await (const event of stream) {}
    expect(clearTimeoutSpy).toHaveBeenCalled();
  });

  it('should clean up timeout on abort', async () => {
    const controller = new AbortController();
    const stream = await client.messages.stream(params, {
      idleTimeout: 1000,
      signal: controller.signal
    });
    controller.abort();
    await expect(stream.done()).rejects.toThrow(APIUserAbortError);
    // Verify no dangling timers
  });
});
```

Mock Helpers
```ts
// Shared helpers for the mocks below
const encoder = new TextEncoder();
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function createStalledStream(): Response {
  return new Response(
    new ReadableStream({
      start(controller) {
        controller.enqueue(encoder.encode('event: message_start\ndata: {...}\n\n'));
        // Never sends more events, never closes
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}

function createSlowStream(intervalMs: number): Response {
  const events = ['message_start', 'content_block_start', 'content_block_delta', 'message_stop'];
  let index = 0;
  return new Response(
    new ReadableStream({
      async pull(controller) {
        if (index < events.length) {
          await sleep(intervalMs);
          controller.enqueue(encoder.encode(`event: ${events[index]}\ndata: {...}\n\n`));
          index++;
        } else {
          controller.close();
        }
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}
```

Alternatives Considered
1. Rely on OS-level TCP keepalive
- Problem: TCP keepalive only detects dead connections, not stalled streams on live connections
- Rejected: Doesn't solve the problem
2. Implement in application code (wrapper)
- Problem: Requires each consumer to implement their own timeout logic
- Rejected: Should be a first-class SDK feature
3. Use overall timeout option
- Problem: Can't distinguish between "slow but progressing" and "stalled"
- Rejected: Different use case - idle timeout complements overall timeout
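If the SDK threw the proposed error type, applications could layer retries on top without re-implementing stall detection (the gap alternative 2 highlights). A sketch follows; both the error class and the helper are hypothetical stand-ins, not existing SDK exports.

```typescript
// Hypothetical stand-in for the proposed error type.
class StreamIdleTimeoutError extends Error {}

// Retry a stream-consuming operation, but only when it failed due to an
// idle stall; any other error propagates immediately.
async function retryOnIdleTimeout<T>(
  fn: () => Promise<T>,
  maxRetries: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof StreamIdleTimeoutError)) throw err;
      lastError = err;
    }
  }
  throw lastError; // retries exhausted: surface the last idle-timeout error
}
```

This is the kind of composition a descriptive error type enables; it also bears on the "retry integration" question below, since the same logic could live inside the SDK's existing retry machinery.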
Related Issues
- #842: "Streaming responses consistently interrupted mid-transmission - connection closes without message_stop event" (different root cause: MCP timeouts)
- #844: "Long AI request always finish incomplete if you have a MCP connected to the client" (related but different issue)
This proposal addresses a distinct failure mode not covered by existing issues.
Implementation Scope
Files that would need changes:
- `src/internal/request-options.ts` - Add `idleTimeout` to `RequestOptions`
- `src/client.ts` - Add `idleTimeout` to `ClientOptions`
- `src/core/streaming.ts` - Implement idle timeout in `Stream` iterator
- `src/lib/MessageStream.ts` - Implement idle timeout in `_createMessage()`
- `src/core/error.ts` or new file - Add `StreamIdleTimeoutError` class
- `tests/streaming.test.ts` - Add idle timeout tests
- `tests/api-resources/MessageStream.test.ts` - Add idle timeout tests
- Update type exports
Questions for Maintainers
- Default value: Should `idleTimeout` have a default (e.g., 2 minutes) or be opt-in (undefined)?
- Ping events: Should `ping` SSE events reset the idle timer, or only "meaningful" events?
- Retry integration: Should idle timeout trigger retry logic if `maxRetries > 0`?
- Naming: Is `idleTimeout` clear, or would `streamIdleTimeout` / `eventTimeout` be better?
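On the ping question specifically, one discussable option (purely an assumption, framed for feedback) is to reset on every event by default, since even a `ping` proves the connection is alive, while accepting a predicate so callers can count only "meaningful" events. The tracker below is illustrative, not proposed API surface.

```typescript
type SSEEvent = { type: string };

// Tracks time since the last qualifying event. `now` is injectable so the
// logic can be exercised without real timers.
function makeIdleTracker(
  idleMs: number,
  resetsTimer: (ev: SSEEvent) => boolean = () => true, // default: every event resets
) {
  let lastReset = Date.now();
  return {
    observe(ev: SSEEvent, now: number = Date.now()): void {
      if (resetsTimer(ev)) lastReset = now;
    },
    isStalled(now: number = Date.now()): boolean {
      return now - lastReset >= idleMs;
    },
  };
}
```

Under the default policy a `ping` keeps the stream alive; with `(ev) => ev.type !== 'ping'` a server that only pings (but never produces content) would still trip the idle timeout.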