bug/proposal: infinitely hanging clients breaking bigger/complex sessions. Proposal: add Streaming Idle Timeout to Prevent Indefinite Hangs #867

@notactuallytreyanastasio

Description

Hi all.

I have repeatedly been encountering a pretty frustrating bug on complex, longer sessions with Claude.

I did some digging with Claude's help, consulted the contribution guidelines, and worked through the problem as thoroughly as I could before writing this report.

Please let me know if there is anything I should change, or any process I have missed.

I have some code to reference here as well.

This also seems related to issues #842 and #844, but it goes further on reproduction and explanation, offers some potential solutions, and adds a real-world example.

This issue serves as the reference for #868, a possible PR opened for discussion.

Summary

When using streaming responses (messages.stream() or messages.create({ stream: true })), if the server stops sending SSE events mid-stream (due to network issues, server stalls, or connection problems), the client will wait indefinitely. There is currently no mechanism to detect and abort stalled streams.

Problem Description

Current Behavior: A Real-World Failure

The SDK provides two timeout mechanisms:

  • timeout option: Overall request timeout (default: 10 minutes)
  • AbortController: Manual cancellation via signals

Neither mechanism detects or handles stalled streams. Here's what happened in a real production session:

The Incident

A Claude Code CLI session (f35c7c15-802e-462b-b468-b000a96e40bb) was performing a multi-step task using claude-opus-4-5-20251101. The session had been running successfully, accumulating 12,896 input tokens and 4,059 output tokens across multiple tool calls.

At 20:22:46 UTC, the API began streaming a response. The thinking block completed successfully:

{
  "message": {
    "model": "claude-opus-4-5-20251101",
    "id": "msg_018BDP6sdKEzgJ5GS1rn7NW1",
    "content": [
      {
        "type": "thinking",
        "thinking": "I can see my active trace session `3c62b735`. It shows 12896 input tokens, 4059 output tokens..."
      }
    ],
    "stop_reason": null,    // <-- Stream still in progress
    "usage": {
      "output_tokens": 4,   // <-- Only 4 tokens output, thinking block only
      "cache_read_input_tokens": 73111
    }
  },
  "timestamp": "2025-12-19T20:22:46.441Z"
}

The response was actively streaming (stop_reason: null), but after the thinking block, no more SSE events arrived. The stream didn't close, didn't error, didn't timeout—it simply stopped sending data.

What the User Experienced

The CLI appeared frozen. No output, no error, no indication anything was wrong:

$ ps aux | grep claude
  PID   %CPU  TIME     COMMAND
  968   0.0   0:04.89  claude --dangerously-skip-permissions  # 0% CPU - waiting forever
  18881 0.0   0:02.15  claude --dangerously-skip-permissions  # 0% CPU - waiting forever

Both processes sat at 0% CPU, sleeping indefinitely. The user's work was blocked, with no way to tell whether the session had stalled or was "still thinking."

Why Existing Timeouts Didn't Help

Mechanism                | Why It Failed
-------------------------|--------------
timeout (10 min default) | The request had already started receiving data, so the "request timeout" was satisfied
AbortController          | Requires the user to abort manually, but how do they know it's stalled vs slow?
TCP keepalive            | Connection was technically alive, just not sending data
OS-level timeouts        | No socket error because the connection wasn't dead

Timeline of the Failure

Time (UTC) | Event                           | Evidence
-----------|---------------------------------|---------
20:22:43   | Session active, tools executing | Trace shows 12896↓ 4059↑ tokens
20:22:46   | Thinking block completes        | content[0].type: "thinking"
20:22:47   | Tool use begins                 | name: "Bash" in content
20:22:56   | Last recorded SSE event         | Final JSONL entry
20:24:25+  | Stream stalls                   | No more events, stop_reason still null
20:37:00+  | User notices hang               | ~15 minutes of lost productivity

Network State at Time of Hang

# Connections to Anthropic API (34.36.57.103:443)
tcp4  SYN_SENT   34.36.57.103:443  # New connection attempts stuck
tcp4  SYN_SENT   34.36.57.103:443  # Handshake never completing
tcp4  TIME_WAIT  34.36.57.103:443  # Recent connections closing normally
tcp4  TIME_WAIT  34.36.57.103:443  # (multiple)

The original streaming connection doesn't appear in ESTABLISHED state—suggesting it may have been silently dropped at the network layer while the SDK continued waiting for events.

The Core Problem

Timeline:
─────────────────────────────────────────────────────────────────────────────
0s          Connection established, request sent                    ✓
0.5s        First SSE event: message_start                          ✓
2s          SSE event: thinking block                               ✓
5s          SSE event: content_block_delta (partial)                ✓
10s         Last SSE event received                                 ✓
10s+        ... server stops sending / network issue / proxy drop ...

            SDK state: for await (const event of stream) { ... }
                       ↑ Blocked here forever, no timeout fires

∞           Session unrecoverable, user must kill process           ✗

Expected Behavior

The SDK should provide an idle timeout that:

  1. Tracks time since the last SSE event was received
  2. Aborts the stream if no event arrives within the configured duration
  3. Throws a descriptive error enabling retry or graceful failure

// With idle timeout, the above scenario would instead:
// - Detect no events for 90 seconds
// - Abort the stream
// - Throw StreamIdleTimeoutError
// - Allow the application to retry or fail gracefully
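
As a rough illustration of the three requirements above, here is a minimal consumer-side sketch that wraps any async iterable. It is not SDK code: `withIdleTimeout`, `IdleTimeoutError`, and the `onIdle` callback are hypothetical names used only for this example, and it assumes a runtime with ES2018+ promises.

```typescript
// Hypothetical error type for this sketch; the SDK has no such class today.
class IdleTimeoutError extends Error {
  readonly idleTimeoutMs: number;
  readonly eventCount: number;
  constructor(idleTimeoutMs: number, eventCount: number) {
    super(`No event received for ${idleTimeoutMs}ms (after ${eventCount} events)`);
    this.name = 'IdleTimeoutError';
    this.idleTimeoutMs = idleTimeoutMs;
    this.eventCount = eventCount;
  }
}

// Re-yields events from `source`, throwing IdleTimeoutError when the wait
// for the next event exceeds `idleTimeoutMs`.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleTimeoutMs: number,
  onIdle?: () => void, // e.g. () => controller.abort() to release the socket
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  let eventCount = 0;
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const idle = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        onIdle?.();
        reject(new IdleTimeoutError(idleTimeoutMs, eventCount));
      }, idleTimeoutMs);
    });
    // Whichever settles first wins; the timer is cleared either way, so time
    // the consumer spends handling an event does not count as idle time.
    const result = await Promise.race([iterator.next(), idle]).finally(() => clearTimeout(timer));
    if (result.done) return;
    eventCount++;
    yield result.value;
  }
}
```

Note that the wrapper only stops waiting; it does not cancel the underlying request by itself, which is why the sketch accepts an `onIdle` hook where the caller can abort.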

Minimal Reproduction

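import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();

const stream = await client.messages.stream({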
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  messages: [{ role: 'user', content: 'Write a long story...' }],
});

// If the server stops sending events mid-stream:
// - This loop blocks forever
// - No error is thrown
// - No timeout fires
// - The process must be killed manually
for await (const event of stream) {
  console.log(event.type);
}
// Never reaches here if stream stalls

Impact

This failure mode is particularly damaging because:

  1. Silent failure: No error, no indication anything is wrong
  2. Unrecoverable: Only solution is to kill the process
  3. Lost work: Any in-progress task state is lost
  4. Resource leak: Process hangs consuming memory/file handles
  5. User confusion: Impossible to distinguish "slow" from "stalled"

Proposed Solution

API Design

Add an idleTimeout option to both ClientOptions (client-wide default) and RequestOptions (per-request override):

interface ClientOptions {
  // Existing options...
  timeout?: number;        // Overall request timeout (existing)
  idleTimeout?: number;    // NEW: Max ms between SSE events (default: undefined = no limit)
}

interface RequestOptions {
  // Existing options...
  timeout?: number;        // Per-request timeout override
  idleTimeout?: number;    // NEW: Per-request idle timeout override
}
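
To make the intended precedence concrete, here is a small sketch of how the two options might be resolved. `resolveIdleTimeout` and `IdleTimeoutOptions` are hypothetical names for this example only, and rejecting negative or non-finite values is an assumption rather than part of the proposal.

```typescript
// Hypothetical shape: only the field relevant to this sketch.
interface IdleTimeoutOptions {
  idleTimeout?: number;
}

// The per-request value wins; otherwise fall back to the client default.
// `undefined` means "no idle limit", matching the proposed default.
function resolveIdleTimeout(
  request: IdleTimeoutOptions | undefined,
  client: IdleTimeoutOptions,
): number | undefined {
  const value = request?.idleTimeout ?? client.idleTimeout;
  if (value === undefined) return undefined;
  // Assumption: negative or non-finite values are a caller bug, so fail loudly.
  if (!Number.isFinite(value) || value < 0) {
    throw new RangeError(`idleTimeout must be a non-negative number of ms, got ${value}`);
  }
  return value;
}
```

Using `??` rather than `||` means an explicit per-request `0` is not silently replaced by the client default; whether `0` should mean "disabled" is left open here.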

Usage

// Set default idle timeout for all streams
const client = new Anthropic({
  idleTimeout: 90_000, // 90 seconds between events
});

// Override per-request
const stream = await client.messages.stream(params, {
  idleTimeout: 120_000, // 2 minutes for this specific request
});

// Or via create()
const response = await client.messages.create({
  stream: true,
  // ...
}, {
  idleTimeout: 60_000,
});

Error Type

export class StreamIdleTimeoutError extends AnthropicError {
  constructor(
    public readonly idleTimeoutMs: number,
    public readonly lastEventTime: Date,
    public readonly eventCount: number
  ) {
    super(`Stream idle timeout: no event received for ${idleTimeoutMs}ms`);
  }
}
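
A dedicated error type lets callers tell a stall apart from other failures. The sketch below shows one way an application might retry on it; `StreamStalledError` is a hypothetical stand-in for the proposed class and `retryOnStall` is an illustrative helper, neither of which exists in the SDK.

```typescript
// Hypothetical stand-in for the proposed StreamIdleTimeoutError.
class StreamStalledError extends Error {}

// Runs `attempt` and retries only when it fails with StreamStalledError,
// up to `maxRetries` extra attempts. Any other error is rethrown at once.
async function retryOnStall<T>(
  attempt: (attemptNumber: number) => Promise<T>,
  maxRetries: number,
): Promise<T> {
  for (let n = 0; ; n++) {
    try {
      return await attempt(n);
    } catch (err) {
      const retryable = err instanceof StreamStalledError && n < maxRetries;
      if (!retryable) throw err;
    }
  }
}
```

A retry starts the response from scratch, so any partial output already shown to the user would need to be discarded or reconciled by the caller.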

Implementation Approach

The idle timeout should be implemented in MessageStream._createMessage() (and similarly in the Stream class iterator):

async *_createMessage(/* ... */): AsyncGenerator<MessageStreamEvent> {
  const idleTimeout = options?.idleTimeout ?? this.#client?.idleTimeout;

  let lastEventTime = Date.now();
  let eventCount = 0;
  let timeoutId: ReturnType<typeof setTimeout> | undefined; // portable across Node and browser runtimes

  const checkIdle = () => {
    const idleTime = Date.now() - lastEventTime;
    if (idleTimeout && idleTime >= idleTimeout) {
      this.controller.abort();
      this.#handleError(new StreamIdleTimeoutError(
        idleTimeout,
        new Date(lastEventTime),
        eventCount
      ));
      return;
    }
    if (idleTimeout) {
      timeoutId = setTimeout(checkIdle, Math.min(1000, idleTimeout - idleTime));
    }
  };

  if (idleTimeout) {
    timeoutId = setTimeout(checkIdle, idleTimeout);
  }

  try {
    for await (const event of stream) {
      lastEventTime = Date.now();
      eventCount++;
      yield event;
    }
  } finally {
    if (timeoutId) clearTimeout(timeoutId);
  }
}
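
The polling `checkIdle` above re-arms itself up to once per second. One possible alternative, sketched here under the assumption that the stream loop can call a hook per event, is a single timer that is re-armed on every event. `IdleWatchdog` is a hypothetical helper, not an SDK class.

```typescript
// Hypothetical helper: one timer, re-armed on every event.
class IdleWatchdog {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly idleMs: number;
  private readonly onIdle: () => void;

  constructor(idleMs: number, onIdle: () => void) {
    this.idleMs = idleMs;
    this.onIdle = onIdle;
  }

  // Call when the stream starts and again after each event arrives.
  reset(): void {
    this.stop();
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.onIdle();
    }, this.idleMs);
  }

  // Call from a finally block so no timer outlives the stream.
  stop(): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = undefined;
  }
}

// Usage sketch: abort the request when the watchdog fires.
const controller = new AbortController();
const watchdog = new IdleWatchdog(90_000, () => controller.abort());
watchdog.reset(); // arm at stream start, then call reset() per event
watchdog.stop();  // disarm on completion or error
```

The trade-off is one `clearTimeout`/`setTimeout` pair per event instead of a periodic check; which is cheaper in practice depends on event rate and was not measured here.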

Type Contract Compatibility

This proposal:

  1. Adds optional properties only - fully backward compatible
  2. Follows existing patterns - mirrors timeout option structure
  3. No breaking changes - existing code works unchanged
  4. No any types - strongly typed throughout:
    • idleTimeout: number | undefined
    • StreamIdleTimeoutError extends AnthropicError

Testing Strategy

Unit Tests

describe('MessageStream idle timeout', () => {
  it('should timeout when no events received', async () => {
    // Mock a stream that sends one event then stalls
    mockFetch().mockResolvedValue(createStalledStream());

    const stream = await client.messages.stream(params, { idleTimeout: 100 });

    await expect(async () => {
      for await (const event of stream) {}
    }).rejects.toThrow(StreamIdleTimeoutError);
  });

  it('should reset timeout on each event', async () => {
    // Mock a stream with slow but consistent events
    mockFetch().mockResolvedValue(createSlowStream(50)); // 50ms between events

    const stream = await client.messages.stream(params, { idleTimeout: 100 });

    // Should complete successfully (events arrive before timeout)
    const events = [];
    for await (const event of stream) {
      events.push(event);
    }
    expect(events.length).toBeGreaterThan(0);
  });

  it('should clean up timeout on normal completion', async () => {
    const clearTimeoutSpy = jest.spyOn(global, 'clearTimeout');

    const stream = await client.messages.stream(params, { idleTimeout: 1000 });
    for await (const event of stream) {}

    expect(clearTimeoutSpy).toHaveBeenCalled();
  });

  it('should clean up timeout on abort', async () => {
    const controller = new AbortController();
    const stream = await client.messages.stream(params, {
      idleTimeout: 1000,
      signal: controller.signal
    });

    controller.abort();

    await expect(stream.done()).rejects.toThrow(APIUserAbortError);
    // Verify no dangling timers
  });
});

Mock Helpers

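// Shared by both helpers below
const encoder = new TextEncoder();
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function createStalledStream(): Response {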
  return new Response(
    new ReadableStream({
      start(controller) {
        controller.enqueue(encoder.encode('event: message_start\ndata: {...}\n\n'));
        // Never sends more events, never closes
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}

function createSlowStream(intervalMs: number): Response {
  const events = ['message_start', 'content_block_start', 'content_block_delta', 'message_stop'];
  let index = 0;

  return new Response(
    new ReadableStream({
      async pull(controller) {
        if (index < events.length) {
          await sleep(intervalMs);
          controller.enqueue(encoder.encode(`event: ${events[index]}\ndata: {...}\n\n`));
          index++;
        } else {
          controller.close();
        }
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}

Alternatives Considered

1. Rely on OS-level TCP keepalive

  • Problem: TCP keepalive only detects dead connections, not stalled streams on live connections
  • Rejected: Doesn't solve the problem

2. Implement in application code (wrapper)

  • Problem: Requires each consumer to implement their own timeout logic
  • Rejected: Should be a first-class SDK feature

3. Use overall timeout option

  • Problem: Can't distinguish between "slow but progressing" and "stalled"
  • Rejected: Different use case - idle timeout complements overall timeout
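
To illustrate why an overall deadline and an idle timeout answer different questions, here is a small worked sketch over event arrival times. It models the overall timeout as a plain wall-clock deadline, which is a simplifying assumption for comparison only, and both function names are hypothetical.

```typescript
// eventTimesMs: arrival time of each event in ms since the request started (ascending).
// observedUntilMs: when the stream ended, or when we stopped watching.
// Returns the time at which an idle timer would fire, or undefined if it never would.
function idleTimeoutFiresAt(
  eventTimesMs: number[],
  observedUntilMs: number,
  idleMs: number,
): number | undefined {
  let last = 0; // the idle timer is armed when the request starts
  for (const t of [...eventTimesMs, observedUntilMs]) {
    if (t - last > idleMs) return last + idleMs;
    last = t;
  }
  return undefined;
}

// Assumed model: a wall-clock deadline measured from the start of the request.
function overallDeadlineFiresAt(observedUntilMs: number, overallMs: number): number | undefined {
  return observedUntilMs > overallMs ? overallMs : undefined;
}

// Stalled stream: four quick events, then silence for 15 minutes.
const stalled = [500, 2_000, 5_000, 10_000];
const stalledIdle = idleTimeoutFiresAt(stalled, 900_000, 90_000);  // 100_000 (10s + 90s)
const stalledOverall = overallDeadlineFiresAt(900_000, 600_000);   // 600_000

// Slow but healthy stream: one event every 30s for 12 minutes.
const slow = Array.from({ length: 24 }, (_, i) => (i + 1) * 30_000);
const slowIdle = idleTimeoutFiresAt(slow, 720_000, 90_000);        // undefined
const slowOverall = overallDeadlineFiresAt(720_000, 600_000);      // 600_000
```

Under these assumptions the idle timer catches the stall after 100 seconds and leaves the slow stream alone, while the wall-clock deadline treats both cases identically.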

Related Issues

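Related to #842 and #844 (mentioned above). This proposal addresses a distinct failure mode, a stream that stalls mid-response without closing or erroring, which does not appear to be fully covered by those issues.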

Implementation Scope

Files that would need changes:

  1. src/internal/request-options.ts - Add idleTimeout to RequestOptions
  2. src/client.ts - Add idleTimeout to ClientOptions
  3. src/core/streaming.ts - Implement idle timeout in Stream iterator
  4. src/lib/MessageStream.ts - Implement idle timeout in _createMessage()
  5. src/core/error.ts or new file - Add StreamIdleTimeoutError class
  6. tests/streaming.test.ts - Add idle timeout tests
  7. tests/api-resources/MessageStream.test.ts - Add idle timeout tests
  8. Update type exports

Questions for Maintainers

  1. Default value: Should idleTimeout have a default (e.g., 2 minutes) or be opt-in (undefined)?
  2. Ping events: Should ping SSE events reset the idle timer, or only "meaningful" events?
  3. Retry integration: Should idle timeout trigger retry logic if maxRetries > 0?
  4. Naming: Is idleTimeout clear, or would streamIdleTimeout / eventTimeout be better?
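
For question 2, the two options could be expressed as a small policy switch. This is only a sketch: `IdleResetPolicy` and `resetsIdleTimer` are hypothetical names, and the only event-type assumption is that keepalive events carry the type `ping`.

```typescript
// Hypothetical policy names for the two options in question 2.
type IdleResetPolicy = 'any-event' | 'non-ping';

// Decides whether an incoming SSE event should reset the idle timer.
function resetsIdleTimer(eventType: string, policy: IdleResetPolicy): boolean {
  if (policy === 'any-event') return true; // pings prove the connection is alive
  return eventType !== 'ping';             // only events that carry progress count
}
```

With `any-event`, a server that keeps pinging but never produces content would not trip the timer; with `non-ping`, it would, at the cost of needing a longer limit for slow but healthy responses.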
