Description
Hi all.
I have repeatedly been hitting a frustrating bug in complex, long-running sessions with Claude. I did some digging with Claude's help, consulted the contribution guidelines, and ran through things as deeply as I could before filing this report. Please let me know if there is anything I should change, or any process I may be missing. I have some code to reference here as well.
This relates to issues #842 and #844, but goes further on reproduction and explanation, tries to offer some potential solutions, and adds a real-world example. It also serves as the reference and possible solution for what is outlined in #868, as a possible PR for discussion.
Summary
When using streaming responses (messages.stream() or messages.create({ stream: true })), if the server stops sending SSE events mid-stream (due to network issues, server stalls, or connection problems), the client will wait indefinitely. There is currently no mechanism to detect and abort stalled streams.
Problem Description
Current Behavior: A Real-World Failure
The SDK provides two timeout mechanisms:
- `timeout` option: overall request timeout (default: 10 minutes)
- `AbortController`: manual cancellation via signals
Neither mechanism detects or handles stalled streams. Here's what happened in a real production session:
The Incident
A Claude Code CLI session (f35c7c15-802e-462b-b468-b000a96e40bb) was performing a multi-step task using claude-opus-4-5-20251101. The session had been running successfully, accumulating 12,896 input tokens and 4,059 output tokens across multiple tool calls.
At 20:22:46 UTC, the API began streaming a response. The thinking block completed successfully:
```jsonc
{
  "message": {
    "model": "claude-opus-4-5-20251101",
    "id": "msg_018BDP6sdKEzgJ5GS1rn7NW1",
    "content": [
      {
        "type": "thinking",
        "thinking": "I can see my active trace session `3c62b735`. It shows 12896 input tokens, 4059 output tokens..."
      }
    ],
    "stop_reason": null,        // <-- Stream still in progress
    "usage": {
      "output_tokens": 4,       // <-- Only 4 tokens output, thinking block only
      "cache_read_input_tokens": 73111
    }
  },
  "timestamp": "2025-12-19T20:22:46.441Z"
}
```

The response was actively streaming (`stop_reason: null`), but after the thinking block, no more SSE events arrived. The stream didn't close, didn't error, didn't time out; it simply stopped sending data.
What the User Experienced
The CLI appeared frozen. No output, no error, no indication anything was wrong:
```
$ ps aux | grep claude
PID    %CPU  TIME     COMMAND
968    0.0   0:04.89  claude --dangerously-skip-permissions  # 0% CPU - waiting forever
18881  0.0   0:02.15  claude --dangerously-skip-permissions  # 0% CPU - waiting forever
```
Both processes at 0% CPU, sleeping indefinitely. The user's work was blocked with no way to know the session had stalled versus "still thinking."
Why Existing Timeouts Didn't Help
| Mechanism | Why It Failed |
|---|---|
| `timeout` (10 min default) | The request had already started receiving data, so the "request timeout" was satisfied |
| `AbortController` | Requires the user to manually abort, but how do they know it's stalled vs. slow? |
| TCP keepalive | Connection was technically alive, just not sending data |
| OS-level timeouts | No socket error because the connection wasn't dead |
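To make the `AbortController` row concrete, here is a minimal, self-contained sketch of the only workaround available today: a fixed overall deadline. Everything here (`slowButHealthy`, `consume`) is a hypothetical stand-in, not SDK code. The point is that a deadline short enough to catch stalls quickly also aborts streams that are slow but still making progress.

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Stand-in for an SDK stream that IS making progress: one event every 40 ms.
async function* slowButHealthy(signal: AbortSignal): AsyncGenerator<string> {
  for (const ev of ['message_start', 'content_block_delta', 'content_block_delta', 'message_stop']) {
    if (signal.aborted) throw new Error('aborted');
    await sleep(40);
    yield ev;
  }
}

// Manual watchdog: abort after a fixed wall-clock deadline, healthy or not.
async function consume(deadlineMs: number): Promise<string[]> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), deadlineMs);
  const events: string[] = [];
  try {
    for await (const ev of slowButHealthy(controller.signal)) events.push(ev);
  } finally {
    clearTimeout(timer); // avoid a dangling timer on success or failure
  }
  return events;
}
```

With a generous deadline the stream completes; with a tight one the watchdog kills a stream that was progressing fine, which is exactly why a fixed deadline cannot substitute for an idle timeout.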
Timeline of the Failure
| Time (UTC) | Event | Evidence |
|---|---|---|
| 20:22:43 | Session active, tools executing | Trace shows 12896↓ 4059↑ tokens |
| 20:22:46 | Thinking block completes | content[0].type: "thinking" |
| 20:22:47 | Tool use begins | name: "Bash" in content |
| 20:22:56 | Last recorded SSE event | Final JSONL entry |
| 20:24:25+ | Stream stalls | No more events, stop_reason still null |
| 20:37:00+ | User notices hang | ~15 minutes of lost productivity |
Network State at Time of Hang
```
# Connections to Anthropic API (34.36.57.103:443)
tcp4  SYN_SENT   34.36.57.103:443   # New connection attempts stuck
tcp4  SYN_SENT   34.36.57.103:443   # Handshake never completing
tcp4  TIME_WAIT  34.36.57.103:443   # Recent connections closing normally
tcp4  TIME_WAIT  34.36.57.103:443   # (multiple)
```
The original streaming connection doesn't appear in ESTABLISHED state—suggesting it may have been silently dropped at the network layer while the SDK continued waiting for events.
The Core Problem
```
Timeline:
─────────────────────────────────────────────────────────────────────────────
0s     Connection established, request sent                    ✓
0.5s   First SSE event: message_start                          ✓
2s     SSE event: thinking block                               ✓
5s     SSE event: content_block_delta (partial)                ✓
10s    Last SSE event received                                 ✓
10s+   ... server stops sending / network issue / proxy drop ...

       SDK state: for await (const event of stream) { ... }
                  ↑ Blocked here forever, no timeout fires

∞      Session unrecoverable, user must kill process           ✗
```
Expected Behavior
The SDK should provide an idle timeout that:
- Tracks time since the last SSE event was received
- Aborts the stream if no event arrives within the configured duration
- Throws a descriptive error enabling retry or graceful failure
```ts
// With an idle timeout, the above scenario would instead:
// - Detect no events for 90 seconds
// - Abort the stream
// - Throw StreamIdleTimeoutError
// - Allow the application to retry or fail gracefully
```

Minimal Reproduction
```ts
const stream = await client.messages.stream({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  messages: [{ role: 'user', content: 'Write a long story...' }],
});

// If the server stops sending events mid-stream:
// - This loop blocks forever
// - No error is thrown
// - No timeout fires
// - The process must be killed manually
for await (const event of stream) {
  console.log(event.type);
}
// Never reaches here if the stream stalls
```

Impact
This failure mode is particularly damaging because:
- Silent failure: No error, no indication anything is wrong
- Unrecoverable: Only solution is to kill the process
- Lost work: Any in-progress task state is lost
- Resource leak: Process hangs consuming memory/file handles
- User confusion: Impossible to distinguish "slow" from "stalled"
Proposed Solution
API Design
Add an `idleTimeout` option to both `ClientOptions` (default) and `RequestOptions` (per-request override):

```ts
interface ClientOptions {
  // Existing options...
  timeout?: number;      // Overall request timeout (existing)
  idleTimeout?: number;  // NEW: Max ms between SSE events (default: undefined = no limit)
}

interface RequestOptions {
  // Existing options...
  timeout?: number;      // Per-request timeout override
  idleTimeout?: number;  // NEW: Per-request idle timeout override
}
```

Usage
```ts
// Set default idle timeout for all streams
const client = new Anthropic({
  idleTimeout: 90_000, // 90 seconds between events
});

// Override per-request
const stream = await client.messages.stream(params, {
  idleTimeout: 120_000, // 2 minutes for this specific request
});

// Or via create()
const response = await client.messages.create({
  stream: true,
  // ...
}, {
  idleTimeout: 60_000,
});
```

Error Type
```ts
export class StreamIdleTimeoutError extends AnthropicError {
  constructor(
    public readonly idleTimeoutMs: number,
    public readonly lastEventTime: Date,
    public readonly eventCount: number
  ) {
    super(`Stream idle timeout: no event received for ${idleTimeoutMs}ms`);
  }
}
```

Implementation Approach
The idle timeout should be implemented in `MessageStream._createMessage()` (and similarly in the `Stream` class iterator):
```ts
async *_createMessage(/* ... */): AsyncGenerator<MessageStreamEvent> {
  const idleTimeout = options?.idleTimeout ?? this.#client?.idleTimeout;
  let lastEventTime = Date.now();
  let eventCount = 0;
  let timeoutId: NodeJS.Timeout | undefined;

  const checkIdle = () => {
    const idleTime = Date.now() - lastEventTime;
    if (idleTimeout && idleTime >= idleTimeout) {
      this.controller.abort();
      this.#handleError(new StreamIdleTimeoutError(
        idleTimeout,
        new Date(lastEventTime),
        eventCount
      ));
      return;
    }
    if (idleTimeout) {
      timeoutId = setTimeout(checkIdle, Math.min(1000, idleTimeout - idleTime));
    }
  };

  if (idleTimeout) {
    timeoutId = setTimeout(checkIdle, idleTimeout);
  }

  try {
    for await (const event of stream) {
      lastEventTime = Date.now();
      eventCount++;
      yield event;
    }
  } finally {
    if (timeoutId) clearTimeout(timeoutId);
  }
}
```

Type Contract Compatibility
This proposal:
- Adds optional properties only: fully backward compatible
- Follows existing patterns: mirrors the `timeout` option structure
- No breaking changes: existing code works unchanged
- No `any` types; strongly typed throughout: `idleTimeout: number | undefined`, `StreamIdleTimeoutError extends AnthropicError`
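The reset-on-each-event semantics can also be shown as a standalone wrapper, independent of SDK internals. This is a hedged sketch under the assumption that racing each `next()` against a fresh timer matches the proposed behavior; `withIdleTimeout` is illustrative only, not a proposed export.

```typescript
// Hypothetical stand-in for the proposed error type.
class StreamIdleTimeoutError extends Error {
  constructor(public readonly idleTimeoutMs: number) {
    super(`Stream idle timeout: no event received for ${idleTimeoutMs}ms`);
  }
}

// Each next() is raced against a fresh idle timer, so a stream that keeps
// producing events never times out, while one that goes silent fails after
// `idleMs` of silence.
async function* withIdleTimeout<T>(
  source: AsyncIterable<T>,
  idleMs: number,
): AsyncGenerator<T> {
  const it = source[Symbol.asyncIterator]();
  while (true) {
    let timer!: ReturnType<typeof setTimeout>;
    const idle = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs);
    });
    let result!: IteratorResult<T>;
    try {
      // A new timer is armed per event, so every received event resets the clock.
      result = await Promise.race([it.next(), idle]);
    } finally {
      clearTimeout(timer); // no dangling timer on success, stall, or error
    }
    if (result.done) return;
    yield result.value;
  }
}
```

A healthy-but-slow stream completes normally under this wrapper; a stream that emits one event and then goes silent fails with the idle-timeout error instead of hanging forever.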
Testing Strategy
Unit Tests
```ts
describe('MessageStream idle timeout', () => {
  it('should timeout when no events received', async () => {
    // Mock a stream that sends one event then stalls
    mockFetch().mockResolvedValue(createStalledStream());
    const stream = await client.messages.stream(params, { idleTimeout: 100 });
    await expect((async () => {
      for await (const event of stream) {}
    })()).rejects.toThrow(StreamIdleTimeoutError);
  });

  it('should reset timeout on each event', async () => {
    // Mock a stream with slow but consistent events
    mockFetch().mockResolvedValue(createSlowStream(50)); // 50ms between events
    const stream = await client.messages.stream(params, { idleTimeout: 100 });
    // Should complete successfully (events arrive before timeout)
    const events = [];
    for await (const event of stream) {
      events.push(event);
    }
    expect(events.length).toBeGreaterThan(0);
  });

  it('should clean up timeout on normal completion', async () => {
    const clearTimeoutSpy = jest.spyOn(global, 'clearTimeout');
    const stream = await client.messages.stream(params, { idleTimeout: 1000 });
    for await (const event of stream) {}
    expect(clearTimeoutSpy).toHaveBeenCalled();
  });

  it('should clean up timeout on abort', async () => {
    const controller = new AbortController();
    const stream = await client.messages.stream(params, {
      idleTimeout: 1000,
      signal: controller.signal
    });
    controller.abort();
    await expect(stream.done()).rejects.toThrow(APIUserAbortError);
    // Verify no dangling timers
  });
});
```

Mock Helpers
```ts
// Shared helpers for the mocks below
const encoder = new TextEncoder();
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function createStalledStream(): Response {
  return new Response(
    new ReadableStream({
      start(controller) {
        controller.enqueue(encoder.encode('event: message_start\ndata: {...}\n\n'));
        // Never sends more events, never closes
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}

function createSlowStream(intervalMs: number): Response {
  const events = ['message_start', 'content_block_start', 'content_block_delta', 'message_stop'];
  let index = 0;
  return new Response(
    new ReadableStream({
      async pull(controller) {
        if (index < events.length) {
          await sleep(intervalMs);
          controller.enqueue(encoder.encode(`event: ${events[index]}\ndata: {...}\n\n`));
          index++;
        } else {
          controller.close();
        }
      }
    }),
    { headers: { 'content-type': 'text/event-stream' } }
  );
}
```

Alternatives Considered
1. Rely on OS-level TCP keepalive
- Problem: TCP keepalive only detects dead connections, not stalled streams on live connections
- Rejected: Doesn't solve the problem
2. Implement in application code (wrapper)
- Problem: Requires each consumer to implement their own timeout logic
- Rejected: Should be a first-class SDK feature
3. Use overall timeout option
- Problem: Can't distinguish between "slow but progressing" and "stalled"
- Rejected: Different use case - idle timeout complements overall timeout
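If the SDK threw the proposed error type, applications could layer retries on top without re-implementing stall detection (the gap alternative 2 highlights). A sketch follows; both the error class and the helper are hypothetical stand-ins, not existing SDK exports.

```typescript
// Hypothetical stand-in for the proposed error type.
class StreamIdleTimeoutError extends Error {}

// Retry a stream-consuming operation, but only when it failed due to an
// idle stall; any other error propagates immediately.
async function retryOnIdleTimeout<T>(
  fn: () => Promise<T>,
  maxRetries: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!(err instanceof StreamIdleTimeoutError)) throw err;
      lastError = err;
    }
  }
  throw lastError; // retries exhausted: surface the last idle-timeout error
}
```

This is the kind of composition a descriptive error type enables; it also bears on the "retry integration" question below, since the same logic could live inside the SDK's existing retry machinery.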
Related Issues
- #842: "Streaming responses consistently interrupted mid-transmission - connection closes without message_stop event" (different root cause: MCP timeouts)
- #844: "Long AI request always finish incomplete if you have a MCP connected to the client" (related but different issue)
This proposal addresses a distinct failure mode not covered by existing issues.
Implementation Scope
Files that would need changes:
- `src/internal/request-options.ts` - Add `idleTimeout` to `RequestOptions`
- `src/client.ts` - Add `idleTimeout` to `ClientOptions`
- `src/core/streaming.ts` - Implement idle timeout in `Stream` iterator
- `src/lib/MessageStream.ts` - Implement idle timeout in `_createMessage()`
- `src/core/error.ts` or new file - Add `StreamIdleTimeoutError` class
- `tests/streaming.test.ts` - Add idle timeout tests
- `tests/api-resources/MessageStream.test.ts` - Add idle timeout tests
- Update type exports
Questions for Maintainers
- Default value: Should `idleTimeout` have a default (e.g., 2 minutes) or be opt-in (undefined)?
- Ping events: Should `ping` SSE events reset the idle timer, or only "meaningful" events?
- Retry integration: Should idle timeout trigger retry logic if `maxRetries > 0`?
- Naming: Is `idleTimeout` clear, or would `streamIdleTimeout` / `eventTimeout` be better?
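On the ping question specifically, one discussable option (purely an assumption, framed for feedback) is to reset on every event by default, since even a `ping` proves the connection is alive, while accepting a predicate so callers can count only "meaningful" events. The tracker below is illustrative, not proposed API surface.

```typescript
type SSEEvent = { type: string };

// Tracks time since the last qualifying event. `now` is injectable so the
// logic can be exercised without real timers.
function makeIdleTracker(
  idleMs: number,
  resetsTimer: (ev: SSEEvent) => boolean = () => true, // default: every event resets
) {
  let lastReset = Date.now();
  return {
    observe(ev: SSEEvent, now: number = Date.now()): void {
      if (resetsTimer(ev)) lastReset = now;
    },
    isStalled(now: number = Date.now()): boolean {
      return now - lastReset >= idleMs;
    },
  };
}
```

Under the default policy a `ping` keeps the stream alive; with `(ev) => ev.type !== 'ping'` a server that only pings (but never produces content) would still trip the idle timeout.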