fix(agent): retry stream on raw connection reset errors #2931

Closed
johnjansen wants to merge 1 commit into charmbracelet:main from johnjansen:fix/connection-reset-retry

@johnjansen
Contributor

Problem

When the upstream model provider tears down the TCP connection mid-handshake
or before any stream bytes arrive, the error surfaces unwrapped from the
HTTP client (commonly as read tcp ...: connection reset by peer). Fantasy's
own retry layer only fires for errors it can cast to *fantasy.ProviderError
with IsRetryable() == true. A raw syscall.ECONNRESET never gets that
wrapping, so the session bombs out and the user has to re-prompt by hand.
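
To make the gap concrete, here is a minimal sketch of a retry gate that only recognizes a provider error type. `providerError` and `retryableByProviderLayer` are hypothetical stand-ins written for this illustration; the real `*fantasy.ProviderError` is not reproduced here, and the status rule inside `IsRetryable` is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// providerError is a hypothetical stand-in for *fantasy.ProviderError.
// Only the IsRetryable method name comes from the description above.
type providerError struct {
	status int
}

func (e *providerError) Error() string {
	return fmt.Sprintf("provider error: status %d", e.status)
}

// IsRetryable uses an assumed rule (429 or 5xx) purely for illustration.
func (e *providerError) IsRetryable() bool {
	return e.status == 429 || e.status >= 500
}

// retryableByProviderLayer models a gate that retries only when the error
// chain contains a providerError that reports itself as retryable.
func retryableByProviderLayer(err error) bool {
	var pe *providerError
	return errors.As(err, &pe) && pe.IsRetryable()
}

// wrappedOverload builds an error the gate recognizes.
func wrappedOverload() error {
	return &providerError{status: 503}
}

// rawReset builds the kind of unwrapped error described in this PR.
func rawReset() error {
	return &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET}
}

func main() {
	// The wrapped provider error passes the gate.
	fmt.Println("wrapped retryable:", retryableByProviderLayer(wrappedOverload()))
	// The raw reset does not, because nothing in its chain is a providerError.
	fmt.Println("raw retryable:", retryableByProviderLayer(rawReset()))
	// The sentinel is still reachable through net.OpError's Unwrap.
	fmt.Println("raw is ECONNRESET:", errors.Is(rawReset(), syscall.ECONNRESET))
}
```

The last check is what a detector at the agent layer can rely on: the sentinel remains visible to `errors.Is` even though the provider-level gate never sees a type it knows.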

Before

  1. Provider RSTs the TCP connection before the first SSE event.
  2. agent.Stream returns &net.OpError{Err: syscall.ECONNRESET}.
  3. Fantasy retry does nothing (not a ProviderError).
  4. Run enters the error branch, writes a finish message, returns.
  5. User sees "connection reset by peer" and must resend the prompt.

After

  1. Provider RSTs the TCP connection before the first SSE event.
  2. agent.Stream returns the same raw error.
  3. Agent layer checks isTransientNetErr(err) and currentAssistant == nil.
  4. Both true → wait 1s, retry. Up to 2 retries (3 total attempts), 1s + 2s
    backoff worst case.
  5. On success: stream proceeds normally. On exhaustion: surfaces the error
    through the existing error path unchanged.
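
The After steps can be sketched as a bounded loop with linear backoff. This is not the code in agent.go: `retryStream`, `simulate`, and `errReset` are hypothetical names, the `currentAssistant == nil` gate is assumed to be folded into the `retryable` predicate, and returning `ctx.Err()` when the wait is interrupted is an assumption about behavior the description does not spell out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// maxAttempts mirrors transientMaxAttempts = 3 from the description.
const maxAttempts = 3

// errReset is a placeholder for a transient network error.
var errReset = errors.New("connection reset by peer")

// retryStream calls stream up to maxAttempts times. Between attempts it
// waits attempt*step (step, then 2*step), and the wait ends early if ctx
// is cancelled. Non-retryable errors are returned immediately.
func retryStream(ctx context.Context, step time.Duration, stream func() error, retryable func(error) bool) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = stream()
		if err == nil || !retryable(err) || attempt == maxAttempts {
			return err
		}
		timer := time.NewTimer(time.Duration(attempt) * step)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err() // assumption: cancellation wins over the stream error
		case <-timer.C:
		}
	}
	return err
}

// simulate runs retryStream against a stream that fails `failures` times
// before succeeding, and reports how many calls were made.
func simulate(failures int) (int, error) {
	calls := 0
	err := retryStream(context.Background(), time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errReset
		}
		return nil
	}, func(err error) bool { return errors.Is(err, errReset) })
	return calls, err
}

func main() {
	calls, err := simulate(2)
	fmt.Println("two failures:", calls, err)
	calls, err = simulate(5)
	fmt.Println("five failures:", calls, err)
}
```

With `step` set to one second, the worst case waits 1s and then 2s, matching the numbers above. The demo uses a millisecond step so it finishes quickly.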

Why it's safe

  • No double execution. Retry is gated on currentAssistant == nil.
    currentAssistant is set inside PrepareStep, which only runs once the
    request reaches the model. If a single byte of response arrived, we don't
    retry — we surface the error.
  • No infinite loop. Hard cap of transientMaxAttempts = 3. Linear
    backoff. genCtx.Done() short-circuits the wait.
  • Conservative detector. Only matches syscall.ECONNRESET, syscall.EPIPE,
    io.ErrUnexpectedEOF, or the canonical "connection reset by peer" text.
    Context cancellation and deadlines are explicitly excluded so user-cancel
    semantics are preserved.
  • No state pollution. The user message is persisted before the loop;
    retries don't duplicate it. The title-generation goroutine is independent.
  • Complementary to fantasy's retry. Wrapped ProviderErrors that are
    retryable by HTTP status continue to flow through fantasy's existing
    logic; this layer only fills the unwrapped-syscall-error gap.
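
For reference, a detector with the properties listed under "Conservative detector" could look roughly like the sketch below. It is reconstructed from the bullet, not copied from transient.go; `isTransientNetErrSketch`, `sample`, and `samples` are hypothetical names, and checking context errors first is an assumption about how the exclusion is enforced.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"strings"
	"syscall"
)

// isTransientNetErrSketch reports whether err looks like a connection-level
// failure that is reasonable to retry.
func isTransientNetErrSketch(err error) bool {
	if err == nil {
		return false
	}
	// Context errors are excluded first so a user cancel is never retried,
	// even if the message also mentions a reset.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	// Sentinel match; errors.Is walks wrappers such as *net.OpError and %w.
	if errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, io.ErrUnexpectedEOF) {
		return true
	}
	// Text-only fallback for errors that lost their wrapped sentinel.
	return strings.Contains(err.Error(), "connection reset by peer")
}

// sample pairs an error with the result the detector is expected to give.
type sample struct {
	name string
	err  error
	want bool
}

// samples mirrors the cases listed in the test plan.
func samples() []sample {
	return []sample{
		{"nil", nil, false},
		{"canceled", context.Canceled, false},
		{"deadline", context.DeadlineExceeded, false},
		{"unrelated", errors.New("boom"), false},
		{"ECONNRESET", syscall.ECONNRESET, true},
		{"EPIPE", syscall.EPIPE, true},
		{"unexpected EOF", io.ErrUnexpectedEOF, true},
		{"OpError-wrapped", &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET}, true},
		{"fmt-wrapped", fmt.Errorf("stream: %w", syscall.EPIPE), true},
		{"text only", errors.New("read tcp: connection reset by peer"), true},
	}
}

func main() {
	for _, s := range samples() {
		fmt.Printf("%-16s transient=%v\n", s.name, isTransientNetErrSketch(s.err))
	}
}
```

Ordering matters in this sketch: the context check runs before the text fallback, so an error that wraps `context.Canceled` is rejected regardless of its message.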

Test plan

  • go test ./internal/agent/... passes.
  • go build ./... clean.
  • gofmt -l and goimports -l clean on changed files.
  • New unit tests cover: nil, ctx errors, unrelated errors, raw sentinels
    (ECONNRESET, EPIPE, io.ErrUnexpectedEOF), net.OpError-wrapped
    sentinel, fmt.Errorf-wrapped sentinel, and text-only fallback.

Files

  • internal/agent/transient.go: isTransientNetErr + cap constant (new).
  • internal/agent/transient_test.go: detector tests (new).
  • internal/agent/agent.go: wrap agent.Stream in bounded retry loop.

  • I have read CONTRIBUTING.md.

@johnjansen
Contributor Author

CI note: the Windows job failure is in internal/ui/diffview/TestDiffViewWidth — a chroma syntax-highlighting golden snapshot where a } rendered in a different color. That code is not touched by this PR (which only modifies internal/agent/). macOS passed; Ubuntu was cancelled by fail-fast propagation, not by an actual error (build + earlier test steps were green). Looks like a pre-existing Windows-specific rendering flake — happy to investigate separately if useful, but it's orthogonal to this change.

@johnjansen
Contributor Author

Superseded by #2945 (fantasy v0.25.0), which adds net.Error retry at the provider layer in v0.70.0. Closing.

johnjansen closed this May 19, 2026