fix(chat): WebSocket reliability, scroll race fix, and UX improvements #86

Open
dvlexp wants to merge 4 commits into evolution-foundation:main from dvlexp:fix/chat-ws-reliability

Conversation

@dvlexp dvlexp commented May 23, 2026

Problem

Three independent chat stability issues affecting production use:

  1. WebSocket sessions dying in background tabs: with no keepalive mechanism, half-open connections appeared alive but silently dropped messages
  2. Forced scroll interrupting reading: the chat jumped to the bottom on every new message, even when the user had scrolled up to read history
  3. Crash on undefined blocks: AgentChat rendered blocks unconditionally and crashed when blocks was absent from a message

Solution

  • websocket-client>=1.9 added to deps (pyproject.toml + uv.lock): required by the keepalive implementation
  • Keepalive ping/pong in AgentChat: sends a ping every 25s, resets on any incoming message; detects half-open connections before they cause user-visible failures
  • Auto-reconnect with exponential backoff: on unexpected close, retries with a doubling delay (1s→2s→4s, capped at 30s), up to 5 attempts; surfaces status to the user
  • Scroll anchor respect: tracks isAtBottom state; auto-scroll only triggers when the user is already at the bottom; adds a "scroll to bottom" button when pinned elsewhere
  • Defensive blocks render: guards all block rendering with null checks; no more crashes on messages with missing/undefined structure
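
The reconnect schedule in the list above can be sketched as a pure function. This is a minimal illustration rather than the code in this PR: nextReconnectDelay and backoffSchedule are hypothetical names, and the cap and attempt limit are left as parameters instead of fixed values.

```typescript
// Hypothetical helper: doubles the delay after a failed attempt, up to a cap.
function nextReconnectDelay(currentMs: number, capMs: number): number {
  return Math.min(currentMs * 2, capMs);
}

// Hypothetical helper: lists the delay used before each of maxAttempts retries.
function backoffSchedule(initialMs: number, capMs: number, maxAttempts: number): number[] {
  const delays: number[] = [];
  let delay = initialMs;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = nextReconnectDelay(delay, capMs);
  }
  return delays;
}
```

For example, backoffSchedule(1000, 30000, 6) yields delays of 1000, 2000, 4000, 8000, 16000 and 30000 ms. Keeping the schedule separate from the timer code makes it easy to check without opening a socket.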

Test Plan

  • Open chat in a background tab for 10+ minutes, return — connection is still active
  • Start reading old messages by scrolling up — new arriving messages do not force-scroll to bottom
  • Verify "scroll to bottom" button appears when scrolled up
  • Send a message that returns an empty blocks field — no crash, graceful render
  • Kill the backend mid-session — UI shows reconnecting state, recovers automatically

Commits

Hash Description
3dd0acf chore(deps): add websocket-client + regenerate lock
6d51260 fix(chat): keep WS alive on idle + auto-reconnect
9c6a9a4 fix(chat): respect manual scroll (don't jump to bottom)
81dcebe fix(chat): scroll race + half-open WS + defensive blocks render

Summary by Sourcery

Improve AgentChat WebSocket reliability and chat scrolling UX while hardening message rendering against missing blocks.

New Features:

  • Add a jump-to-bottom control that appears when the user scrolls up in the chat history.

Bug Fixes:

  • Keep AgentChat WebSocket sessions alive across idle periods using ping/pong heartbeats and prevent half-open connections from silently dropping messages.
  • Introduce automatic WebSocket reconnection with backoff when the connection closes unexpectedly.
  • Respect the user’s scroll position so new messages only auto-scroll when the view is already near the bottom, avoiding forced scroll jumps during reading.
  • Guard assistant message rendering and text extraction against missing or undefined blocks to avoid runtime crashes.
  • Prevent the backend terminal proxy from closing idle WebSocket connections on simple receive() timeouts, preserving client sessions.

Enhancements:

  • Allow users to continue typing during transient WebSocket reconnects by only disabling input on hard errors.

Build:

  • Add websocket-client as a backend dependency and update the lockfile.

dvlexp and others added 4 commits May 22, 2026 21:36
websocket-client>=1.9 required by local skills/tools. Lock regenerated
after merging origin/main; pulls in flask-limiter (rate-limit on public
share endpoint, evolution-foundation#52), limits, wrapt, ordered-set, deprecated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
terminal_proxy.py closed the WS bridge when client_ws.receive(timeout=30)
returned None, but None means a timeout, not a disconnection (a real
disconnection raises ConnectionClosed). In background tabs the browser
throttles AgentChat's 25s ping to <1×/min, which hit the timeout and
tore down the WS while the peer was still alive. Since the frontend had
no auto-reconnect, wsRef.current kept pointing at a closed socket and
sendMessage() returned silently: the "I click send and nothing happens"
symptom reported by users.

Backend: continue instead of break on the receive timeout. Real
disconnections still fall through to except → finally as before.

Frontend: useEffect refactored into a reusable connect() function, with
1s→30s backoff triggered from onclose. The ping interval is now per-WS
(localPing) so that the onclose of an old WS does not clear the new
one's ping during chained reconnects. wsRef is now cleared in onclose
to avoid sends on a dead socket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
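
The backend change in the commit above (continue instead of break on a receive timeout) lives in Python in terminal_proxy.py and is not shown on this page. The following TypeScript model only illustrates the control flow under stated assumptions: Receive and pumpUntilClosed are hypothetical names, a null return stands in for the timeout, and a thrown error stands in for ConnectionClosed.

```typescript
// Hypothetical model of a blocking receive: returns a message, returns null
// on timeout, and throws when the peer has really disconnected.
type Receive = () => string | null;

// Forwards messages until receive() throws. A null result (timeout) is
// skipped with `continue` instead of ending the loop.
// Returns how many timeouts were tolerated, for illustration only.
function pumpUntilClosed(receive: Receive, forward: (msg: string) => void): number {
  let timeouts = 0;
  for (;;) {
    let msg: string | null;
    try {
      msg = receive();
    } catch {
      return timeouts; // real disconnection: leave the loop
    }
    if (msg === null) {
      timeouts++;
      continue; // idle peer: keep the bridge open
    }
    forward(msg);
  }
}
```

With a break in place of the continue, the first idle period would end the loop even though the peer is still connected, which is the failure the commit message describes.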
Before: scrollToBottom() forced scrollTop = scrollHeight on every render/
delta, so the user could not scroll up to read history; the next
message (or each stream chunk) threw them back to the bottom.

Now: isAtBottomRef tracks whether the user is near the bottom (<50px).
If they scrolled up, scrollToBottom() becomes a no-op and a floating
"Ir para o final" ("Go to the end") button appears, which re-enables
follow when clicked. Sending a new message or re-sending (edit/rewind)
forces the flag back to true because it is a clear signal of "I want to
see the result". History restore (session_joined, chat_history) also
forces true: when you open the session you always land at the bottom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
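
The follow-the-bottom rule from the commit above can be sketched without a DOM. This is an illustration under assumptions, not the component code: ScrollMetrics, distanceFromBottom and FollowTracker are hypothetical names standing in for isAtBottomRef and its handlers; only the 50px threshold comes from the commit message.

```typescript
// Subset of the DOM scroll properties the check needs.
interface ScrollMetrics {
  scrollTop: number;
  clientHeight: number;
  scrollHeight: number;
}

const NEAR_BOTTOM_PX = 50; // threshold quoted in the commit message

// Pixels between the bottom of the viewport and the end of the content.
function distanceFromBottom(m: ScrollMetrics): number {
  return m.scrollHeight - m.scrollTop - m.clientHeight;
}

// Hypothetical tracker standing in for isAtBottomRef and its handlers.
class FollowTracker {
  isAtBottom = true;

  // Call from the container's scroll handler.
  onScroll(m: ScrollMetrics): void {
    this.isAtBottom = distanceFromBottom(m) < NEAR_BOTTOM_PX;
  }

  // Returns the scrollTop to apply when new content arrives, or null when
  // the user has scrolled away. Browsers clamp scrollTop to the maximum,
  // so scrollHeight is a safe "go to the end" value.
  targetScrollTop(m: ScrollMetrics): number | null {
    return this.isAtBottom ? m.scrollHeight : null;
  }

  // Sending a message or pressing the jump button re-enables follow.
  forceFollow(): void {
    this.isAtBottom = true;
  }
}
```

In a component, onScroll would be fed from the messages container and a null from targetScrollTop would be the cue to show the jump-to-bottom button.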
…nder

3 unrelated reliability fixes accumulated in AgentChat.tsx:

1. Scroll race: under heavy streaming (60+ deltas/sec) the scrollToBottom
   rAF could fire before the user's onScroll propagated, leaving
   isAtBottomRef stale and teleporting them back to bottom. Now we
   recompute distance-from-bottom inside the rAF callback and update the
   flag proactively if the user moved away.

2. Heartbeat timeout: track lastPongAt on every server pong. The ws lib
   can leave half-open sockets (TCP dead, no close frame), so onclose
   never fires and chat_event stops silently. The interval now also
   checks for stale pongs and forces a reconnect.

3. Defensive blocks render: msg.blocks could be undefined for assistant
   messages mid-stream or from legacy formats, crashing
   .map/.some/.filter and triggering the ErrorBoundary ("Unable to load
   dashboard section"). Fixed in 3 sites: getMessageText (line 771),
   blocks rendering (1278), streaming hasVisibleContent check (1292).

Bug 3 was breaking the entire /agents/<name> page for every agent with
chat history — reported today on clawdia-assistant but not specific.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
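
The defensive blocks render fix in the commit above amounts to defaulting a missing blocks array before calling array methods on it. A minimal sketch follows; the Block and ChatMessage shapes are assumptions made for illustration, and the function bodies are not the ones in AgentChat.tsx.

```typescript
// Assumed block shapes; the real message format is not shown in this PR page.
interface TextBlock { type: "text"; text: string; }
interface ToolBlock { type: "tool_use"; name: string; }
type Block = TextBlock | ToolBlock;

interface ChatMessage {
  role: "user" | "assistant";
  blocks?: Block[]; // may be absent mid-stream or in legacy messages
}

// Joins the text of all text blocks, treating a missing array as empty.
function getMessageText(msg: ChatMessage): string {
  const blocks = msg.blocks ?? [];
  return blocks
    .filter((b): b is TextBlock => b.type === "text")
    .map((b) => b.text)
    .join("");
}

// True when at least one text block has something to show.
function hasVisibleContent(msg: ChatMessage): boolean {
  return (msg.blocks ?? []).some((b) => b.type === "text" && b.text.length > 0);
}
```

Defaulting once at the top of each function keeps every later .map, .some, or .filter call safe without repeating the check.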
sourcery-ai Bot commented May 23, 2026

Reviewer's Guide

Implements WebSocket keepalive and auto-reconnect for AgentChat, refines scroll behavior to respect manual scrolling with a jump-to-bottom control, and hardens block rendering and backend WS proxy behavior to avoid crashes and silent disconnects, plus adds the websocket-client dependency.

Sequence diagram for AgentChat WebSocket keepalive and auto-reconnect

sequenceDiagram
  actor User
  participant AgentChat
  participant BrowserWS as WebSocket
  participant TerminalProxy
  participant Upstream as UpstreamServer

  User->>AgentChat: open AgentChat(sessionId)
  activate AgentChat
  AgentChat->>AgentChat: connect()
  AgentChat->>TerminalProxy: requests.get(TS_HTTP + /health)
  TerminalProxy-->>AgentChat: 200 OK
  AgentChat->>BrowserWS: new WebSocket(TS_WS + /ws)
  BrowserWS-->>AgentChat: onopen
  AgentChat->>BrowserWS: ws.send({ type: join_session, sessionId })
  AgentChat->>AgentChat: reconnectDelayRef = 1000
  AgentChat->>AgentChat: setInterval(localPing, 25000)

  loop keepalive
    AgentChat->>BrowserWS: ws.send({ type: ping })
    BrowserWS->>TerminalProxy: ping
    TerminalProxy->>Upstream: forward ping
    Upstream-->>TerminalProxy: pong
    TerminalProxy-->>BrowserWS: pong
    BrowserWS-->>AgentChat: message type=pong
    AgentChat->>AgentChat: lastPongAt = Date.now()
  end

  note over AgentChat,BrowserWS: Heartbeat timeout
  AgentChat->>AgentChat: [Date.now() - lastPongAt > 60000]
  AgentChat->>BrowserWS: ws.close()

  BrowserWS-->>AgentChat: onclose
  AgentChat->>AgentChat: clearInterval(localPing)
  AgentChat->>AgentChat: scheduleReconnect()
  AgentChat->>AgentChat: setTimeout(connect, reconnectDelayRef)

  AgentChat->>AgentChat: reconnectDelayRef = min(reconnectDelayRef * 2, 30000)
  AgentChat->>AgentChat: connect()  // new WebSocket
  deactivate AgentChat
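
The keepalive and heartbeat-timeout steps of the diagram can be sketched independently of a real socket. This is a simplified model under assumptions: Heartbeat and HeartbeatSocket are hypothetical names, time is passed in explicitly so the logic can be exercised without timers, and the 25000 ms and 60000 ms values are the ones shown in the diagram.

```typescript
const PING_INTERVAL_MS = 25_000; // interval shown in the diagram
const PONG_TIMEOUT_MS = 60_000;  // staleness limit shown in the diagram

// Minimal surface of a socket that the heartbeat needs.
interface HeartbeatSocket {
  send(data: string): void;
  close(): void;
}

// Hypothetical per-connection heartbeat: one instance per socket.
class Heartbeat {
  private lastPongAt: number;
  private readonly socket: HeartbeatSocket;
  private readonly timeoutMs: number;

  constructor(socket: HeartbeatSocket, now: number, timeoutMs: number = PONG_TIMEOUT_MS) {
    this.socket = socket;
    this.lastPongAt = now;
    this.timeoutMs = timeoutMs;
  }

  // Call when a message of type "pong" arrives.
  onPong(now: number): void {
    this.lastPongAt = now;
  }

  // Call once per ping interval. Closes the socket when the last pong is
  // too old, otherwise sends another ping.
  tick(now: number): "pinged" | "closed" {
    if (now - this.lastPongAt > this.timeoutMs) {
      this.socket.close();
      return "closed";
    }
    this.socket.send(JSON.stringify({ type: "ping" }));
    return "pinged";
  }
}
```

In a browser, tick(Date.now()) would typically run from a setInterval created for that socket with PING_INTERVAL_MS, and the resulting close would lead into the reconnect path shown in the diagram.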

File-Level Changes

Change: Add resilient WebSocket connection management with keepalive, heartbeat timeouts, and exponential backoff reconnect in AgentChat.
Details:
  • Refactor WebSocket connection logic into a reusable connect() function with scheduleReconnect and cancellation handling.
  • Introduce reconnectTimerRef and reconnectDelayRef to implement capped exponential backoff and prevent overlapping reconnect timers.
  • Reset reconnect delay on successful onopen and clear timers/refs on cleanup to avoid stale state across session changes.
  • Add pong-based heartbeat tracking with lastPongAt and a per-connection ping interval that force-closes and reconnects on missing heartbeats.
  • Modify onerror and onclose handlers to rely on reconnect logic instead of surfacing transient errors or leaving closed sockets in wsRef.
  • Ensure backend terminal_proxy receive loop treats None as a timeout and continues, avoiding dropping idle connections.
Files:
  • dashboard/frontend/src/components/AgentChat.tsx
  • dashboard/backend/routes/terminal_proxy.py

Change: Improve chat scrolling UX by tracking bottom position, preventing forced scroll when user is reading history, and adding a jump-to-bottom control.
Details:
  • Introduce isAtBottomRef and showJumpToBottom state to track whether auto-scroll should follow new messages.
  • Rewrite scrollToBottom to recompute distance from bottom inside requestAnimationFrame to avoid race conditions with user scrolls.
  • Add forceScrollToBottom for explicit user-initiated jumps that reset the bottom-following state.
  • Implement handleScroll and attach it to the messages container to update bottom state with a 50px threshold and toggle the jump-to-bottom button.
  • Ensure reconnect/history restore, sendMessage, and editMessage flows reset isAtBottomRef and hide the jump-to-bottom button when appropriate.
  • Render a floating jump-to-bottom button positioned over the messages area using ChevronDown and accentColor styling.
Files:
  • dashboard/frontend/src/components/AgentChat.tsx

Change: Make AgentChat robust to missing or empty blocks and avoid disabling input on transient connection issues.
Details:
  • Guard block usage by defaulting msg.blocks to an empty array when extracting text and when rendering assistant blocks and typing indicators.
  • Adjust inputDisabled to only depend on effectiveError, allowing typing during reconnects while still gating sends via canSend.
  • Initialize scroll position flags to bottom when histories are loaded to keep behavior consistent on session join.
Files:
  • dashboard/frontend/src/components/AgentChat.tsx

Change: Add websocket-client dependency required for backend WebSocket client behavior and update lockfile.
Details:
  • Add websocket-client>=1.9 to application dependencies in pyproject.toml.
  • Regenerate uv.lock to capture the new dependency resolution.
Files:
  • pyproject.toml
  • uv.lock

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • The floating "Ir para o final" button is rendered as a sibling of the scroll container but uses absolute positioning; consider nesting it inside the relatively positioned scroll container (or adding a dedicated positioned wrapper) so its placement is reliably anchored to the chat area rather than the page.
  • New user-facing strings and aria-labels for the jump-to-bottom button are in Portuguese while the rest of the component appears English; consider aligning language for consistency and accessibility.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The floating "Ir para o final" button is rendered as a sibling of the scroll container but uses absolute positioning; consider nesting it inside the relatively positioned scroll container (or adding a dedicated positioned wrapper) so its placement is reliably anchored to the chat area rather than the page.
- New user-facing strings and aria-labels for the jump-to-bottom button are in Portuguese while the rest of the component appears English; consider aligning language for consistency and accessibility.

## Individual Comments

### Comment 1
<location path="dashboard/frontend/src/components/AgentChat.tsx" line_range="1010-1018" />
<code_context>

   const isConnecting = externalLoading || status === 'connecting'
   const effectiveError = externalError || (status === 'error' ? errorMsg : null)
-  const inputDisabled = isConnecting || !!effectiveError
+  // Don't disable the textarea during transient reconnects — disabling blurs
+  // the cursor (native behavior of <textarea disabled>), which forces the user
+  // to click back in after every WS hiccup. canSend already gates the Send
+  // button on readyState === OPEN, so typing while the socket is down is safe:
+  // the text stays in React state and gets sent the moment the WS reopens.
+  // Only hard errors (server unreachable) still disable input.
+  const inputDisabled = !!effectiveError
   const canSend = (input.trim().length > 0 || attachedFiles.length > 0) && !inputDisabled && status !== 'running'

</code_context>
<issue_to_address>
**nitpick:** `isConnecting` is now unused and the new `inputDisabled` behavior might surprise future readers.

Since `isConnecting` is no longer used (and not referenced elsewhere), please remove it to avoid confusion. Also consider a brief inline comment or a more specific name for `inputDisabled` to clarify that it now reflects only error state, not connecting state, so future changes don’t accidentally reintroduce the old behavior.
</issue_to_address>
