fix: improve platform gateway reliability — multi-process protection, hook chat_id, and Feishu event queueing#4789

Open
milkoor wants to merge 2 commits into NousResearch:main from milkoor:fix/feishu-gateway-reliability

@milkoor milkoor commented Apr 3, 2026

Problem

Platform gateways (Feishu, Telegram, etc.) suffer from three reliability issues:

  1. Silent message drops: When the Feishu WebSocket callback fires before the main event loop is ready, inbound messages are dropped with a warning log but never delivered.
  2. Concurrent gateway processes: Multiple gateway processes (e.g. hermes run + hermes gateway) share the lark_oapi global state, meaning only the last-started process receives events. This causes silent disconnections where Feishu shows connected but messages go unanswered.
  3. Missing chat_id in hook context: Gateway hooks (e.g. card-streaming hooks) receive session_id but not chat_id, preventing platform adapters from creating platform-specific cards/messages without using internal session IDs.

Changes

gateway/run.py

  • Add chat_id to hook_ctx for agent:start events, enabling hook handlers to target the correct platform destination
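A minimal sketch of what the extended hook context could look like. The names `MessageSource`, `build_hook_ctx`, and `on_agent_start` are hypothetical and are not taken from gateway/run.py; only the `session_id` and `chat_id` keys come from the description above.

```python
from dataclasses import dataclass


@dataclass
class MessageSource:
    """Hypothetical stand-in for where an inbound message came from."""
    platform: str
    chat_id: str


def build_hook_ctx(session_id, source):
    """Build the context passed to agent:start hooks (assumed shape)."""
    return {
        "session_id": session_id,     # internal ID, already present before
        "platform": source.platform,  # assumed key, for illustration only
        "chat_id": source.chat_id,    # the newly added platform destination
    }


def on_agent_start(ctx):
    """Example hook: address the platform chat, not the internal session."""
    return f"{ctx['platform']}:{ctx['chat_id']}"
```

With `chat_id` in the context, a card-streaming hook can address the platform conversation directly and never needs to map a session ID back to a chat.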

hermes_cli/gateway.py

  • Prevent concurrent gateway instances by checking for existing gateway processes before starting
  • When an existing instance is detected, the new instance aborts with a message that points to hermes gateway run --replace as the override
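The startup decision described above could be sketched as a small pure function. `decide_startup` is a hypothetical name, and the list of PIDs is assumed to come from whatever process scan the CLI already performs; the message wording is illustrative.

```python
def decide_startup(existing_pids, replace=False):
    """Return (action, message) for a gateway that is about to start.

    action is "start", "replace", or "abort".
    """
    pids = sorted(existing_pids)
    if not pids:
        return "start", ""
    pid_list = ", ".join(str(pid) for pid in pids)
    if replace:
        # Caller is expected to stop the listed processes before starting.
        return "replace", f"Replacing running gateway (PID {pid_list})."
    return "abort", (
        f"A gateway is already running (PID {pid_list}). "
        "Use 'hermes gateway run --replace' to override."
    )
```

Keeping the decision separate from the process scan makes it easy to unit test without spawning real processes.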

gateway/platforms/feishu.py

  • Queue inbound Feishu WebSocket events when the adapter loop is not yet ready, then replay them once the loop is available
  • Previously these events were silently dropped, causing users to lose their first message after gateway restart
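A minimal sketch of the queue-then-replay idea, assuming events arrive on an SDK callback thread and are handled by a coroutine on the adapter's asyncio loop. `PendingEventBuffer` and its methods are hypothetical names, not the code in gateway/platforms/feishu.py.

```python
import asyncio
import threading
from collections import deque


class PendingEventBuffer:
    """Holds events until an asyncio loop is attached, then replays them."""

    def __init__(self, handler):
        self._handler = handler        # async callable taking one event
        self._loop = None              # set once the adapter loop is ready
        self._lock = threading.Lock()  # guards _loop and _pending
        self._pending = deque()

    def _schedule(self, event):
        # Thread-safe, non-blocking hand-off to the loop.
        asyncio.run_coroutine_threadsafe(self._handler(event), self._loop)

    def on_event(self, event):
        """Called from any thread: buffer if the loop is not ready yet."""
        with self._lock:
            if self._loop is None:
                self._pending.append(event)
            else:
                self._schedule(event)

    def attach_loop(self, loop):
        """Mark the loop ready and replay buffered events in arrival order."""
        with self._lock:
            self._loop = loop
            while self._pending:
                self._schedule(self._pending.popleft())
```

Scheduling happens while the lock is held so that an event arriving during the replay cannot overtake the buffered ones.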

Testing

  • ✅ All three patches apply cleanly against latest main (cc54818)
  • ✅ Gateway syntax check passes for all modified files
  • ✅ Tested on live Feishu gateway — messages flow correctly after restart

… hook context chat_id, and Feishu event queueing

- Add chat_id to agent:start hook context for platform adapters to create
  platform-specific cards without using internal session IDs
- Prevent concurrent gateway instances from sharing lark_oapi global
  state (causes event delivery failures and silent disconnections)
- Queue Feishu WebSocket events when adapter loop is not yet ready
  instead of silently dropping inbound messages
- Add --force-restart and --force-kill options to monitor_gateway.py
  with stale state detection and append-only logging

Co-authored-by: Hermes Agent

britrik commented Apr 3, 2026

Code Review: PR #4789

Summary

The PR introduces three reliability improvements:

  1. Feishu event queueing - queues events when the adapter loop isn't ready
  2. chat_id hook addition - adds chat_id to the agent:start hook context
  3. Multi-process protection - prevents duplicate gateway instances

Issues & Suggestions

1. Feishu Event Queueing (gateway/platforms/feishu.py)

Issue: Thread spawning per event is wasteful

Every inbound message event before the loop is ready spawns a new thread. If 10 events arrive quickly, that's 10 threads. Consider using a single background worker or batching.
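One way to avoid a thread per event is a single long-lived worker fed by a `queue.Queue`. This is a sketch under that assumption; `EventWorker` is a hypothetical name and does not appear in the PR.

```python
import logging
import queue
import threading

logger = logging.getLogger("gateway.worker.sketch")
_STOP = object()  # sentinel that tells the worker to exit


class EventWorker:
    """Processes submitted events in order on one background thread."""

    def __init__(self, process):
        self._process = process  # plain callable taking one event
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, event):
        self._queue.put(event)

    def _run(self):
        while True:
            event = self._queue.get()
            if event is _STOP:
                return
            try:
                self._process(event)
            except Exception:
                # Log and continue so one bad event does not kill the worker.
                logger.exception("event processing failed: %r", event)

    def close(self):
        """Process everything already queued, then stop the thread."""
        self._queue.put(_STOP)
        self._thread.join()
```

Ten events arriving in a burst then cost ten queue entries rather than ten threads, and ordering is preserved.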

Issue: No bounds checking on _pending_events
If the loop never becomes ready (e.g., prolonged startup issue), the list grows unbounded, potentially causing memory exhaustion. Consider adding a max queue size.
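A bounded buffer could be sketched with `collections.deque` and its `maxlen` argument, which evicts the oldest entry when full. `BoundedPending` is a hypothetical name, and dropping the oldest event (rather than the newest) is an assumption about the preferred policy.

```python
from collections import deque


class BoundedPending:
    """Keeps at most max_size events; counts how many were evicted."""

    def __init__(self, max_size):
        self._events = deque(maxlen=max_size)
        self.dropped = 0

    def push(self, event):
        if len(self._events) == self._events.maxlen:
            # The deque is full, so this append evicts the oldest event.
            self.dropped += 1
        self._events.append(event)

    def drain(self):
        """Return buffered events in arrival order and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events
```

The class is not thread-safe on its own; if `push` and `drain` run on different threads, the caller needs a lock around them. The `dropped` counter gives the adapter something concrete to log.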

Issue: Silent failure in _drain_pending_events
The call has no done_callback to log failures. If an event fails to process, it silently disappears.

Suggestion: Add a try/except around the coroutine call:
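A minimal sketch of such a guard, not the reviewer's original snippet. `process_guarded` and the logger name are hypothetical; the wrapper is meant to be what gets scheduled onto the adapter loop in place of the bare handler coroutine.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.feishu.sketch")


async def process_guarded(handler, event):
    """Await handler(event); log instead of losing the failure.

    Returns True on success and False if the handler raised.
    """
    try:
        await handler(event)
        return True
    except Exception:
        # logger.exception records the traceback along with the message.
        logger.exception("failed to process queued event %r", event)
        return False
```

From a non-loop thread this would be scheduled as `asyncio.run_coroutine_threadsafe(process_guarded(handler, event), loop)`. Cancellation still propagates, because `asyncio.CancelledError` does not derive from `Exception` on Python 3.8 and later.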


2. Multi-Process Protection (hermes_cli/gateway.py)

Issue: Fragile environment variable check

This could be bypassed if another process sets this env var. Consider using a lock file or a more robust mechanism like PID files with validation.
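A lock file along these lines could be sketched with `fcntl.flock`. This is POSIX-only (the `fcntl` module is unavailable on Windows), and `acquire_gateway_lock` and the lock path are hypothetical names rather than anything in hermes_cli/gateway.py.

```python
import fcntl
import os


def acquire_gateway_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns an open file descriptor that holds the lock, or None if
    another holder already has it. Keep the descriptor open for the
    lifetime of the gateway; closing it releases the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Record our PID for humans and tooling; the lock itself is the guard.
    os.ftruncate(fd, 0)
    os.write(fd, str(os.getpid()).encode())
    return fd
```

Because the kernel drops the lock when the holding process exits, a crashed gateway cannot leave a stale lock behind, which is the usual weakness of plain PID files and environment variable checks.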

Positive: The find_gateway_pids() function is solid and handles both Linux/macOS and Windows.


3. chat_id Hook Addition (gateway/run.py)

Looks good: Straightforward addition. The value is used extensively elsewhere in the codebase (lines 1717, 1727, 2477, etc.), so this should work correctly.


Overall

The changes address real reliability issues. The Feishu queueing could be improved with a single worker thread instead of per-event threads, and failure handling could be better. The multi-process protection and chat_id hook are solid.

…event SDK error loops

The Lark SDK emits persistent errors for unhandled
im.chat.access_event.bot_p2p_chat_entered_v1 events (triggered
when users open the bot DM chat). These errors eventually cause
the WebSocket connection to drop.

Register a no-op handler to suppress the errors.

Fixes: NousResearch#4789