fix: improve platform gateway reliability (multi-process protection, hook chat_id, and Feishu event queueing) #4789
… hook context chat_id, and Feishu event queueing

- Add chat_id to agent:start hook context for platform adapters to create platform-specific cards without using internal session IDs
- Prevent concurrent gateway instances from sharing lark_oapi global state (causes event delivery failures and silent disconnections)
- Queue Feishu WebSocket events when adapter loop is not yet ready instead of silently dropping inbound messages
- Add --force-restart and --force-kill options to monitor_gateway.py with stale state detection and append-only logging

Co-authored-by: Hermes Agent
Code Review: PR #4789

Summary

The PR introduces three reliability improvements: Feishu event queueing, multi-process protection, and a chat_id addition to the hook context.
Issues & Suggestions

1. Feishu Event Queueing (gateway/platforms/feishu.py)

Issue: Thread spawning per event is wasteful. Every inbound message event that arrives before the loop is ready spawns a new thread. If 10 events arrive quickly, that's 10 threads. Consider using a single background worker or batching.

Issue: No bounds checking on _pending_events.

Issue: Silent failure in _drain_pending_events.

Suggestion: Add a try/except around the coroutine call.

2. Multi-Process Protection (hermes_cli/gateway.py)

Issue: Fragile environment variable check. The check could be bypassed if another process sets this env var. Consider using a lock file or a more robust mechanism such as PID files with validation.

Positive: The find_gateway_pids() function is solid and handles both Linux/macOS and Windows.

3. chat_id Hook Addition (gateway/run.py)

Looks good: Straightforward addition. The value is used extensively elsewhere in the codebase (lines 1717, 1727, 2477, etc.), so this should work correctly.

Overall

The changes address real reliability issues. The Feishu queueing could be improved with a single worker thread instead of per-event threads, and its failure handling could be more explicit. The multi-process protection and chat_id hook are solid.
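The three Feishu suggestions in the review (no per-event threads, a bound on the pending queue, and a try/except around the coroutine call) could be combined roughly as follows. This is a minimal sketch, not the PR's code: PendingEventBuffer, submit, attach_loop, and MAX_PENDING_EVENTS are hypothetical names, and dropping the oldest event on overflow is an assumed policy.

```python
import asyncio
import logging
import threading
from collections import deque

logger = logging.getLogger(__name__)

MAX_PENDING_EVENTS = 100  # hypothetical cap; the PR does not specify one


class PendingEventBuffer:
    """Hypothetical helper: holds events until the adapter loop is ready."""

    def __init__(self, max_pending=MAX_PENDING_EVENTS):
        self._pending = deque()
        self._max_pending = max_pending
        self._lock = threading.Lock()
        self._loop = None
        self.dropped = 0

    def submit(self, handler, event):
        """Called from the SDK thread; schedules or buffers, never spawns a thread."""
        with self._lock:
            if self._loop is not None:
                asyncio.run_coroutine_threadsafe(self._run(handler, event), self._loop)
                return
            if len(self._pending) >= self._max_pending:
                # Assumed policy: drop the oldest event and count the loss.
                self._pending.popleft()
                self.dropped += 1
                logger.warning("Pending queue full, dropped oldest event")
            self._pending.append((handler, event))

    def attach_loop(self, loop):
        """Called once the loop is ready; drains the backlog in arrival order."""
        with self._lock:
            self._loop = loop
            while self._pending:
                handler, event = self._pending.popleft()
                asyncio.run_coroutine_threadsafe(self._run(handler, event), loop)

    async def _run(self, handler, event):
        # Guard each event so one failing handler is logged, not silently lost.
        try:
            await handler(event)
        except Exception:
            logger.exception("Failed to process queued event")
```

Scheduling happens while the lock is held so that buffered events are always handed to the loop before events that arrive after attach_loop; asyncio.run_coroutine_threadsafe only enqueues work, so the lock is held briefly.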
…event SDK error loops

The Lark SDK emits persistent errors for unhandled im.chat.access_event.bot_p2p_chat_entered_v1 events (triggered when users open the bot DM chat). These errors eventually cause the WebSocket connection to drop. Register a no-op handler to suppress the errors.

Fixes: NousResearch#4789
Problem
Platform gateways (Feishu, Telegram, etc.) suffer from three reliability issues:
- … hermes run + hermes gateway) share the lark_oapi global state, meaning only the last-started process receives events. This causes silent disconnections where Feishu shows connected but messages go unanswered.
- chat_id in hook context: Gateway hooks (e.g. card-streaming hooks) receive session_id but not chat_id, preventing platform adapters from creating platform-specific cards/messages without using internal session IDs.

Changes
- gateway/run.py: Add chat_id to hook_ctx for agent:start events, enabling hook handlers to target the correct platform destination
- hermes_cli/gateway.py: … hermes gateway run --replace to override)
- gateway/platforms/feishu.py

Testing
main (cc54818)
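For the multi-process guard in hermes_cli/gateway.py, the review's suggestion of a PID file with validation could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: pid_is_alive and acquire_pid_file are hypothetical names, the liveness check relies on POSIX semantics of os.kill(pid, 0) and is not suitable for Windows, and the check-then-write sequence is not atomic (a stricter version might create the file with os.O_EXCL or hold an advisory lock).

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_pid_file(path: Path) -> bool:
    """Claim the PID file; return False if another live process holds it."""
    try:
        existing = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        existing = None  # missing or corrupt file is treated as stale
    if existing is not None and existing != os.getpid() and pid_is_alive(existing):
        return False
    path.write_text(str(os.getpid()))
    return True
```

A gateway entry point could call acquire_pid_file at startup and refuse to run when it returns False. Validating that the recorded PID is still alive is what lets a stale file left by a crashed process be reclaimed instead of blocking every later start.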