Skip to content

fix(feishu): add transient network retry for message send operations#428

Open
0xsegfaulted wants to merge 1 commit intochenhg5:mainfrom
0xsegfaulted:fix/feishu-transient-send-retry
Open

fix(feishu): add transient network retry for message send operations#428
0xsegfaulted wants to merge 1 commit intochenhg5:mainfrom
0xsegfaulted:fix/feishu-transient-send-retry

Conversation

@0xsegfaulted
Copy link
Copy Markdown
Contributor

Summary

  • Add withTransientRetry wrapper with exponential backoff + jitter (500ms→1s→2s, max 3 retries, +25% jitter) for all Feishu HTTP API calls
  • Detect transient network errors via typed syscall checks (ECONNRESET, EPIPE), net.Error timeout, io.EOF, and string-matching fallback
  • Wrap all 8 Feishu API call sites: reply, create, patch, delete, upload image/file/audio, and preview start

Problem

When Feishu WebSocket connections experience rapid resets (every 20-40s for ~3 minutes), HTTP API calls for sending messages also fail with transient network errors (EOF, connection reset by peer, i/o timeout). Previously these errors were not retried — only token refresh retry (code 99991663) existed — causing messages to be silently lost with delays of 5-25 minutes.

Design

  • Layering: transient retry (outer) wraps token refresh retry (inner), so a transient retry can re-trigger a fresh token acquisition if needed
  • Jitter: +25% random jitter on each backoff delay to prevent thundering-herd retries from concurrent sessions
  • Non-transient errors (API error codes, rate limits, validation) are not retried — returned immediately

Test plan

  • TestIsTransientError — 16 error type subtests including syscall, net.Error, EOF, string matching, and negative cases
  • TestWithTransientRetry_* — 5 unit tests: success, retry-then-succeed, non-transient passthrough, max retry exhaustion, context cancellation
  • TestReplyRetriesOnTransientNetworkError — integration test with httptest + connection hijacking
  • TestCreateMessageRetriesOnTransientNetworkError — integration test
  • TestPatchMessageRetriesOnTransientError — integration test
  • TestReplyDoesNotRetryOnNonTransientAPIError — integration test
  • TestReplyTransientRetryThenTokenRefresh — validates both retry layers working together
  • TestWithTransientRetry_BackoffTiming — verifies backoff timing is reasonable
  • All existing feishu tests still pass
  • go build ./... clean

🤖 Generated with Claude Code

When Feishu WebSocket connections experience rapid resets, HTTP API calls
for sending messages also fail with transient network errors (EOF,
connection reset, timeout). Previously these errors were not retried,
causing messages to be silently lost.

Add withTransientRetry wrapper with exponential backoff + jitter
(500ms→1s→2s, max 3 retries, +25% jitter) around all Feishu API calls:
reply, create, patch, delete, upload image/file/audio, and preview.
Transient detection uses typed syscall checks (ECONNRESET, EPIPE),
net.Error timeout, io.EOF, and string matching as fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Owner

@chenhg5 chenhg5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Well-structured transient network retry implementation for Feishu.

Review summary:

  • withTransientRetry wrapper with exponential backoff + jitter (500ms→1s→2s, max 3 retries, +25% jitter)
  • ✅ Comprehensive transient error detection: syscall checks (ECONNRESET, EPIPE), net.Error timeout, io.EOF, string-matching fallback
  • ✅ Layered design: transient retry (outer) wraps token refresh retry (inner)
  • ✅ Non-transient errors (API errors, rate limits) are not retried
  • ✅ All 8 Feishu API call sites wrapped
  • ✅ Comprehensive test coverage: 16 error type subtests, integration tests with httptest + connection hijacking
  • ✅ CI passes (lint, unit-test, smoke-test, regression-test, performance-test)

Great contribution! Thanks for the detailed PR description and comprehensive tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants