@kthui
Contributor

@kthui kthui commented Oct 24, 2025

Overview:

When multiple ETCD server replicas are available, the client should automatically fail over to another ETCD server when the connection to the current server is lost, providing resilience to any single ETCD server failure.

Details:

  • Lease keep-alive failover ✅
  • Lease watcher failover 🚧 - will be included in the next PR
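
As a rough illustration of the lease keep-alive failover idea, here is a minimal synchronous sketch. It is not the code in this PR: `Tick`, `LeaseStream`, `ScriptedStream`, and this `keep_alive` signature are hypothetical stand-ins, and the real implementation is async and built on etcd_client.

```rust
/// Result of one keep-alive round trip on a hypothetical lease stream.
#[derive(Debug, PartialEq)]
enum Tick {
    Renewed,
    StreamLost,
}

/// Stand-in for a lease keep-alive stream (assumption: not the etcd_client API).
trait LeaseStream {
    fn tick(&mut self) -> Tick;
}

/// Keeps a lease alive until `renewals` ticks have succeeded.
/// When the stream is lost, `reopen` is asked for a stream on another server.
/// Returns how many times the stream was reopened, or an error when no
/// replacement stream is available, so the caller can shut down gracefully.
fn keep_alive<S: LeaseStream>(
    mut stream: S,
    mut reopen: impl FnMut() -> Option<S>,
    renewals: usize,
) -> Result<usize, &'static str> {
    let mut renewed = 0;
    let mut reopened = 0;
    while renewed < renewals {
        match stream.tick() {
            Tick::Renewed => renewed += 1,
            Tick::StreamLost => match reopen() {
                Some(next) => {
                    // Fail over: continue renewing on the replacement stream.
                    stream = next;
                    reopened += 1;
                }
                None => return Err("no ETCD server reachable"),
            },
        }
    }
    Ok(reopened)
}

/// Illustrative double: renews `healthy_ticks` times, then reports the stream as lost.
struct ScriptedStream {
    healthy_ticks: usize,
}

impl LeaseStream for ScriptedStream {
    fn tick(&mut self) -> Tick {
        if self.healthy_ticks == 0 {
            Tick::StreamLost
        } else {
            self.healthy_ticks -= 1;
            Tick::Renewed
        }
    }
}
```

In this sketch the error branch corresponds to the "graceful shutdown when failover fails" case: the loop stops renewing and hands the decision back to the caller instead of retrying forever.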

Tests:

  • Basic aggregated serving ETCD failover ✅ - the test is currently allowed to fail; that allowance should be removed in the next PR
  • Graceful shutdown when ETCD failover fails ✅

Where should the reviewer start?

Start with the test cases, then move on to the new connector.rs, which replaces the etcd_client::Client in etcd.rs, and then the other Rust pieces.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

@kthui kthui self-assigned this Oct 24, 2025
@github-actions github-actions bot added the feat label Oct 24, 2025
@kthui kthui force-pushed the jacky-etcd-disconnect branch from 69a33de to 88bae48 Compare October 24, 2025 20:08
@kthui kthui force-pushed the jacky-etcd-disconnect branch from 88bae48 to 7ea9720 Compare October 24, 2025 21:18
@kthui kthui changed the title feat: ETCD high availability client failover feat: ETCD high availability client failover - lease keep-alive Oct 24, 2025
@kthui kthui changed the title feat: ETCD high availability client failover - lease keep-alive feat: ETCD high availability client failover - lease keep-alive resilience Oct 24, 2025
@kthui kthui marked this pull request as ready for review October 24, 2025 21:37
@kthui kthui requested review from a team as code owners October 24, 2025 21:37
kthui added 2 commits October 24, 2025 14:47
* [PoC] Recover from ETCD server disconnect

Signed-off-by: Jacky <[email protected]>
@rmccorm4
Contributor

@coderabbitai review

@kthui
Contributor Author

kthui commented Oct 24, 2025

@coderabbitai review

it disappeared from the checks

@kthui kthui requested a review from Copilot October 24, 2025 22:05
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements ETCD high availability support for lease keep-alive operations, enabling automatic client failover when an ETCD server becomes unavailable. The implementation introduces a Connector abstraction that manages ETCD client connections and handles reconnection logic with exponential backoff.

Key changes:

  • Implemented Connector for managing ETCD client connections with automatic reconnection
  • Enhanced keep_alive function to detect stream failures and trigger reconnection
  • Updated all ETCD client access points to use the new async-aware connection management
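
The reconnection-with-exponential-backoff behavior described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the contents of connector.rs: `Backoff`, `reconnect`, the endpoint rotation order, and the injected `connect` and `sleep` callbacks are all hypothetical, and the real connector is async.

```rust
use std::time::Duration;

/// Hypothetical backoff policy; field names and values are illustrative only.
#[derive(Debug, Clone)]
struct Backoff {
    initial: Duration,
    max: Duration,
    current: Duration,
}

impl Backoff {
    fn new(initial: Duration, max: Duration) -> Self {
        Self { initial, max, current: initial }
    }

    /// Returns the delay to wait now, then doubles the next delay, capped at `max`.
    fn next_delay(&mut self) -> Duration {
        let delay = self.current;
        self.current = self.current.saturating_mul(2).min(self.max);
        delay
    }

    /// Resets the delay to `initial` after a successful connection.
    fn reset(&mut self) {
        self.current = self.initial;
    }
}

/// Tries the endpoints in round-robin order, at most `max_attempts` times,
/// calling `sleep` with a growing delay after each failed attempt.
/// Returns the first successful connection, or `None` if every attempt fails.
fn reconnect<C, E>(
    endpoints: &[&str],
    max_attempts: usize,
    backoff: &mut Backoff,
    mut connect: impl FnMut(&str) -> Result<C, E>,
    mut sleep: impl FnMut(Duration),
) -> Option<C> {
    if endpoints.is_empty() {
        return None;
    }
    for attempt in 0..max_attempts {
        let endpoint = endpoints[attempt % endpoints.len()];
        match connect(endpoint) {
            Ok(client) => {
                backoff.reset();
                return Some(client);
            }
            Err(_) => {
                // No point sleeping after the final attempt.
                if attempt + 1 < max_attempts {
                    sleep(backoff.next_delay());
                }
            }
        }
    }
    None
}
```

Passing `connect` and `sleep` in as closures keeps the sketch free of any real ETCD dependency and makes the retry schedule easy to check in a unit test; in an async runtime the same structure would await a timer instead of calling a blocking sleep.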

Reviewed Changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/fault_tolerance/etcd_ha/utils.py | Test utilities for ETCD HA scenarios including cluster management and process monitoring |
| tests/fault_tolerance/etcd_ha/test_vllm.py | Integration tests verifying ETCD failover behavior and graceful shutdown |
| lib/runtime/src/transports/etcd/connector.rs | New connector module managing ETCD connections with reconnection logic |
| lib/runtime/src/transports/etcd/lease.rs | Updated lease keep-alive to use connector and handle stream reconnection |
| lib/runtime/src/transports/etcd.rs | Refactored client to use connector and provide async client access |
| lib/runtime/src/transports/etcd/lock.rs | Updated to use async client accessor |
| lib/runtime/src/storage/key_value_store/etcd.rs | Updated to use async client accessor |

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Jacky <[email protected]>
@kthui kthui requested a review from keivenchang October 24, 2025 23:39