feat: ETCD high availability client failover - lease keep-alive resilience #3868
base: main
69a33de to 88bae48
88bae48 to 7ea9720
* [PoC] Recover from ETCD server disconnect
Signed-off-by: Jacky <[email protected]>
7ea9720 to cf01d5c
@coderabbitai review
it disappeared from the checks
Pull Request Overview
This PR implements ETCD high availability support for lease keep-alive operations, enabling automatic client failover when an ETCD server becomes unavailable. The implementation introduces a Connector abstraction that manages ETCD client connections and handles reconnection logic with exponential backoff.
Key changes:
- Implemented `Connector` for managing ETCD client connections with automatic reconnection
- Enhanced `keep_alive` function to detect stream failures and trigger reconnection
- Updated all ETCD client access points to use the new async-aware connection management
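As a rough illustration of the exponential backoff mentioned in the overview, here is a minimal sketch using only the standard library. The `Backoff` type, its fields, and the delay values are hypothetical and are not taken from `connector.rs`; the actual policy in this PR may differ (for example, it may add jitter or a deadline).

```rust
use std::time::Duration;

/// Hypothetical backoff policy for reconnect attempts.
/// Names and values are illustrative, not from connector.rs.
#[derive(Debug, Clone)]
struct Backoff {
    initial: Duration, // delay before the first retry
    max: Duration,     // upper bound on any single delay
    attempt: u32,      // number of delays handed out since the last reset
}

impl Backoff {
    fn new(initial: Duration, max: Duration) -> Self {
        Self { initial, max, attempt: 0 }
    }

    /// Returns the delay to wait before the next attempt.
    /// The delay doubles on every call and is capped at `max`.
    fn next_delay(&mut self) -> Duration {
        let factor = 2u32.saturating_pow(self.attempt);
        let delay = self.initial.saturating_mul(factor).min(self.max);
        self.attempt = self.attempt.saturating_add(1);
        delay
    }

    /// Call after a successful reconnect so the next failure starts small again.
    fn reset(&mut self) {
        self.attempt = 0;
    }
}
```

A reconnect loop would sleep for `next_delay()` between attempts and call `reset()` once a connection succeeds, so a later outage does not start from the capped delay.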
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/fault_tolerance/etcd_ha/utils.py | Test utilities for ETCD HA scenarios including cluster management and process monitoring |
| tests/fault_tolerance/etcd_ha/test_vllm.py | Integration tests verifying ETCD failover behavior and graceful shutdown |
| lib/runtime/src/transports/etcd/connector.rs | New connector module managing ETCD connections with reconnection logic |
| lib/runtime/src/transports/etcd/lease.rs | Updated lease keep-alive to use connector and handle stream reconnection |
| lib/runtime/src/transports/etcd.rs | Refactored client to use connector and provide async client access |
| lib/runtime/src/transports/etcd/lock.rs | Updated to use async client accessor |
| lib/runtime/src/storage/key_value_store/etcd.rs | Updated to use async client accessor |
Co-authored-by: Copilot <[email protected]> Signed-off-by: Jacky <[email protected]>
Overview:
When multiple ETCD server replicas are available, the client should automatically fail over to another ETCD server when the connection to the current server is lost, providing resilience to any single ETCD server failure.
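The failover behavior described above can be sketched as a rotation over the configured endpoints. This is only an illustration under stated assumptions: the `failover` helper and the `try_connect` callback are hypothetical, and the endpoint URLs in the usage note are made up. How the connector in this PR actually selects the next server is not shown here.

```rust
/// Hypothetical helper: after losing the connection to `endpoints[current]`,
/// try every endpoint once, starting with the one after `current` and ending
/// with `current` itself. Returns the index of the first endpoint for which
/// `try_connect` reports success, or `None` if all of them fail.
fn failover<F>(endpoints: &[&str], current: usize, mut try_connect: F) -> Option<usize>
where
    F: FnMut(&str) -> bool,
{
    let n = endpoints.len();
    // For an empty list the range 1..=0 is empty, so no modulo by zero occurs.
    (1..=n)
        .map(|step| (current + step) % n)
        .find(|&i| try_connect(endpoints[i]))
}
```

For example, with endpoints `["http://etcd-0:2379", "http://etcd-1:2379", "http://etcd-2:2379"]` and `current = 0`, the helper tries `etcd-1`, then `etcd-2`, and only then retries `etcd-0`. Trying the failed server last means a single-server outage is bypassed on the first attempt.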
Details:
Tests:
Where should the reviewer start?
Start with the test cases, then move on to the new `connector.rs`, which replaces the `etcd_client::Client` in `etcd.rs`, and then the other Rust pieces.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
WatchClient#1592