-
Notifications
You must be signed in to change notification settings - Fork 438
feat[Store]: Add standalone deployment implementation for Client #1084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @YiXR, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant architectural enhancement by decoupling the client-side logic into independent processes. This change allows for more flexible deployment scenarios and improves the resilience of the storage system. By leveraging shared memory for inter-process communication, the new design maintains high performance, specifically enabling zero-copy data transfers between the client components. The refactoring also sets the stage for future scalability and robustness improvements. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant architectural change to support independent deployment of the client, using a DummyClient and RealClient model with shared memory for zero-copy communication. While the overall structure is well-defined, the implementation has several critical issues that need to be addressed. My review identified a logic bug in error handling within the Python bindings, multiple key functions in DummyClient that are left as unimplemented TODOs, and incomplete high-availability logic. Most critically, there's a design flaw in how RealClient monitors and cleans up resources for DummyClient instances, which would fail in a multi-client scenario. I've detailed these points in the specific comments.
| void RealClient::dummy_client_monitor_func() { | ||
| std::unordered_map<UUID, std::chrono::steady_clock::time_point, | ||
| boost::hash<UUID>> | ||
| client_ttl; | ||
| while (dummy_client_monitor_running_) { | ||
| auto now = std::chrono::steady_clock::now(); | ||
|
|
||
| // Update the client ttl | ||
| PodUUID pod_client_id; | ||
| while (dummy_client_ping_queue_.pop(pod_client_id)) { | ||
| UUID client_id = {pod_client_id.first, pod_client_id.second}; | ||
| client_ttl[client_id] = | ||
| now + std::chrono::seconds(dummy_client_live_ttl_sec_); | ||
| } | ||
|
|
||
| // Find out expired clients | ||
| std::vector<UUID> expired_clients; | ||
| for (auto it = client_ttl.begin(); it != client_ttl.end();) { | ||
| if (it->second < now) { | ||
| LOG(INFO) << "client_id=" << it->first | ||
| << ", action=client_expired"; | ||
| expired_clients.push_back(it->first); | ||
| it = client_ttl.erase(it); | ||
| } else { | ||
| ++it; | ||
| } | ||
| } | ||
|
|
||
| // Update the client status to NEED_REMOUNT | ||
| if (!expired_clients.empty()) { | ||
| { | ||
| // Lock client_mutex and segment_mutex | ||
| std::unique_lock<std::shared_mutex> lock(dummy_client_mutex_); | ||
| // Unmap shm segments associated with this client | ||
| // TODO: need to optimize this with client_id | ||
| unmap_shm_internal(); | ||
| } // Release the mutex before long-running ClearInvalidHandles and | ||
| // avoid deadlocks | ||
| } | ||
|
|
||
| std::this_thread::sleep_for( | ||
| std::chrono::milliseconds(kDummyClientMonitorSleepMs)); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dummy_client_monitor_func has a design flaw that prevents it from correctly handling multiple DummyClient instances. When a DummyClient is detected as expired, unmap_shm_internal() is called without any client-specific identifier. This function cleans up the shared memory segment information stored in RealClient members like shm_buffer_, shm_name_, etc. Since these are single-instance members, this action will affect all connected DummyClients, not just the expired one. This can lead to memory corruption, crashes, or data loss if multiple dummy clients are used.
The implementation should be updated to manage shared memory resources on a per-client basis. The // TODO comment on line 1630 also indicates this is an incomplete implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be solved later. I'll introduce another patch for multi dummy support.
| auto put_result = store_->put_parts(key, values); | ||
| if (!put_result) { | ||
| return -static_cast<int>(put_result.error()); | ||
| return -static_cast<int>(ErrorCode::INTERNAL_ERROR); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a logic error in handling the return value from store_->put_parts. The put_parts function returns 0 on success and a negative error code on failure. The current code if (!put_result) evaluates to true on success (when put_result is 0), causing it to incorrectly return an error. This should be if (put_result != 0).
Additionally, on failure, it returns a generic ErrorCode::INTERNAL_ERROR instead of propagating the specific error code contained in put_result.
| auto put_result = store_->put_parts(key, values); | |
| if (!put_result) { | |
| return -static_cast<int>(put_result.error()); | |
| return -static_cast<int>(ErrorCode::INTERNAL_ERROR); | |
| } | |
| auto put_result = store_->put_parts(key, values); | |
| if (put_result != 0) { | |
| return put_result; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| std::vector<std::shared_ptr<BufferHandle>> DummyClient::batch_get_buffer( | ||
| const std::vector<std::string>& keys) { | ||
| // TODO: implement this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several functions in DummyClient are not implemented and are marked with // TODO: implement this function. This includes batch_get_buffer, get_into, put_from, put_from_with_metadata, batch_put_from_multi_buffers, and batch_get_into_multi_buffers. Leaving these as unimplemented stubs in a feature branch can lead to unexpected behavior or runtime errors if they are called. These should be implemented or explicitly marked as unsupported (e.g., by throwing an exception).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These rpc will be implemented in later pr.
| if (ping_fail_count >= max_ping_fail_count) { | ||
| LOG(ERROR) << "Ping failed " << ping_fail_count | ||
| << " times, reconnecting..."; | ||
| // TODO: Need to realize reconnect logic here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ping_thread_main function contains a // TODO: Need to realize reconnect logic here. This indicates that the high-availability/reconnection logic for the DummyClient is incomplete. Without this logic, the client won't be able to recover from connection failures to the RealClient, which undermines the resilience goals of this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be solved in later pr.
| // 新增一个函数用于获取buffer信息(地址和大小) | ||
| std::tuple<uint64_t, size_t> get_buffer_info(const std::string &key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is in Chinese, which is inconsistent with the English used in the rest of the codebase. To improve maintainability and ensure all developers can understand the code, it's best to use a single language for comments.
// Get buffer information (address and size) for a key.
std::tuple<uint64_t, size_t> get_buffer_info(const std::string &key);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
891cb46 to
761459d
Compare
|
nice work! Perhaps you could add some docs to introduce the standalone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request implements an independent deployment architecture for the Mooncake Store Client, splitting it into RealClient (independent process) and DummyClient (lightweight proxy). The implementation provides full backward compatibility while enabling detached deployment mode with zero-copy performance through shared memory.
Key Changes:
- Split client architecture into
RealClient(server) andDummyClient(proxy) with shared memory-based zero-copy communication - Renamed
client.cpp/client.htoclient_service.cpp/client_service.handpybind_clienttoreal_client - Added
pyclient.hbase class for interface compatibility anddummy_clientimplementation - Introduced
mooncake_clientbinary for standalone RealClient deployment
Reviewed Changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-store/src/real_client.cpp | Implements RealClient with RPC service registration, shared memory management, and virtual address translation |
| mooncake-store/src/dummy_client.cpp | Implements DummyClient RPC proxy with shared memory creation and ping-based health checking |
| mooncake-store/src/real_client_main.cpp | Main entry point for standalone RealClient service |
| mooncake-store/include/real_client.h | RealClient class definition with shared memory and dummy client monitoring |
| mooncake-store/include/dummy_client.h | DummyClient class definition with RPC client infrastructure |
| mooncake-store/include/pyclient.h | Base class defining common interface for both client types |
| mooncake-store/include/client_service.h | Renamed from client.h, defines core Client service class |
| mooncake-store/src/client_buffer.cpp | Added external buffer support for shared memory allocation |
| mooncake-store/include/client_buffer.hpp | Extended to support both heap and shared memory allocation |
| mooncake-store/include/utils.h | Added utility functions for string-to-byte-size conversion and Python return value conversion |
| mooncake-integration/store/store_py.cpp | Updated Python bindings to support both RealClient and DummyClient modes |
| mooncake-store/src/CMakeLists.txt | Added mooncake_client binary target and updated source file references |
| mooncake-store/tests/*.cpp | Updated test files to use renamed headers and RealClient class |
Comments suppressed due to low confidence (7)
mooncake-store/src/real_client.cpp:800
- The
get_dummy_buffer_internalfunction returns a buffer handle with adjusted address (buffer_base - shm_addr_offset_). However, theBufferHandleobject created byget_bufferwill be immediately destroyed after this function returns since it's not stored anywhere. This means the buffer memory will be deallocated while the returned address is still being used by the dummy client, leading to a use-after-free bug. Consider storing the BufferHandle or managing the buffer lifetime differently.
mooncake-store/src/real_client.cpp:660 - In
unmap_shm_internal,client_->unregisterLocalMemoryis called but the return value is not checked. If this operation fails, the function still proceeds to unmap the memory and clear all state, which could lead to inconsistent state. Consider checking the return value and handling errors appropriately.
mooncake-store/src/real_client.cpp:1652 - The
dummy_client_monitor_thread_is started but never properly joined or stopped in the destructor or teardown. WhenRealClientis destroyed, the thread may continue running and access member variables that have been destroyed, leading to undefined behavior. Consider adding proper thread cleanup in the destructor ortearDownAll_internal.
mooncake-store/src/real_client.cpp:603 - The
map_shm_internalfunction returnstl::expected<void, ErrorCode>but throws exceptions on error instead of returning error codes. This is inconsistent with the expected return type and breaks the error handling pattern. Consider replacingthrow std::runtime_error(...)withreturn tl::unexpected(ErrorCode::...)to properly propagate errors.
mooncake-store/src/real_client.cpp:306 - In the
tearDownAll_internalmethod,shm_name_,shm_addr_offset_, andshm_size_are cleared butshm_buffer_is not unmapped or cleared. This can lead to a resource leak. The method should callmunmap(shm_buffer_, shm_size_)and setshm_buffer_ = nullptrbefore clearing the related fields, or should callunmap_shm_internal()which already handles this cleanup.
mooncake-store/src/real_client.cpp:716 - In
unregister_shm_buffer_internal, after erasing the iteratoritat line 708, the function returnsit->secondat line 716. This is a use-after-erase bug that will result in undefined behavior. The value should be saved before erasing the iterator.
mooncake-store/include/real_client.h:408 - The member variable
shm_buffer_at line 408 is not initialized. If used beforemap_shm_internalis called, it will contain garbage values. Consider initializing it tonullptrto avoid undefined behavior.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mooncake-store/src/dummy_client.cpp
Outdated
| ping_running_ = true; | ||
| ping_thread_ = std::thread([this]() mutable { this->ping_thread_main(); }); |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ping thread is started before the shared memory setup completes. If shared memory setup fails (lines 139-183), the thread continues running and accessing client_id_ which may cause issues. Consider starting the ping thread after successful shared memory setup, or stopping it in the error handling paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| } | ||
| ResourceTracker::getInstance().registerInstance( | ||
| std::dynamic_pointer_cast<PyClient>(self.store_)); |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The ResourceTracker::getInstance().registerInstance() is called with a dynamically cast pointer, but for DummyClient this will succeed. However, ResourceTracker is designed to call cleanup on PyClient instances which may cause issues if the cleanup logic differs. Consider verifying that ResourceTracker properly handles both RealClient and DummyClient lifecycle management.
| } | |
| ResourceTracker::getInstance().registerInstance( | |
| std::dynamic_pointer_cast<PyClient>(self.store_)); | |
| ResourceTracker::getInstance().registerInstance( | |
| std::dynamic_pointer_cast<PyClient>(self.store_)); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It handles,just ignore this comment。
mooncake-store/src/dummy_client.cpp
Outdated
|
|
||
| for (const auto& result : internal_results) { | ||
| if (result.has_value()) { | ||
| results.push_back(result.value()); |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In batchIsExist, when a result has a value (line 327), it pushes result.value() directly. However, result.value() is a bool, but the return type should be 1 or 0 (or negative for errors) based on the pattern in isExist. This should be result.value() ? 1 : 0 to match the expected return format.
| results.push_back(result.value()); | |
| results.push_back(result.value() ? 1 : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| auto put_result = store_->put_parts(key, values); | ||
| if (!put_result) { | ||
| return -static_cast<int>(put_result.error()); | ||
| return -static_cast<int>(ErrorCode::INTERNAL_ERROR); | ||
| } |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling at line 353-355 is inconsistent with the previous implementation. The put_parts method returns an int (0 on success, negative on error), not a tl::expected object, so the check if (!put_result) is incorrect. The code should check if (put_result != 0) and return put_result directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Thx for your comment. I've just added some introduction in doc. |
e38dbf1 to
0eb7ec3
Compare
ykwd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I’ve reviewed part of it, and these are my comments. I’ll finish reviewing the remaining sections as soon as possible.
| The `Client` can be used in two modes: | ||
| 1. **Embedded mode**: Runs in the same process as the LLM inference program (e.g., a vLLM instance), by being imported as a shared library. | ||
| 2. **Standalone mode**: Runs as an independent process. | ||
| 2. **Standalone mode**: Runs as an independent process. In this mode, the `Client` is separated into two parts: a **dummy** `Client` and a **real** `Client`: The **real** `Client` is a full-featured implementation that runs as a standalone process and directly communicates with other Mooncake Store components. It handles all RPC communications, memory management, and data transfer operations. The **real** `Client` is typically deployed on nodes that contribute memory to the distributed cache pool; The **dummy** `Client` is a lightweight wrapper that forwards all operations to a local **real** `Client` via RPC calls, which is designed for scenarios where the client needs to be embedded in the same process as the application (such as vLLM), but the actual Mooncake Store operations should be handled by a standalone process. The **dummy** `Client` and the **real** `Client` communicate via RPC calls and shared memory to make sure that Zero-copy transfers are still possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also sync the update to docs/source/design/mooncake-store.md?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
mooncake-store/src/client_buffer.cpp
Outdated
| // Free the aligned allocated memory | ||
| if (buffer_) { | ||
| // Free the aligned allocated memory or unmap shared memory | ||
| if (!is_external_memory_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also need to check buffer_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
f2af7c0 to
f73b712
Compare
| const std::vector<std::string> &keys) { | ||
| const auto kNullString = pybind11::bytes("\\0", 0); | ||
| if (!store_ || !store_->client_) { | ||
| if (!store_ || (!use_dummy_client_ && !store_->client_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a bug if use_dummy_client_ but client is not initialized,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dummy does not have client_ and it will never use it.
| LOG(ERROR) << "Shared memory is not registered"; | ||
| return -1; | ||
| } | ||
| if (registered_size_ + local_buffer_size_ + size > shm_size_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs a annotation. the shm_size is only used for sglang and local buffer, not include global segment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I‘ve changed the "global_buffer" into "mem_pool" in dummy setup to avoid misunderstanding.
mooncake-store/src/dummy_client.cpp
Outdated
| easylog::set_min_severity(easylog::Severity::WARN); | ||
| // Initialize client pools | ||
| coro_io::client_pool<coro_rpc::coro_rpc_client>::pool_config pool_conf{}; | ||
| const char* value = std::getenv("MC_RPC_PROTOCOL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there needs a error handler when env is not present. And now only support rdma protocol with coro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
mooncake-store/src/dummy_client.cpp
Outdated
|
|
||
| shm_name_ = "/dummy_client_shm_" + std::to_string(client_id_.first) + "_" + | ||
| std::to_string(client_id_.second); | ||
| shm_size_ = local_buffer_size + global_segment_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value contains both local_buffer_size and global_segment_size. seems conflict with the meaning in alloc_from_mem_pool? And the sglang L2 cache is dismissed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I‘ve changed the "global_buffer" into "mem_pool" in dummy setup to avoid misunderstanding.
|
|
||
| // Inform the RealClient about the shared memory | ||
| ret = register_shm(shm_name_, reinterpret_cast<uint64_t>(shm_base_addr_), | ||
| shm_size_, local_buffer_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when recovering from a crash, the setup is also called. And the new shmem will be the same as the last time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new method of remapping after a crash needs to be realized. In this new method, shm should be mapped from the fd not the name.
| } | ||
|
|
||
| // Dummy only register buffer within the shared memory region | ||
| int DummyClient::register_buffer(void* buffer, size_t size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
register_buffer and alloc_from_mem_pool can be combined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alloc_from_mem_pool is a alloc method, if sglang/vllm alloc success, it will use register_buffer to notify.
Signed-off-by: Xingrui Yi <[email protected]>
Signed-off-by: Xingrui Yi <[email protected]>
Signed-off-by: Xingrui Yi <[email protected]>
Signed-off-by: Xingrui Yi <[email protected]>
ykwd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks good to me. Just a few comments
| py::arg("protocol"), py::arg("rdma_devices"), | ||
| py::arg("master_server_addr"), py::arg("engine") = py::none()) | ||
| py::arg("master_server_addr"), py::arg("engine") = py::none(), | ||
| py::arg("use_dummy_client") = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better for future extensibility (e.g., supporting new architectures) if we change this parameter to a string-type client_type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add new setup func for dummy.
| } | ||
|
|
||
| pybind11::list batch_get_tensor(const std::vector<std::string> &keys) { | ||
| if (!store_ || !store_->client_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For currently unsupported APIs, it would be better to explicitly return an error. Otherwise, the invocation will fail and it might not be easy to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just added more log for that.
|
|
||
| pybind11::bytes get(const std::string &key) { | ||
| if (!store_ || !store_->client_) { | ||
| if (!store_ || (!use_dummy_client_ && !store_->client_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to abstract this if condition into a separate function and include a comment inside the function explaining why the condition is checked this way? As it stands, the reasoning behind this condition isn’t very intuitive, and some interfaces don’t include the use_dummy_client_ part of the condition. After some time, future maintainers may not understand what this logic is meant for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Signed-off-by: Xingrui Yi <[email protected]>
Description
Thanks @wanyue-wy @stmatengss @XucSh for helping with architectural design.
Type of Change
Architecture Overview
RealClientandDummyClient, ensuring that performance is on par with the non-detached mode.RealClientcontinues to run and provide Storage for the Master even after theDummyClientexits.Code Modification Overview
File Creation/Change Description
client.cpp&client.htoclient_service.cpp&client_service.h.real_client_main.cppas the main process for theRealClient.dummy_client.cpp&dummy_client.hfor theDummyClient.pybind_client.cpp&pybind_client.htoreal_client.cpp&real_client.h. All instances ofclass PyClienthave been renamed toclass RealClient.pyclient.h, which defines a new base classPyClient.PyClientmaintains the interface to ensure backward compatibility. BothRealClient&DummyClientinherit from it and implement the corresponding functions.Key Implementation/Change Details
mooncake-integration/store/store_py.cpp
storeinstore_pyis defined as aPyClientbase class pointer. If the client detached mode is selected, thestoreis instantiated as aDummyClient; otherwise, it's instantiated as aRealClient.mooncake-store/src/dummy_client.cpp
DummyClientImplementation: Implements theDummyClient, which is primarily responsible for connecting to theRealClientand initiating RPC calls. Its high-level interface inherits fromPyClientto remain consistent with the old code.DummyClientis responsible for creating the shared memory. During its setup phase, it creates the shared memory and notifies theRealClienttommapit and register it with the transfer engine.a. The shared memory created by the
DummyClientprovides memory space for the L2 cache (likemem_pool_hostin SGLang) and the client's local buffer (held byclient_buffer_allocator_).b.
shm_size = local_buffer_size + mem_pool_host_size. The first half of the shared memory is used for the L2 cache, and the second half is for the client's local buffer.c. During the
DummyClientsetup phase, theRealClientis notified to register the entire shared memory block with the transfer engine at once. Therefore, there is no need to later callregisterLocalMemoryseparately for the L2 cache.d. After notifying the
RealClienttommapthe shared memory, theDummyClientimmediately callsshm_unlink. This removes the shared memory file, preventing any other external processes from mapping it. Since theRealClienthas already completed themmap, it can continue to access this memory. Due to the kernel's automatic reclamation mechanism for shared memory, the memory will be automatically freed when its reference count drops to zero (i.e., when bothDummyClientandRealClientunmap it).DummyClienthas a Ping thread to maintain its connection with theRealClient.mooncake-store/src/real_client.cpp
RealClientreuses all the implementation from the originalpybind_client(including creating theglobal_segment, holding the transfer engine, communicating with the master, etc.).RealClientregisters its internal functions with an RPC service so they can be called by theDummyClient.a. When notified by the
DummyClientto map the shared memory, theRealClientwillmmapit and record both theDummyClient's side VA (Virtual Address) and its ownRealClientside VA. This facilitates virtual address translation in subsequent zero-copy interface calls.b. The
RealClientuses the second half of the shared memory to create itsclient_buffer_allocator_.RealClient, when handling these functions, must use the previously recorded VAs to perform virtual address translation. This ensures that the virtual addresses it uses are valid in its own context, and the virtual addresses it returns to theDummyClientare valid in theDummyClient's context.RealClientuses a Ping mechanism to continuously monitor the status of theDummyClient. If it detects that theDummyClientis unavailable, it will unmap the corresponding shared memory to ensure timely memory cleanup.mooncake-store/src/client_buffer.cpp
ClientBufferAllocatornow supports being constructed with an external buffer. This change is made to accommodate the requirement of creating aclient_buffer_allocator_using the pre-allocated shared memory.How Has This Been Tested?
To support detached Client mode, the patch described above must be applied to Sglang. (Without the patch, Sglang can still start normally in non-detached mode).
Launching in Detached Mode
a. Start the master:
b. Start the
RealClient(listens on port 50052 by default):c. Start Sglang. Set
MOONCAKE_MASTERto theRealClient's address andMOONCAKE_STANDALONE_STORAGE=1to enable standalone client deployment.MOONCAKE_MASTER=127.0.0.1:50052 \ MOONCAKE_STANDALONE_STORAGE=1 \ python3 -m sglang.launch_server \ --enable-hierarchical-cache \ --hicache-storage-backend mooncake \ --model-path Qwen/Qwen3-0.6BChecklist
TODO List
Split Client into DummyClient and RealClient.
Demo for SGLang.
Add CI for DummyClient.
Enable a single
RealClientto serve multipleDummyClients.Add support for RPC metric collection in
RealClientandDummyClient.Implement a reconnection mechanism for
RealClientandDummyClient.Expand
DummyClientinterface support.a. Currently supported:
putput_batchput_partsremoveremoveByRegexremoveAllisExistbatchIsExistgetSizeget_buffer(replaced byget_buffer_info)batch_put_frombatch_get_intoUnsupported:
batch_get_bufferget_intoput_fromput_from_with_metadatabatch_put_from_multi_buffersbatch_get_into_multi_buffersget_tensorput_tensorbatch_put_tensorbatch_get_tensor