Skip to content

Conversation

@YiXR
Copy link

@YiXR YiXR commented Nov 20, 2025

Description

Thanks @wanyue-wy @stmatengss @XucSh for helping with architectural design.

Type of Change

Architecture Overview

  1. Split Client into DummyClient and RealClient, which RealClient is an independent process.
  2. Full Compatibility: Fully compatible with the old arch.
  3. Flexible Deployment: After applying the detached deployment patch to Sglang(tmp patch:https://github.com/YiXR/sglang/tree/xinyi-client, the Client detached deployment mode can be enabled via configuration. The non-detached mode can still be chosen.
  4. Zero-Copy Performance: Introduces shared memory to support zero-copy operations between RealClient and DummyClient, ensuring that performance is on par with the non-detached mode.
  5. Resilience: The RealClient continues to run and provide Storage for the Master even after the DummyClient exits.

Code Modification Overview

File Creation/Change Description

  1. Renamed client.cpp & client.h to client_service.cpp & client_service.h.
  2. Added real_client_main.cpp as the main process for the RealClient.
  3. Added dummy_client.cpp & dummy_client.h for the DummyClient.
  4. Renamed pybind_client.cpp & pybind_client.h to real_client.cpp & real_client.h. All instances of class PyClient have been renamed to class RealClient.
  5. Added pyclient.h, which defines a new base class PyClient. PyClient maintains the interface to ensure backward compatibility. Both RealClient & DummyClient inherit from it and implement the corresponding functions.

Key Implementation/Change Details

mooncake-integration/store/store_py.cpp

  1. This is the entry point for the Python compatibility layer, where the decision to use detached deployment is made. The store in store_py is defined as a PyClient base class pointer. If the client detached mode is selected, the store is instantiated as a DummyClient; otherwise, it's instantiated as a RealClient.

mooncake-store/src/dummy_client.cpp

  1. DummyClient Implementation: Implements the DummyClient, which is primarily responsible for connecting to the RealClient and initiating RPC calls. Its high-level interface inherits from PyClient to remain consistent with the old code.
  2. Shared Memory Creation: The DummyClient is responsible for creating the shared memory. During its setup phase, it creates the shared memory and notifies the RealClient to mmap it and register it with the transfer engine.
  3. Shared Memory Details:
    a. The shared memory created by the DummyClient provides memory space for the L2 cache (like mem_pool_host in SGLang) and the client's local buffer (held by client_buffer_allocator_).
    b. shm_size = local_buffer_size + mem_pool_host_size. The first half of the shared memory is used for the L2 cache, and the second half is for the client's local buffer.
    c. During the DummyClient setup phase, the RealClient is notified to register the entire shared memory block with the transfer engine at once. Therefore, there is no need to later call registerLocalMemory separately for the L2 cache.
    d. After notifying the RealClient to mmap the shared memory, the DummyClient immediately calls shm_unlink. This removes the shared memory file, preventing any other external processes from mapping it. Since the RealClient has already completed the mmap, it can continue to access this memory. Due to the kernel's automatic reclamation mechanism for shared memory, the memory will be automatically freed when its reference count drops to zero (i.e., when both DummyClient and RealClient unmap it).
  4. Keep-Alive: The DummyClient has a Ping thread to maintain its connection with the RealClient.

mooncake-store/src/real_client.cpp

  1. Code Reuse: The RealClient reuses all the implementation from the original pybind_client (including creating the global_segment, holding the transfer engine, communicating with the master, etc.).
  2. RPC Service: The RealClient registers its internal functions with an RPC service so they can be called by the DummyClient.
  3. Shared Memory Details:
    a. When notified by the DummyClient to map the shared memory, the RealClient will mmap it and record both the DummyClient's side VA (Virtual Address) and its own RealClient side VA. This facilitates virtual address translation in subsequent zero-copy interface calls.
    b. The RealClient uses the second half of the shared memory to create its client_buffer_allocator_.
  4. Virtual Address Translation: For RPCs involving zero-copy or using the client local buffer for transfers, virtual addresses are passed as parameters/return values. The RealClient, when handling these functions, must use the previously recorded VAs to perform virtual address translation. This ensures that the virtual addresses it uses are valid in its own context, and the virtual addresses it returns to the DummyClient are valid in the DummyClient's context.
  5. Health Check: The RealClient uses a Ping mechanism to continuously monitor the status of the DummyClient. If it detects that the DummyClient is unavailable, it will unmap the corresponding shared memory to ensure timely memory cleanup.

mooncake-store/src/client_buffer.cpp

  1. External Buffer Support: ClientBufferAllocator now supports being constructed with an external buffer. This change is made to accommodate the requirement of creating a client_buffer_allocator_ using the pre-allocated shared memory.
  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

  1. To support detached Client mode, the patch described above must be applied to Sglang. (Without the patch, Sglang can still start normally in non-detached mode).

  2. Launching in Detached Mode
    a. Start the master:

    mooncake_master --enable_http_metadata_server=true --eviction_high_watermark_ratio=0.95

    b. Start the RealClient (listens on port 50052 by default):

    mooncake_client --global_segment_size=4GB

    c. Start Sglang. Set MOONCAKE_MASTER to the RealClient's address and MOONCAKE_STANDALONE_STORAGE=1 to enable standalone client deployment.

    MOONCAKE_MASTER=127.0.0.1:50052 \
    MOONCAKE_STANDALONE_STORAGE=1 \
    python3 -m sglang.launch_server \
        --enable-hierarchical-cache \
        --hicache-storage-backend mooncake \
        --model-path Qwen/Qwen3-0.6B

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

TODO List

  • Split Client into DummyClient and RealClient.

  • Demo for SGLang.

  • Add CI for DummyClient.

  • Enable a single RealClient to serve multiple DummyClients.

  • Add support for RPC metric collection in RealClient and DummyClient.

  • Implement a reconnection mechanism for RealClient and DummyClient.

  • Expand DummyClient interface support.
    a. Currently supported:

    • put
    • put_batch
    • put_parts
    • remove
    • removeByRegex
    • removeAll
    • isExist
    • batchIsExist
    • getSize
    • get_buffer (replaced by get_buffer_info)
    • batch_put_from
    • batch_get_into
      Unsupported:
    • batch_get_buffer
    • get_into
    • put_from
    • put_from_with_metadata
    • batch_put_from_multi_buffers
    • batch_get_into_multi_buffers
    • get_tensor
    • put_tensor
    • batch_put_tensor
    • batch_get_tensor

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @YiXR, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural enhancement by decoupling the client-side logic into independent processes. This change allows for more flexible deployment scenarios and improves the resilience of the storage system. By leveraging shared memory for inter-process communication, the new design maintains high performance, specifically enabling zero-copy data transfers between the client components. The refactoring also sets the stage for future scalability and robustness improvements.

Highlights

  • Independent Client Deployment: Introduced a new architecture allowing the client to be deployed independently from the main application, enabling greater flexibility and resilience. This can be configured to use a 'detached deployment mode'.
  • Zero-Copy Performance with Shared Memory: Implemented shared memory mechanisms to facilitate zero-copy operations between the new 'RealClient' and 'DummyClient' components, ensuring that performance remains on par with non-detached modes.
  • Enhanced Resilience: The 'RealClient' is designed to continue running and providing storage services to the Master even if the 'DummyClient' process exits, improving overall system stability.
  • Client Refactoring and New Components: The existing client implementation has been refactored into a base class PyClient, with two concrete implementations: RealClient (the core logic) and DummyClient (a lightweight proxy). Key files like client.cpp and pybind_client.cpp were renamed, and new files such as real_client_main.cpp and dummy_client.cpp were added.
  • Dynamic Client Instantiation: The Python compatibility layer (store_py.cpp) now dynamically instantiates either a DummyClient or RealClient based on configuration, providing a seamless transition for existing Python interfaces.
  • Shared Memory Management in RealClient: The RealClient now includes RPC services to manage shared memory regions, handling mapping, unmapping, and registration of buffers within these shared segments, along with virtual address translation for zero-copy transfers.
  • DummyClient Health Monitoring: The RealClient incorporates a ping mechanism to continuously monitor the status of its associated DummyClient, ensuring timely cleanup of shared memory resources if the DummyClient becomes unresponsive.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant architectural change to support independent deployment of the client, using a DummyClient and RealClient model with shared memory for zero-copy communication. While the overall structure is well-defined, the implementation has several critical issues that need to be addressed. My review identified a logic bug in error handling within the Python bindings, multiple key functions in DummyClient that are left as unimplemented TODOs, and incomplete high-availability logic. Most critically, there's a design flaw in how RealClient monitors and cleans up resources for DummyClient instances, which would fail in a multi-client scenario. I've detailed these points in the specific comments.

Comment on lines +1596 to +1637
void RealClient::dummy_client_monitor_func() {
std::unordered_map<UUID, std::chrono::steady_clock::time_point,
boost::hash<UUID>>
client_ttl;
while (dummy_client_monitor_running_) {
auto now = std::chrono::steady_clock::now();

// Update the client ttl
PodUUID pod_client_id;
while (dummy_client_ping_queue_.pop(pod_client_id)) {
UUID client_id = {pod_client_id.first, pod_client_id.second};
client_ttl[client_id] =
now + std::chrono::seconds(dummy_client_live_ttl_sec_);
}

// Find out expired clients
std::vector<UUID> expired_clients;
for (auto it = client_ttl.begin(); it != client_ttl.end();) {
if (it->second < now) {
LOG(INFO) << "client_id=" << it->first
<< ", action=client_expired";
expired_clients.push_back(it->first);
it = client_ttl.erase(it);
} else {
++it;
}
}

// Update the client status to NEED_REMOUNT
if (!expired_clients.empty()) {
{
// Lock client_mutex and segment_mutex
std::unique_lock<std::shared_mutex> lock(dummy_client_mutex_);
// Unmap shm segments associated with this client
// TODO: need to optimize this with client_id
unmap_shm_internal();
} // Release the mutex before long-running ClearInvalidHandles and
// avoid deadlocks
}

std::this_thread::sleep_for(
std::chrono::milliseconds(kDummyClientMonitorSleepMs));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The dummy_client_monitor_func has a design flaw that prevents it from correctly handling multiple DummyClient instances. When a DummyClient is detected as expired, unmap_shm_internal() is called without any client-specific identifier. This function cleans up the shared memory segment information stored in RealClient members like shm_buffer_, shm_name_, etc. Since these are single-instance members, this action will affect all connected DummyClients, not just the expired one. This can lead to memory corruption, crashes, or data loss if multiple dummy clients are used.

The implementation should be updated to manage shared memory resources on a per-client basis. The // TODO comment on line 1630 also indicates this is an incomplete implementation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be solved later. I'll introduce another patch for multi dummy support.

Comment on lines 353 to 356
auto put_result = store_->put_parts(key, values);
if (!put_result) {
return -static_cast<int>(put_result.error());
return -static_cast<int>(ErrorCode::INTERNAL_ERROR);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a logic error in handling the return value from store_->put_parts. The put_parts function returns 0 on success and a negative error code on failure. The current code if (!put_result) evaluates to true on success (when put_result is 0), causing it to incorrectly return an error. This should be if (put_result != 0).

Additionally, on failure, it returns a generic ErrorCode::INTERNAL_ERROR instead of propagating the specific error code contained in put_result.

Suggested change
auto put_result = store_->put_parts(key, values);
if (!put_result) {
return -static_cast<int>(put_result.error());
return -static_cast<int>(ErrorCode::INTERNAL_ERROR);
}
auto put_result = store_->put_parts(key, values);
if (put_result != 0) {
return put_result;
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


std::vector<std::shared_ptr<BufferHandle>> DummyClient::batch_get_buffer(
const std::vector<std::string>& keys) {
// TODO: implement this function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Several functions in DummyClient are not implemented and are marked with // TODO: implement this function. This includes batch_get_buffer, get_into, put_from, put_from_with_metadata, batch_put_from_multi_buffers, and batch_get_into_multi_buffers. Leaving these as unimplemented stubs in a feature branch can lead to unexpected behavior or runtime errors if they are called. These should be implemented or explicitly marked as unsupported (e.g., by throwing an exception).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These rpc will be implemented in later pr.

if (ping_fail_count >= max_ping_fail_count) {
LOG(ERROR) << "Ping failed " << ping_fail_count
<< " times, reconnecting...";
// TODO: Need to realize reconnect logic here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The ping_thread_main function contains a // TODO: Need to realize reconnect logic here. This indicates that the high-availability/reconnection logic for the DummyClient is incomplete. Without this logic, the client won't be able to recover from connection failures to the RealClient, which undermines the resilience goals of this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be solved in later pr.

Comment on lines 72 to 73
// 新增一个函数用于获取buffer信息(地址和大小)
std::tuple<uint64_t, size_t> get_buffer_info(const std::string &key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is in Chinese, which is inconsistent with the English used in the rest of the codebase. To improve maintainability and ensure all developers can understand the code, it's best to use a single language for comments.

    // Get buffer information (address and size) for a key.
    std::tuple<uint64_t, size_t> get_buffer_info(const std::string &key);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@YiXR YiXR force-pushed the split-client branch 2 times, most recently from 891cb46 to 761459d Compare November 20, 2025 11:29
@yejj710
Copy link
Contributor

yejj710 commented Nov 20, 2025

nice work! Perhaps you could add some docs to introduce the standalone store_client deployment mode. This would make it easier for users to understand how to use it.

@stmatengss stmatengss requested a review from Copilot November 20, 2025 15:52
Copilot finished reviewing on behalf of stmatengss November 20, 2025 15:55
Copilot finished reviewing on behalf of stmatengss November 20, 2025 15:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request implements an independent deployment architecture for the Mooncake Store Client, splitting it into RealClient (independent process) and DummyClient (lightweight proxy). The implementation provides full backward compatibility while enabling detached deployment mode with zero-copy performance through shared memory.

Key Changes:

  • Split client architecture into RealClient (server) and DummyClient (proxy) with shared memory-based zero-copy communication
  • Renamed client.cpp/client.h to client_service.cpp/client_service.h and pybind_client to real_client
  • Added pyclient.h base class for interface compatibility and dummy_client implementation
  • Introduced mooncake_client binary for standalone RealClient deployment

Reviewed Changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mooncake-store/src/real_client.cpp Implements RealClient with RPC service registration, shared memory management, and virtual address translation
mooncake-store/src/dummy_client.cpp Implements DummyClient RPC proxy with shared memory creation and ping-based health checking
mooncake-store/src/real_client_main.cpp Main entry point for standalone RealClient service
mooncake-store/include/real_client.h RealClient class definition with shared memory and dummy client monitoring
mooncake-store/include/dummy_client.h DummyClient class definition with RPC client infrastructure
mooncake-store/include/pyclient.h Base class defining common interface for both client types
mooncake-store/include/client_service.h Renamed from client.h, defines core Client service class
mooncake-store/src/client_buffer.cpp Added external buffer support for shared memory allocation
mooncake-store/include/client_buffer.hpp Extended to support both heap and shared memory allocation
mooncake-store/include/utils.h Added utility functions for string-to-byte-size conversion and Python return value conversion
mooncake-integration/store/store_py.cpp Updated Python bindings to support both RealClient and DummyClient modes
mooncake-store/src/CMakeLists.txt Added mooncake_client binary target and updated source file references
mooncake-store/tests/*.cpp Updated test files to use renamed headers and RealClient class
Comments suppressed due to low confidence (7)

mooncake-store/src/real_client.cpp:800

  • The get_dummy_buffer_internal function returns a buffer handle with adjusted address (buffer_base - shm_addr_offset_). However, the BufferHandle object created by get_buffer will be immediately destroyed after this function returns since it's not stored anywhere. This means the buffer memory will be deallocated while the returned address is still being used by the dummy client, leading to a use-after-free bug. Consider storing the BufferHandle or managing the buffer lifetime differently.
    mooncake-store/src/real_client.cpp:660
  • In unmap_shm_internal, client_->unregisterLocalMemory is called but the return value is not checked. If this operation fails, the function still proceeds to unmap the memory and clear all state, which could lead to inconsistent state. Consider checking the return value and handling errors appropriately.
    mooncake-store/src/real_client.cpp:1652
  • The dummy_client_monitor_thread_ is started but never properly joined or stopped in the destructor or teardown. When RealClient is destroyed, the thread may continue running and access member variables that have been destroyed, leading to undefined behavior. Consider adding proper thread cleanup in the destructor or tearDownAll_internal.
    mooncake-store/src/real_client.cpp:603
  • The map_shm_internal function returns tl::expected<void, ErrorCode> but throws exceptions on error instead of returning error codes. This is inconsistent with the expected return type and breaks the error handling pattern. Consider replacing throw std::runtime_error(...) with return tl::unexpected(ErrorCode::...) to properly propagate errors.
    mooncake-store/src/real_client.cpp:306
  • In the tearDownAll_internal method, shm_name_, shm_addr_offset_, and shm_size_ are cleared but shm_buffer_ is not unmapped or cleared. This can lead to a resource leak. The method should call munmap(shm_buffer_, shm_size_) and set shm_buffer_ = nullptr before clearing the related fields, or should call unmap_shm_internal() which already handles this cleanup.
    mooncake-store/src/real_client.cpp:716
  • In unregister_shm_buffer_internal, after erasing the iterator it at line 708, the function returns it->second at line 716. This is a use-after-erase bug that will result in undefined behavior. The value should be saved before erasing the iterator.
    mooncake-store/include/real_client.h:408
  • The member variable shm_buffer_ at line 408 is not initialized. If used before map_shm_internal is called, it will contain garbage values. Consider initializing it to nullptr to avoid undefined behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 132 to 133
ping_running_ = true;
ping_thread_ = std::thread([this]() mutable { this->ping_thread_main(); });
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ping thread is started before the shared memory setup completes. If shared memory setup fails (lines 139-183), the thread continues running and accessing client_id_ which may cause issues. Consider starting the ping thread after successful shared memory setup, or stopping it in the error handling paths.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 575 to 587
}
ResourceTracker::getInstance().registerInstance(
std::dynamic_pointer_cast<PyClient>(self.store_));
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The ResourceTracker::getInstance().registerInstance() is called with a dynamically cast pointer, but for DummyClient this will succeed. However, ResourceTracker is designed to call cleanup on PyClient instances which may cause issues if the cleanup logic differs. Consider verifying that ResourceTracker properly handles both RealClient and DummyClient lifecycle management.

Suggested change
}
ResourceTracker::getInstance().registerInstance(
std::dynamic_pointer_cast<PyClient>(self.store_));
ResourceTracker::getInstance().registerInstance(
std::dynamic_pointer_cast<PyClient>(self.store_));
}

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It handles,just ignore this comment。


for (const auto& result : internal_results) {
if (result.has_value()) {
results.push_back(result.value());
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In batchIsExist, when a result has a value (line 327), it pushes result.value() directly. However, result.value() is a bool, but the return type should be 1 or 0 (or negative for errors) based on the pattern in isExist. This should be result.value() ? 1 : 0 to match the expected return format.

Suggested change
results.push_back(result.value());
results.push_back(result.value() ? 1 : 0);

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 353 to 356
auto put_result = store_->put_parts(key, values);
if (!put_result) {
return -static_cast<int>(put_result.error());
return -static_cast<int>(ErrorCode::INTERNAL_ERROR);
}
Copy link

Copilot AI Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling at line 353-355 is inconsistent with the previous implementation. The put_parts method returns an int (0 on success, negative on error), not a tl::expected object, so the check if (!put_result) is incorrect. The code should check if (put_result != 0) and return put_result directly.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@YiXR YiXR changed the title feat[Store]: Add independent deployment implementation for Client feat[Store]: Add standalone deployment implementation for Client Nov 21, 2025
@YiXR
Copy link
Author

YiXR commented Nov 21, 2025

nice work! Perhaps you could add some docs to introduce the standalone store_client deployment mode. This would make it easier for users to understand how to use it.

Thx for your comment. I've just added some introduction in doc.

@YiXR YiXR force-pushed the split-client branch 5 times, most recently from e38dbf1 to 0eb7ec3 Compare November 21, 2025 09:13
Copy link
Collaborator

@ykwd ykwd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I’ve reviewed part of it, and these are my comments. I’ll finish reviewing the remaining sections as soon as possible.

The `Client` can be used in two modes:
1. **Embedded mode**: Runs in the same process as the LLM inference program (e.g., a vLLM instance), by being imported as a shared library.
2. **Standalone mode**: Runs as an independent process.
2. **Standalone mode**: Runs as an independent process. In this mode, the `Client` is separated into two parts: a **dummy** `Client` and a **real** `Client`: The **real** `Client` is a full-featured implementation that runs as a standalone process and directly communicates with other Mooncake Store components. It handles all RPC communications, memory management, and data transfer operations. The **real** `Client` is typically deployed on nodes that contribute memory to the distributed cache pool; The **dummy** `Client` is a lightweight wrapper that forwards all operations to a local **real** `Client` via RPC calls, which is designed for scenarios where the client needs to be embedded in the same process as the application (such as vLLM), but the actual Mooncake Store operations should be handled by a standalone process. The **dummy** `Client` and the **real** `Client` communicate via RPC calls and shared memory to make sure that Zero-copy transfers are still possible.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also sync the update to docs/source/design/mooncake-store.md?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Free the aligned allocated memory
if (buffer_) {
// Free the aligned allocated memory or unmap shared memory
if (!is_external_memory_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to check buffer_?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@YiXR YiXR force-pushed the split-client branch 8 times, most recently from f2af7c0 to f73b712 Compare November 24, 2025 08:32
const std::vector<std::string> &keys) {
const auto kNullString = pybind11::bytes("\\0", 0);
if (!store_ || !store_->client_) {
if (!store_ || (!use_dummy_client_ && !store_->client_)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a bug if use_dummy_client_ but client is not initialized,

Copy link
Author

@YiXR YiXR Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dummy does not have client_ and it will never use it.

LOG(ERROR) << "Shared memory is not registered";
return -1;
}
if (registered_size_ + local_buffer_size_ + size > shm_size_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs a annotation. the shm_size is only used for sglang and local buffer, not include global segment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I‘ve changed the "global_buffer" into "mem_pool" in dummy setup to avoid misunderstanding.

easylog::set_min_severity(easylog::Severity::WARN);
// Initialize client pools
coro_io::client_pool<coro_rpc::coro_rpc_client>::pool_config pool_conf{};
const char* value = std::getenv("MC_RPC_PROTOCOL");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there needs a error handler when env is not present. And now only support rdma protocol with coro?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


shm_name_ = "/dummy_client_shm_" + std::to_string(client_id_.first) + "_" +
std::to_string(client_id_.second);
shm_size_ = local_buffer_size + global_segment_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value contains both local_buffer_size and global_segment_size. seems conflict with the meaning in alloc_from_mem_pool? And the sglang L2 cache is dismissed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I‘ve changed the "global_buffer" into "mem_pool" in dummy setup to avoid misunderstanding.


// Inform the RealClient about the shared memory
ret = register_shm(shm_name_, reinterpret_cast<uint64_t>(shm_base_addr_),
shm_size_, local_buffer_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when recovering from a crash, the setup is also called. And the new shmem will be the same as the last time?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new method of remapping after a crash needs to be realized. In this new method, shm should be mapped from the fd not the name.

}

// Dummy only register buffer within the shared memory region
int DummyClient::register_buffer(void* buffer, size_t size) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

register_buffer and alloc_from_mem_pool can be combined?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alloc_from_mem_pool is a alloc method, if sglang/vllm alloc success, it will use register_buffer to notify.

@XucSh XucSh self-assigned this Nov 25, 2025
Copy link
Collaborator

@ykwd ykwd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good to me. Just a few comments

py::arg("protocol"), py::arg("rdma_devices"),
py::arg("master_server_addr"), py::arg("engine") = py::none())
py::arg("master_server_addr"), py::arg("engine") = py::none(),
py::arg("use_dummy_client") = false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better for future extensibility (e.g., supporting new architectures) if we change this parameter to a string-type client_type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add new setup func for dummy.

}

pybind11::list batch_get_tensor(const std::vector<std::string> &keys) {
if (!store_ || !store_->client_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For currently unsupported APIs, it would be better to explicitly return an error. Otherwise, the invocation will fail and it might not be easy to debug.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just added more log for that.


pybind11::bytes get(const std::string &key) {
if (!store_ || !store_->client_) {
if (!store_ || (!use_dummy_client_ && !store_->client_)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to abstract this if condition into a separate function and include a comment inside the function explaining why the condition is checked this way? As it stands, the reasoning behind this condition isn’t very intuitive, and some interfaces don’t include the use_dummy_client_ part of the condition. After some time, future maintainers may not understand what this logic is meant for.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@XucSh XucSh added run-ci and removed run-ci labels Nov 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants