
Conversation

@richardhuo-nv (Contributor) commented Oct 22, 2025

Overview:

Introducing a dedicated KVBM pip wheel has several implications:

  1. A separate PyO3 bindings binary is now generated.
  2. A dedicated Docker build file is provided for building the wheel from source.
  3. Certain dependencies, such as ETCD, have been removed from KVBM.
  4. The DistributedRuntime is now optional within KVBM.

Details:

Changes in KVBM:

  1. Removed the ETCD-based leader–worker synchronization and replaced it with a pure ZMQ messaging system. Barrier control is now achieved by having each worker manage its own allocation state so that blocks are not reallocated twice (a minimal sketch of the messaging pattern follows this list).
  2. Introduced a dedicated PyO3 bindings binary for KVBM and removed the bindings code from dynamo-runtime.
  3. Made the DistributedRuntime optional in KVBM. Since Tokio thread pools cannot be passed across bindings, we now reinitialize the Tokio runtime thread pool within KVBM; this is also required when no DistributedRuntime is present, so the bindings have a Tokio runtime environment to operate in.
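To make the new coordination pattern concrete, the following is a minimal, hypothetical sketch of a PUB-broadcast plus PULL-acknowledgement round in Python with pyzmq. It only illustrates the messaging shape described above (the leader publishes, every worker ACKs back before the round completes); the actual implementation is the Rust code in lib/llm/src/block_manager/distributed/zmq.rs, and the port numbers and message fields below are assumptions.

import zmq

# Assumed ports, mirroring DYN_KVBM_LEADER_ZMQ_PUB_PORT / DYN_KVBM_LEADER_ZMQ_ACK_PORT.
PUB_PORT = 56003
ACK_PORT = 56004


def leader(num_workers: int) -> None:
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)    # broadcast to all workers
    pub.bind(f"tcp://*:{PUB_PORT}")
    pull = ctx.socket(zmq.PULL)  # collect per-worker ACKs
    pull.bind(f"tcp://*:{ACK_PORT}")

    poller = zmq.Poller()
    poller.register(pull, zmq.POLLIN)
    acked = 0
    while acked < num_workers:
        # Re-broadcast until every worker has joined and ACKed (handles late subscribers).
        pub.send_json({"id": 1, "op": "handshake"})
        if poller.poll(timeout=200):  # milliseconds
            if pull.recv_json().get("id") == 1:
                acked += 1


def worker(rank: int, leader_host: str = "127.0.0.1") -> None:
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(f"tcp://{leader_host}:{PUB_PORT}")
    sub.setsockopt_string(zmq.SUBSCRIBE, "")  # receive all broadcasts
    push = ctx.socket(zmq.PUSH)
    push.connect(f"tcp://{leader_host}:{ACK_PORT}")

    msg = sub.recv_json()                            # wait for the leader's broadcast
    push.send_json({"id": msg["id"], "rank": rank})  # ACK back to the leader

Because each worker acknowledges every round itself, no external store such as ETCD is needed for barrier control; the leader simply waits until the expected number of ACKs arrives.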

Changes in CI:

  1. Added Dockerfile.kvbm_wheel, allowing users to build the KVBM wheel directly from source. This Dockerfile is based on glibc 2.28 and manylinux; NIXL binaries are built in a glibc 2.28 environment to ensure compatibility with all glibc 2.28+ Linux systems.
  2. Integrated a KVBM wheel build process into build.sh. The generated KVBM wheel is copied into the Dynamo base image when ENABLE_KVBM is set to true (a quick import check of the installed wheel is sketched below).
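As a quick sanity check after the wheel is installed into the image, one could run something like the following (a hedged sketch; the re-exported names follow the kvbm/__init__.py added in this PR, and the connector path mentioned in the comment requires vLLM to import):

# Verify the installed kvbm wheel exposes its top-level entry points.
from kvbm import BlockManager, KvbmLeader, KvbmWorker  # re-exported from kvbm._core

# Framework integrations now live under the kvbm namespace instead of dynamo.llm,
# e.g. kvbm.vllm_integration.connector.dynamo_connector (importing it requires vLLM).
print(BlockManager.__name__, KvbmLeader.__name__, KvbmWorker.__name__)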

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Introduced a new kvbm workspace package as a standalone KV cache block manager module.
    • Added support for ZMQ-based leader/worker coordination for improved distributed KV cache management.
    • Enabled optional distributed runtime integration for flexible deployment scenarios.
  • Bug Fixes

    • Replaced barrier-based synchronization with more reliable ZMQ-based communication patterns.
  • Chores

    • Refactored KV cache connectors to use new kvbm package paths.
    • Updated container builds to support optional kvbm wheel installation.

envs:
  - name: DYN_KVBM_CPU_CACHE_GB
    value: "100"
  - name: DYN_KVBM_BARRIER_ID_PREFIX
Contributor:

Since DYN_KVBM_BARRIER_ID_PREFIX is removed, I just want to make sure that, with the new ZMQ sync implementation, it still works when there are multiple leaders in the system, e.g. more than one node using KVBM, right?

Contributor Author:

Yes, leaders will not talk to ETCD anymore; each one just uses the ports in its own pod network namespace. So yes, it still works.

@coderabbitai bot commented Oct 22, 2025

Walkthrough

This pull request extracts KV cache block management functionality into a standalone lib/kvbm crate with Python and Rust bindings, supporting vLLM and TensorRT-LLM runtimes. It introduces new build infrastructure for KVBM wheels, refactors distributed synchronization from barrier-based to ZMQ-based leader/worker coordination, and updates import paths across the codebase to reference the new kvbm namespace.

Changes

  • Workspace & Core Project Setup
    Files: Cargo.toml, lib/kvbm/Cargo.toml, lib/kvbm/pyproject.toml, lib/kvbm/LICENSE, lib/kvbm/README.md
    Summary: Added lib/kvbm as new workspace member; created new Python/Rust crate with comprehensive metadata, dependencies, build configuration via maturin, and documentation.
  • Build Infrastructure & Containerization
    Files: container/Dockerfile, container/Dockerfile.kvbm_wheel, container/Dockerfile.trtllm, container/Dockerfile.vllm, container/build.sh, container/build_kvbm_wheel.sh
    Summary: Introduced KVBM build flag and optional wheel installation; created dedicated KVBM wheel Dockerfile with CUDA 13, OpenSSL, gRPC, etcd-cpp, UCX, and Nixl builds; added build script with new --kvbm-pip-wheel-dir option and build_kvbm_wheel.sh.
  • CUDA 13 & Runtime Deps
    Files: container/deps/kvbm/install_cuda13.sh, container/deps/trtllm/install_nixl.sh
    Summary: Added CUDA 13 installation script for RHEL8 systems with architecture normalization and GPG validation; minor formatting fix in Nixl install script.
  • KVBM Python Package (Init & Type Stubs)
    Files: lib/kvbm/python/kvbm/__init__.py, lib/kvbm/python/kvbm/_core.pyi, lib/kvbm/python/kvbm/utils.py
    Summary: Re-exported BlockManager, KvbmLeader, KvbmWorker from _core; added type stubs for Layer, Block, BlockList, BlockManager, KvbmCacheManager, KvbmRequest; introduced utilities for CUDA 13 detection, Nixl plugin setup, and dynamic runtime enablement.
  • KVBM TensorRT-LLM Integration
    Files: lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py, kvbm_connector_worker.py, lib/kvbm/python/kvbm/trtllm_integration/rust.py
    Summary: Updated import paths to kvbm namespace; added conditional dynamic runtime support with DistributedRuntime placeholder and gating via is_dyn_runtime_enabled(); changed runtime initialization logic.
  • KVBM vLLM Integration (Python)
    Files: lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py, pd_connector.py, lib/kvbm/python/kvbm/vllm_integration/connector_leader.py, connector_worker.py, kv_cache_manager.py, kv_cache_utils.py, rust.py
    Summary: Migrated imports from dynamo.llm.* to kvbm.* namespace; added conditional CUDA 13 Nixl plugin initialization and dynamic runtime support gating; updated runtime initialization to use optional Arc-wrapped DistributedRuntime.
  • KVBM Rust Core & Block Manager
    Files: lib/kvbm/src/lib.rs, lib/kvbm/src/block_manager.rs, lib/kvbm/src/block_manager/distributed.rs, lib/kvbm/src/block_manager/distributed/utils.rs
    Summary: Introduced PyO3 module with Tokio/cancellation token initialization; changed BlockManager from mandatory to optional Arc-wrapped DistributedRuntime; added ZMQ URL helpers (get_leader_zmq_pub_url, get_leader_zmq_ack_url); removed barrier ID prefix logic.
  • KVBM Rust Distributed Leader & Worker
    Files: lib/kvbm/src/block_manager/distributed/leader.rs, worker.rs
    Summary: Removed barrier synchronization and barrier ID prefixes; introduced optional distributed runtime handling; replaced barrier config with ZMQ leader URL endpoints; updated runtime acquisition via get_current_tokio_handle().
  • KVBM Rust vLLM Connector
    Files: lib/kvbm/src/block_manager/vllm/connector/leader.rs, leader/recorder.rs, leader/slot.rs, worker.rs, trtllm_leader.rs, trtllm_worker.rs, lib/kvbm/src/block_manager/vllm.rs
    Summary: Refactored from mandatory DistributedRuntime to optional Arc-wrapped; removed barrier ID prefixes and replaced with ZMQ leader URLs; updated runtime/token access patterns via get_current_tokio_handle() and get_current_cancel_token(); simplified PyO3 signatures to accept optional PyObject.
  • Core Library Refactoring (Bindings)
    Files: lib/bindings/python/Cargo.toml, lib/bindings/python/rust/lib.rs, lib/bindings/python/rust/llm.rs, lib/bindings/python/src/dynamo/llm/__init__.py
    Summary: Removed block-manager feature from Cargo.toml and llm module; added to_capsule() method to DistributedRuntime for weak Arc capsule interop; removed conditional BlockManager imports.
  • Core Library Refactoring (LLM Block Manager & Utils)
    Files: lib/bindings/python/rust/llm/block_manager/distributed/utils.rs, lib/llm/Cargo.toml, lib/llm/src/block_manager/distributed.rs, lib/llm/src/block_manager/distributed/...
    Summary: Removed get_barrier_id_prefix() function; added bincode dependency; removed barrier ID generation in distributed tests; introduced LeaderMetadata/WorkerMetadata structs and ZMQ message constants; replaced barrier synchronization with async gated ping/handshake workflow.
  • Core Library ZMQ & Scheduling
    Files: lib/llm/src/block_manager/distributed/zmq.rs
    Summary: Replaced single Leader constructor with new_with_handshake<F>() supporting payload collection; added new_leader_sockets() helper; extended broadcast_collect() API; introduced reply() and mark_handled() on MessageHandle; removed ping handler and replaced startup logic with handshake flow.
  • Core Library State & Backends
    Files: lib/llm/src/block_manager/state/resources.rs
    Summary: Removed UCX backend creation path during NIXL initialization; retained only GDS_MT backend option.
  • Deployment & Configuration
    Files: components/backends/vllm/deploy/disagg_kvbm_2p2d.yaml, disagg_kvbm_tp2.yaml, components/backends/vllm/launch/disagg_kvbm_2p2d.sh
    Summary: Removed DYN_KVBM_BARRIER_ID_PREFIX environment variable from pod specs; replaced barrier IDs with ZMQ leader ports (56003 for PUB, 56004 for ACK) in launch configuration.
  • Integration & Documentation
    Files: components/src/dynamo/vllm/args.py, docs/kvbm/trtllm-setup.md, docs/kvbm/vllm-setup.md, tests/kvbm_integration/README.md, tests/kvbm_integration/test_determinism_agg.py, lib/kvbm/tests/test_kvbm_vllm_integration.py, pyproject.toml
    Summary: Updated connector module paths from dynamo.llm.* to kvbm.*; updated integration test paths; added pytest ignore glob for kvbm package.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Rationale: This refactoring involves substantial architectural changes (barrier→ZMQ synchronization, mandatory→optional runtime handling), heterogeneous edits across multiple languages (Rust/Python), new build infrastructure, and significant interdependencies between distributed components. While individual changes follow consistent patterns, the cumulative density of logic changes, new public APIs (zmq.rs handshake, distributed metadata types, PyO3 integration), and refactored runtime handling require careful, multi-faceted analysis.

Poem

🐰 A warren of blocks now stands alone,
No barriers block this path of stone,
With ZMQ winds and capsules bright,
KVBM hops into the light!
From dynamo's grasp, it takes its flight,
⚡✨

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 65.09%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The PR title "chore: KVBM pip wheel" is clear, concise, and directly related to the main objective of the changeset. The title accurately captures the primary purpose—creating a dedicated KVBM pip wheel—which is supported by the extensive changes across multiple files including the new lib/kvbm workspace member, Dockerfile.kvbm_wheel, build scripts, and architectural modifications to support a standalone wheel. While the changeset includes significant internal refactoring (ZMQ messaging, optional DistributedRuntime, dedicated PyO3 bindings), these are all enablers for the pip wheel objective, making the title appropriately focused on the main change. The title is descriptive enough that a teammate scanning history would clearly understand this PR is about creating a KVBM pip wheel infrastructure.
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.


@coderabbitai bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
components/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)

16-16: Update stale comment referencing barrier IDs.

This comment still references "barrier id prefixes," but the implementation now uses ZMQ ports for leader–worker coordination. Update the comment to reflect the new synchronization method.

Apply this diff:

-# NOTE: use different barrier id prefixes for each prefill worker to avoid conflicts
+# NOTE: the first worker is the leader; the second worker uses ZMQ ports to connect
lib/kvbm/tests/test_kvbm_vllm_integration.py (1)

822-822: Mock patch target appears inconsistent with actual imports.

The mock patch decorator still references dynamo.llm.vllm_integration.kv_cache_manager.KvbmCacheManager, but line 30 imports from kvbm.vllm_integration.kv_cache_manager. The mock target should match the actual import path.

Apply this diff to fix the mock target:

-@patch("dynamo.llm.vllm_integration.kv_cache_manager.KvbmCacheManager")
+@patch("kvbm.vllm_integration.kv_cache_manager.KvbmCacheManager")
 def test_kvbm_new_matched_tokens_edge_case(MockCacheManager):
lib/kvbm/src/block_manager.rs (1)

155-179: Don’t panic with unimplemented!; return an error to Python.

unimplemented! will abort the process under load tests. Return a typed PyErr instead.

-            unimplemented!("Leader not provided");
+            return Err(to_pyerr(anyhow::anyhow!(
+                "Leader not provided. Pass KvbmLeader or use BlockManagerBuilder with explicit device/host/disk layout."
+            )));
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)

151-158: Don’t call tokio::task::block_in_place here

This path is likely executed outside a Tokio runtime; block_in_place will panic. Use Handle::block_on directly.

-        tokio::task::block_in_place(|| {
-            handle.block_on(async {
-                match leader_ready_rx.await {
-                    Ok(_) => tracing::info!("KvConnectorLeader init complete."),
-                    Err(_) => tracing::warn!("KvConnectorLeader init channel dropped"),
-                }
-            });
-        });
+        handle.block_on(async {
+            match leader_ready_rx.await {
+                Ok(_) => tracing::info!("KvConnectorLeader init complete."),
+                Err(_) => tracing::warn!("KvConnectorLeader init channel dropped"),
+            }
+        });

339-352: Avoid panics: replace assert! with error or debug_assert!

assert! will crash production on transient ordering issues. Prefer debug_assert! or return an error/log.

-            assert!(
-                inflight_requests.remove(request_id),
-                "request_id {request_id} not found in inflight_requests: "
-            );
+            if !inflight_requests.remove(request_id) {
+                tracing::warn!("request_id {} not found in inflight_requests", request_id);
+            }

Apply similarly to other assert! sites in this file.

Also applies to: 421-424, 337-341

lib/llm/src/block_manager/distributed/worker.rs (1)

420-446: Require shape length >= 4 for FullyContiguous

shape[3..] on len==3 yields empty product (1), making inner_dim bogus.

-        if shape.len() < 3 {
+        if shape.len() < 4 {
             return Err(anyhow::anyhow!(format!(
                 "Unsupported kv cache layout. Got shape: {:?}",
                 shape
             )));
         }
lib/kvbm/src/block_manager/vllm/connector/worker.rs (2)

140-145: Don’t abort the process on bad input; return an error instead.

assert_eq! will panic in release builds. Validate and return anyhow::Error to keep the interpreter alive.

-        assert_eq!(
-            kv_caches.len(),
-            raw_event_handles.len(),
-            "kv_caches and raw_event_handles must have the same length"
-        );
+        if kv_caches.len() != raw_event_handles.len() {
+            return Err(anyhow::anyhow!(
+                "kv_caches ({}) and raw_event_handles ({}) must have the same length",
+                kv_caches.len(),
+                raw_event_handles.len()
+            ));
+        }

389-396: Replace panics with recoverable errors/logs.

These panic!s can crash the whole process. Prefer logging + early-return or an error.

-                panic!(
-                    "request slot missing for {request_id}; however, it was present when added to the maybe finished offloading set"
-                );
+                tracing::error!(
+                    request_id,
+                    "request slot missing though previously present in maybe_finished_offloading; treating as aborted"
+                );
+                continue;
-                panic!(
-                    "request slot missing for {request_id}; however, it was present when added to the maybe finished onboarding set"
-                );
+                tracing::error!(
+                    request_id,
+                    "request slot missing though previously present in maybe_finished_onboarding; treating as aborted"
+                );
+                continue;

Also applies to: 423-426

lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (1)

329-332: Avoid crashing on missing request slots.

Replace panic! with log + continue to keep service alive.

-                panic!(
-                    "request slot missing for {request_id}; however, it was present when added to the maybe finished offloading set"
-                );
+                tracing::error!(
+                    request_id,
+                    "request slot missing although previously tracked for offloading; treating as aborted"
+                );
+                continue;
-                panic!(
-                    "request slot missing for {request_id}; however, it was present when added to the maybe finished onboarding set"
-                );
+                tracing::error!(
+                    request_id,
+                    "request slot missing although previously tracked for onboarding; treating as aborted"
+                );
+                continue;

Also applies to: 361-364

lib/llm/src/block_manager/distributed/zmq.rs (1)

231-280: Memory leak: pending_messages entries from broadcast() are never removed.

Entries live forever because pull_worker doesn’t remove them for pure-ACK messages. Remove the entry when all ACKs are received and want_payload == false.

@@
     pub async fn broadcast(
@@
-        let pending_message = PendingMessage {
+        let pending_message = PendingMessage {
             // We start with the number of workers we're waiting for.
             remaining_workers: *self.num_workers,
-            completion_indicator: Some(completion_indicator),
-            want_payload: false,
-            payloads: None,
+            completion_indicator: Some(completion_indicator),
+            want_payload: false,
+            payloads: None,
         };
@@
     async fn pull_worker(
         mut pull_socket: Pull,
         pending_messages: Arc<Mutex<HashMap<usize, PendingMessage>>>,
         cancel_token: CancellationToken,
     ) -> Result<()> {
@@
-                let mut map = pending_messages.lock().await;
-
-                if let Some(pm) = map.get_mut(&id) {
+                let mut map = pending_messages.lock().await;
+                // Track whether we should remove after we drop the mutable borrow.
+                let mut should_remove = false;
+                if let Some(pm) = map.get_mut(&id) {
@@
-                    if pm.remaining_workers == 0
-                        && let Some(tx) = pm.completion_indicator.take() {
-                            let _ = tx.send(());
-                        }
+                    if pm.remaining_workers == 0 {
+                        if let Some(tx) = pm.completion_indicator.take() {
+                            let _ = tx.send(());
+                        }
+                        // For pure-ACK messages we can remove immediately here to avoid leaks.
+                        if !pm.want_payload {
+                            should_remove = true;
+                        }
+                    }
                 } else {
                     // Late reply for a round we've already collected/removed.
                     tracing::debug!("Leader PULL: late/unknown id {}", id);
                 }
+                if should_remove {
+                    map.remove(&id);
+                }

Also applies to: 344-396

🧹 Nitpick comments (30)
components/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)

26-27: Parameterize ZMQ ports to support runtime configuration in multi-instance deployments.

The shell script currently sets ports to hardcoded values (56003, 56004), preventing override via environment variables without editing the script. While the KVBM implementation already supports environment variable configuration with validation (via validated_port_from_env() in lib/kvbm/src/block_manager/distributed/utils.rs), the shell script doesn't expose this flexibility.

Apply the suggested diff to allow runtime overrides:

-DYN_KVBM_LEADER_ZMQ_PUB_PORT=56003 \
-DYN_KVBM_LEADER_ZMQ_ACK_PORT=56004 \
+DYN_KVBM_LEADER_ZMQ_PUB_PORT=${DYN_KVBM_LEADER_ZMQ_PUB_PORT:-56003} \
+DYN_KVBM_LEADER_ZMQ_ACK_PORT=${DYN_KVBM_LEADER_ZMQ_ACK_PORT:-56004} \

This preserves current behavior (56003/56004 defaults) while enabling users to override ports before script execution, resolving conflicts in containerized or multi-instance scenarios.

lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (1)

29-29: Consider removing commented import.

The commented import on line 29 appears to be unused. Consider removing it rather than leaving commented code, as version control already preserves the history.

-# from kvbm.vllm_integration.kv_cache_utils import KvbmCacheBlocks
 from kvbm.vllm_integration.connector_leader import KvConnectorLeader
 from kvbm.vllm_integration.connector_worker import KvConnectorWorker
lib/kvbm/python/kvbm/vllm_integration/rust.py (1)

25-36: Use warnings and include the exception for easier diagnostics.

Printing to stdout hides the root cause and is noisy. Prefer warnings.warn and show the exception.

-except ImportError:
-    print("Failed to import Dynamo KVBM. vLLM integration will not be available.")
+except ImportError as e:
+    import warnings
+    warnings.warn(f"Failed to import KVBM; vLLM integration unavailable: {e}")
container/build_kvbm_wheel.sh (3)

56-60: Harden docker build: pull latest base and disambiguate tag by arch.

Prevents stale base images and tag collisions across architectures.

-docker build \
-    ${BUILD_ARGS} \
-    -t kvbm-wheel:tmp \
-    -f container/Dockerfile.kvbm_wheel .
+docker build \
+    --pull \
+    ${BUILD_ARGS} \
+    -t kvbm-wheel:${ARCH}-tmp \
+    -f container/Dockerfile.kvbm_wheel .

52-55: Also remove the temporary image in cleanup.

Containers are removed; the image remains and can pile up.

-cid=""
-cleanup() { [[ -n "${cid}" ]] && docker rm -v "$cid" >/dev/null 2>&1 || true; }
+cid=""
+cleanup() {
+  [[ -n "${cid}" ]] && docker rm -v "$cid" >/dev/null 2>&1 || true
+  docker image rm -f kvbm-wheel:${ARCH}-tmp >/dev/null 2>&1 || true
+}

61-65: Fail fast if no wheel artifacts were produced.

Avoids silently succeeding with an empty output directory.

 mkdir -p "$OUTPUT_DIR"
-docker cp "$cid":/opt/dynamo/dist/. "$OUTPUT_DIR"/
+docker cp "$cid":/opt/dynamo/dist/. "$OUTPUT_DIR"/
+if ! ls "$OUTPUT_DIR"/kvbm*.whl >/dev/null 2>&1; then
+  echo "No kvbm wheel found in $OUTPUT_DIR"; exit 1
+fi
container/Dockerfile (1)

385-395: build.sh correctly passes kvbm_wheel build-context — primary concern verified and satisfied.

The kvbm_wheel build-context is properly configured in build.sh (line 734): BUILD_CONTEXT_ARG+=" --build-context kvbm_wheel=${KVBM_PIP_WHEEL_DIR}". The context is unconditionally created to satisfy Docker's build-context requirement, with an empty directory used when the wheel is not built.

However, the suggested refactor for guarding missing files remains a reasonable defensive programming enhancement. When ENABLE_KVBM=true is passed to the Dockerfile but the wheel fails to build in build.sh, the glob expansion kvbm*.whl will match nothing, causing cp to silently behave unexpectedly. Adding explicit file existence validation (as suggested in the refactor) would catch this scenario and fail fast with a clear error message.

Recommendation: The build-context concern is resolved. Apply the suggested refactor for improved error handling and clarity:

 RUN --mount=type=bind,from=kvbm_wheel,source=.,target=/mnt/kvbm_wheel,ro \
     set -e; \
     if [ "${ENABLE_KVBM}" = "true" ]; then \
       mkdir -p /opt/dynamo/wheelhouse; \
-      cp -v /mnt/kvbm_wheel/kvbm*.whl /opt/dynamo/wheelhouse/; \
+      shopt -s nullglob; files=(/mnt/kvbm_wheel/kvbm*.whl); \
+      if [ ${#files[@]} -eq 0 ]; then echo "No KVBM wheel in build-context kvbm_wheel"; exit 1; fi; \
+      cp -v "${files[@]}" /opt/dynamo/wheelhouse/; \
       uv pip install /opt/dynamo/wheelhouse/kvbm*.whl; \
     fi
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (4)

37-39: Validate DistributedRuntime import path.

Elsewhere the type lives under dynamo._core.DistributedRuntime. Confirm that dynamo.runtime.DistributedRuntime is the intended public path, or import from dynamo._core to avoid ImportError when enabled.

Consider lazy-importing inside __init__ to avoid import-time crashes when the env flag is toggled. Based on learnings.


71-74: Avoid print in library path; use logging.

Replace print with logging to respect caller logging config.

-        print(f"KvConnectorLeader initialized with engine_id: {engine_id}")
+        import logging
+        logging.getLogger(__name__).info(
+            "KvConnectorLeader initialized: engine_id=%s", engine_id
+        )

166-170: Double‑check scheduler token assertion fields.

You set output.add_num_scheduled_tokens(scheduler_output.num_scheduled_tokens) but assert equality with scheduler_output.total_num_scheduled_tokens. Ensure these semantics match; otherwise the assertion can trip.

If mismatch, assert against num_scheduled_tokens or set output using total_num_scheduled_tokens for consistency.


195-217: Avoid parameter shadowing; rename local request.

Rebinding request to KvbmRequest harms readability and future edits.

-        request = KvbmRequest(
+        kvbm_req = KvbmRequest(
             request_id=request.request_id,
             lora_name=request.lora_request.lora_name()
             if request.lora_request
             else None,
             salt_hash=request.cache_salt,
         )
-
-        self._connector.create_slot(request, all_token_ids)
+        self._connector.create_slot(kvbm_req, all_token_ids)
lib/kvbm/src/block_manager/distributed/utils.rs (1)

50-64: Optional: avoid repeated env reads and allocate fewer Strings.

Cache host once and return ports as u16 to format at callsite; reduces tiny overhead.

-pub fn get_leader_zmq_pub_url() -> String {
-    format!(
-        "tcp://{}:{}",
-        get_leader_zmq_host(),
-        get_leader_zmq_pub_port()
-    )
-}
+pub fn get_leader_zmq_pub_url() -> String {
+    let host = get_leader_zmq_host();
+    format!("tcp://{}:{}", host, get_leader_zmq_pub_port())
+}

(Same tweak for get_leader_zmq_ack_url.)

container/Dockerfile.kvbm_wheel (2)

152-153: BuildKit assumption: verify remote stage COPY works in CI.

COPY --from=ghcr.io/astral-sh/uv:latest requires BuildKit. Ensure buildx/BuildKit is enabled in CI runners.

If not guaranteed, add an explicit first stage:

FROM ghcr.io/astral-sh/uv:latest as uv
...
COPY --from=uv /uv /uvx /bin/

276-282: Wheel bloat and licensing risk from vendoring full Nixl tree.

Copying /usr/local/nixl/lib64/* into kvbm/nixl-cu13 may bundle unused/ABI-variant .so’s and plugins. Prefer minimal set and let auditwheel/maturin handle delocation.

  • Package only required shared libs and plugin subset; avoid static dev files.
  • Consider separate nixl-cu13 wheel dependency once available (as noted in TODO) and remove manual vendoring.

Track wheel size in CI and fail if it exceeds a threshold (e.g., >200MB).

lib/kvbm/src/block_manager.rs (2)

47-73: Disk offload filter: make env parsing case‑insensitive and explicit.

Current check only accepts "true"/"1". Consider normalize and support "TRUE/True/yes".

-    let disable_filter = std::env::var("DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER")
-        .map(|v| v == "true" || v == "1")
+    let disable_filter = std::env::var("DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER")
+        .map(|v| {
+            let v = v.trim().to_ascii_lowercase();
+            v == "true" || v == "1" || v == "yes"
+        })
         .unwrap_or(false);

291-295: Stale error message.

“runtime is always taken from leader” no longer holds with get_current_tokio_handle(). Update to reflect current behavior.

-            anyhow::anyhow!("leader is required (runtime is always taken from leader)")
+            anyhow::anyhow!("leader is required to derive layouts; runtime comes from the global Tokio handle")
lib/kvbm/src/lib.rs (1)

81-86: Duplicate to_pyerr across crates

to_pyerr here duplicates lib/bindings/python/rust/lib.rs. Prefer a shared helper to avoid divergence.

Consider moving to_pyerr into a common crate (e.g., dynamo-runtime or a small shared utils crate) and reuse.

lib/kvbm/Cargo.toml (2)

42-46: Pin 0.x dependencies to a minor version to avoid accidental breaking updates

"0" allows any 0.x which can be breaking per semver. Recommend pinning.

-tokio-stream = { version = "0" }
-tracing = { version = "0"  }
+tokio-stream = { version = "0.1" }
+tracing = { version = "0.1" }

Consider similar pins for futures/either/thiserror if needed.


25-28: Add rust-version and resolver to ensure reproducible builds

Explicit MSRV and Cargo resolver help CI/users.

 [package]
 name = "kvbm-py3"
 version = "0.1.0"
 edition = "2024"
+rust-version = "1.80" # or the required version for edition 2024 once verified
@@
 [lib]
@@
 crate-type = ["cdylib", "rlib"]
+
+[workspace]
+resolver = "2"

Adjust rust-version after verification.

lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)

545-562: Clarify unused drt parameter

drt is intentionally unused for leader. Keep for API stability but silence lint explicitly.

-        let _ = &drt; // drt is currently un-used in leader
+        let _ = &drt; // API compatibility; leader path does not require drt
+        #[allow(unused_variables)]
+        let drt_for_compat = &drt;

Alternatively, add #[allow(unused_variables)] on the function.


105-147: Use oneshot<()> for readiness signal

The payload is never used; use unit to reduce allocations.

-        let (leader_ready_tx, leader_ready_rx) = oneshot::channel::<String>();
+        let (leader_ready_tx, leader_ready_rx) = oneshot::channel::<()>();
@@
-                if leader_ready_tx.send("finished".to_string()).is_err() {
+                if leader_ready_tx.send(()).is_err() {
                     tracing::error!("main routine receiver dropped before result was sent");
                 }
lib/llm/src/block_manager/distributed/worker.rs (2)

35-51: Relax atomic ordering for WorkerState

SeqCst is likely overkill; Acquire on read, Release on write suffices for a readiness flag.

-    fn mark_ready(&self) {
-        self.ready_for_ping.store(true, Ordering::SeqCst);
-    }
+    fn mark_ready(&self) {
+        self.ready_for_ping.store(true, Ordering::Release);
+    }
-    fn is_ready(&self) -> bool {
-        self.ready_for_ping.load(Ordering::SeqCst)
-    }
+    fn is_ready(&self) -> bool {
+        self.ready_for_ping.load(Ordering::Acquire)
+    }

579-639: Non-blocking init: ensure background worker join errors are surfaced

The spawned worker_task error only logs; consider propagating via a channel so callers can fail fast.

Add an oneshot<Result<()>> to forward worker_task completion errors to the orchestrator and expose via builder or metrics.

lib/kvbm/src/block_manager/distributed/leader.rs (1)

72-82: Optional DistributedRuntime extraction: good, but add a hint in error

If instance check fails, include repr/type to aid debugging.

-        return Err(PyTypeError::new_err(
-            "expected dynamo._core.DistributedRuntime",
-        ));
+        return Err(PyTypeError::new_err(format!(
+            "expected dynamo._core.DistributedRuntime; got {}",
+            obj.get_type().name()?
+        )));
lib/kvbm/src/block_manager/vllm/connector/worker.rs (1)

449-457: Make py_drt optional at the Python call site.

Expose a default of None to avoid forcing callers to pass an argument.

-    #[pyo3(signature = (py_drt, vllm_worker_id))]
-    pub fn new(py_drt: Option<PyObject>, vllm_worker_id: String) -> PyResult<Self> {
+    #[pyo3(signature = (py_drt=None, vllm_worker_id))]
+    pub fn new(py_drt: Option<PyObject>, vllm_worker_id: String) -> PyResult<Self> {
lib/llm/src/block_manager/distributed/leader.rs (2)

27-36: Use GiB (1<<30) or document decimal GB explicitly.

Current math uses 1,000,000,000 bytes/GB. If caller expects GiB, allocations will be off. Either switch to 1_073_741_824 or clarify env var units.

-        ((num_blocks_config.cache_size_in_gb * 1_000_000_000.0) / bytes_per_block as f64) as usize
+        ((num_blocks_config.cache_size_in_gb * 1_073_741_824.0) / bytes_per_block as f64) as usize

127-131: Keep a cancel token to shut down background tasks on drop.

spawn_zmq_task uses a token that isn’t retained. Store it in the struct and cancel in Drop to prevent orphaned tasks.

lib/kvbm/python/kvbm/_core.pyi (2)

112-151: Expose block_size() in the stub.

Used by KvbmCacheManager; missing here.

 class BlockManager:
@@
     ) -> None:
         """
         Create a `BlockManager` object
@@
         """
         ...
+    def block_size(self) -> int: ...

32-36: Doc nit: it’s “in the block,” not “in the list.”

-        Get the number of layers in the list
+        Get the number of layers in the block
lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (1)

397-405: Make py_drt optional at the Python call site.

-    #[pyo3(signature = (py_drt, trtllm_rank))]
-    pub fn new(py_drt: Option<PyObject>, trtllm_rank: String) -> PyResult<Self> {
+    #[pyo3(signature = (py_drt=None, trtllm_rank))]
+    pub fn new(py_drt: Option<PyObject>, trtllm_rank: String) -> PyResult<Self> {
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f93b619 and 0fa5804.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (61)
  • Cargo.toml (1 hunks)
  • components/backends/vllm/deploy/disagg_kvbm_2p2d.yaml (0 hunks)
  • components/backends/vllm/deploy/disagg_kvbm_tp2.yaml (0 hunks)
  • components/backends/vllm/launch/disagg_kvbm_2p2d.sh (1 hunks)
  • components/src/dynamo/vllm/args.py (2 hunks)
  • container/Dockerfile (3 hunks)
  • container/Dockerfile.kvbm_wheel (1 hunks)
  • container/Dockerfile.trtllm (1 hunks)
  • container/Dockerfile.vllm (1 hunks)
  • container/build.sh (4 hunks)
  • container/build_kvbm_wheel.sh (1 hunks)
  • container/deps/kvbm/install_cuda13.sh (1 hunks)
  • container/deps/trtllm/install_nixl.sh (1 hunks)
  • docs/kvbm/trtllm-setup.md (1 hunks)
  • docs/kvbm/vllm-setup.md (1 hunks)
  • lib/bindings/python/Cargo.toml (0 hunks)
  • lib/bindings/python/rust/lib.rs (2 hunks)
  • lib/bindings/python/rust/llm.rs (0 hunks)
  • lib/bindings/python/rust/llm/block_manager/distributed/utils.rs (0 hunks)
  • lib/bindings/python/src/dynamo/llm/__init__.py (0 hunks)
  • lib/kvbm/Cargo.toml (1 hunks)
  • lib/kvbm/LICENSE (1 hunks)
  • lib/kvbm/README.md (1 hunks)
  • lib/kvbm/pyproject.toml (1 hunks)
  • lib/kvbm/python/kvbm/__init__.py (1 hunks)
  • lib/kvbm/python/kvbm/_core.pyi (1 hunks)
  • lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (1 hunks)
  • lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py (1 hunks)
  • lib/kvbm/python/kvbm/trtllm_integration/rust.py (1 hunks)
  • lib/kvbm/python/kvbm/utils.py (1 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (1 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/connector/pd_connector.py (1 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (2 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (3 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py (1 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/kv_cache_utils.py (1 hunks)
  • lib/kvbm/python/kvbm/vllm_integration/rust.py (2 hunks)
  • lib/kvbm/src/block_manager.rs (9 hunks)
  • lib/kvbm/src/block_manager/distributed.rs (1 hunks)
  • lib/kvbm/src/block_manager/distributed/leader.rs (3 hunks)
  • lib/kvbm/src/block_manager/distributed/utils.rs (1 hunks)
  • lib/kvbm/src/block_manager/distributed/worker.rs (6 hunks)
  • lib/kvbm/src/block_manager/vllm.rs (1 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/leader.rs (3 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (2 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (5 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (3 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (7 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/worker.rs (6 hunks)
  • lib/kvbm/src/lib.rs (1 hunks)
  • lib/kvbm/tests/test_kvbm_vllm_integration.py (1 hunks)
  • lib/llm/Cargo.toml (1 hunks)
  • lib/llm/src/block_manager/distributed.rs (0 hunks)
  • lib/llm/src/block_manager/distributed/leader.rs (5 hunks)
  • lib/llm/src/block_manager/distributed/utils.rs (1 hunks)
  • lib/llm/src/block_manager/distributed/worker.rs (10 hunks)
  • lib/llm/src/block_manager/distributed/zmq.rs (10 hunks)
  • lib/llm/src/block_manager/state/resources.rs (0 hunks)
  • pyproject.toml (1 hunks)
  • tests/kvbm_integration/README.md (2 hunks)
  • tests/kvbm_integration/test_determinism_agg.py (2 hunks)
💤 Files with no reviewable changes (8)
  • lib/bindings/python/rust/llm.rs
  • lib/llm/src/block_manager/distributed.rs
  • components/backends/vllm/deploy/disagg_kvbm_tp2.yaml
  • lib/llm/src/block_manager/state/resources.rs
  • lib/bindings/python/rust/llm/block_manager/distributed/utils.rs
  • lib/bindings/python/Cargo.toml
  • components/backends/vllm/deploy/disagg_kvbm_2p2d.yaml
  • lib/bindings/python/src/dynamo/llm/__init__.py
🧰 Additional context used
🧬 Code graph analysis (29)
lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py (2)
lib/kvbm/python/kvbm/vllm_integration/kv_cache_utils.py (1)
  • KvbmCacheBlocks (20-64)
lib/kvbm/python/kvbm/_core.pyi (3)
  • BlockManager (112-214)
  • KvbmCacheManager (216-222)
  • KvbmRequest (225-231)
lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (2)
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (1)
  • KvConnectorLeader (48-217)
lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (1)
  • KvConnectorWorker (52-212)
lib/kvbm/python/kvbm/vllm_integration/rust.py (3)
lib/kvbm/src/lib.rs (1)
  • _core (22-30)
lib/kvbm/src/block_manager/vllm.rs (1)
  • _vllm_integration (41-57)
lib/kvbm/python/kvbm/_core.pyi (1)
  • BlockManager (112-214)
lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (2)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
  • new (89-168)
  • new (543-562)
lib/kvbm/src/lib.rs (1)
  • get_current_tokio_handle (66-72)
lib/bindings/python/rust/lib.rs (2)
lib/runtime/src/distributed.rs (1)
  • fmt (37-39)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • DistributedRuntime (35-65)
lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (5)
lib/kvbm/python/kvbm/_core.pyi (1)
  • KvbmRequest (225-231)
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (1)
  • KvConnectorLeader (48-217)
lib/kvbm/python/kvbm/utils.py (1)
  • is_dyn_runtime_enabled (73-85)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • DistributedRuntime (35-65)
lib/bindings/python/rust/lib.rs (1)
  • detached (467-479)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
  • std (22-22)
  • std (26-28)
  • v (53-53)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (1)
  • val (617-617)
lib/kvbm/src/block_manager/vllm.rs (1)
lib/kvbm/python/kvbm/_core.pyi (1)
  • BlockManager (112-214)
lib/kvbm/src/block_manager/distributed.rs (1)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
  • get_leader_zmq_ack_url (58-64)
  • get_leader_zmq_pub_url (50-56)
lib/llm/src/block_manager/distributed/leader.rs (4)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
  • new (74-107)
  • std (22-22)
  • std (26-28)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
  • new (64-93)
  • new (147-214)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
  • new (89-168)
  • new (543-562)
lib/llm/src/block_manager/distributed/zmq.rs (3)
  • new_leader_sockets (45-67)
  • new_with_handshake (87-229)
  • std (418-418)
lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (3)
lib/kvbm/python/kvbm/utils.py (3)
  • is_cuda_13 (13-45)
  • is_dyn_runtime_enabled (73-85)
  • set_cu13_nixl_plugin_path (51-70)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • DistributedRuntime (35-65)
lib/bindings/python/rust/lib.rs (1)
  • detached (467-479)
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (3)
lib/kvbm/python/kvbm/utils.py (1)
  • is_dyn_runtime_enabled (73-85)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • DistributedRuntime (35-65)
lib/bindings/python/rust/lib.rs (1)
  • detached (467-479)
lib/kvbm/python/kvbm/__init__.py (3)
lib/kvbm/src/lib.rs (1)
  • _core (22-30)
lib/kvbm/python/kvbm/_core.pyi (1)
  • BlockManager (112-214)
lib/llm/src/block_manager/distributed/worker.rs (3)
  • KvbmWorker (131-131)
  • KvbmWorker (144-144)
  • KvbmWorker (160-160)
lib/kvbm/python/kvbm/trtllm_integration/rust.py (2)
lib/kvbm/src/lib.rs (1)
  • _core (22-30)
lib/kvbm/src/block_manager/vllm.rs (1)
  • _vllm_integration (41-57)
lib/llm/src/block_manager/distributed/worker.rs (4)
lib/kvbm/src/block_manager/distributed/leader.rs (1)
  • new (74-107)
lib/llm/src/block_manager/distributed/zmq.rs (2)
  • handle (500-500)
  • bincode (171-171)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
  • new (64-93)
  • new (147-214)
lib/llm/src/block_manager/distributed/transfer.rs (1)
  • handle (216-264)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (2)
lib/kvbm/src/lib.rs (2)
  • get_current_cancel_token (74-79)
  • get_current_tokio_handle (66-72)
lib/runtime/src/utils/tasks/critical.rs (1)
  • new_with_runtime (56-156)
lib/kvbm/python/kvbm/vllm_integration/connector/pd_connector.py (1)
lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (1)
  • DynamoConnector (42-127)
lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py (4)
lib/kvbm/python/kvbm/utils.py (3)
  • is_cuda_13 (13-45)
  • is_dyn_runtime_enabled (73-85)
  • set_cu13_nixl_plugin_path (51-70)
lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (1)
  • KvConnectorWorker (52-212)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • DistributedRuntime (35-65)
lib/bindings/python/rust/lib.rs (1)
  • detached (467-479)
lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (3)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (4)
  • kvbm_metrics_endpoint_enabled (609-613)
  • parse_kvbm_metrics_port (615-634)
  • new (89-168)
  • new (543-562)
lib/kvbm/src/lib.rs (1)
  • get_current_tokio_handle (66-72)
lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-187)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (4)
lib/kvbm/src/block_manager.rs (5)
  • leader (272-275)
  • new (89-200)
  • new (257-262)
  • worker_id (264-267)
  • page_size (268-271)
lib/kvbm/src/lib.rs (1)
  • get_current_tokio_handle (66-72)
lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-187)
lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (2)
  • new (66-135)
  • new (442-452)
lib/kvbm/src/block_manager/distributed/leader.rs (2)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
  • get_leader_zmq_ack_url (58-64)
  • get_leader_zmq_pub_url (50-56)
lib/kvbm/src/lib.rs (2)
  • extract_distributed_runtime_from_obj (94-120)
  • get_current_tokio_handle (66-72)
container/build.sh (1)
container/run.sh (2)
  • missing_requirement (345-347)
  • error (349-352)
lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (4)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
  • get_leader_zmq_ack_url (58-64)
  • get_leader_zmq_pub_url (50-56)
lib/bindings/python/rust/lib.rs (3)
  • to_pyerr (200-205)
  • new (428-464)
  • new (1105-1109)
lib/kvbm/src/lib.rs (4)
  • to_pyerr (81-86)
  • extract_distributed_runtime_from_obj (94-120)
  • get_current_cancel_token (74-79)
  • get_current_tokio_handle (66-72)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
  • new (64-93)
  • new (147-214)
lib/kvbm/src/block_manager/vllm/connector/worker.rs (3)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
  • get_leader_zmq_ack_url (58-64)
  • get_leader_zmq_pub_url (50-56)
lib/bindings/python/rust/lib.rs (3)
  • to_pyerr (200-205)
  • new (428-464)
  • new (1105-1109)
lib/kvbm/src/lib.rs (4)
  • to_pyerr (81-86)
  • extract_distributed_runtime_from_obj (94-120)
  • get_current_cancel_token (74-79)
  • get_current_tokio_handle (66-72)
lib/llm/src/block_manager/distributed/zmq.rs (2)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
  • new (64-93)
  • new (147-214)
lib/llm/src/block_manager/distributed/leader.rs (1)
  • new (118-131)
lib/kvbm/src/lib.rs (3)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
  • std (22-22)
  • std (26-28)
  • new (74-107)
lib/bindings/python/rust/lib.rs (29)
  • m (145-145)
  • m (146-146)
  • m (147-147)
  • m (148-148)
  • m (149-149)
  • m (150-150)
  • m (151-151)
  • m (152-152)
  • m (153-153)
  • m (154-154)
  • m (155-155)
  • m (156-156)
  • m (157-157)
  • m (158-158)
  • m (159-159)
  • m (160-160)
  • new (428-464)
  • new (1105-1109)
  • to_pyerr (200-205)
  • PyErr (237-237)
  • PyErr (242-242)
  • PyErr (266-266)
  • PyErr (278-278)
  • PyErr (502-502)
  • PyErr (508-508)
  • PyErr (519-519)
  • PyErr (534-534)
  • PyErr (614-614)
  • PyErr (1074-1074)
lib/kvbm/src/block_manager.rs (2)
  • new (89-200)
  • new (257-262)
lib/kvbm/src/block_manager.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
  • DistributedRuntime (35-65)
  • CancellationToken (67-78)
lib/kvbm/src/lib.rs (1)
  • get_current_tokio_handle (66-72)
lib/kvbm/python/kvbm/_core.pyi (5)
lib/kvbm/src/block_manager/block_list.rs (1)
  • to_list (41-48)
lib/kvbm/src/block_manager/block.rs (1)
  • to_list (68-75)
lib/kvbm/src/block_manager.rs (3)
  • worker_id (264-267)
  • page_size (268-271)
  • block_size (202-204)
lib/kvbm/src/block_manager/dlpack.rs (1)
  • dtype (46-65)
lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py (1)
  • KvbmCacheManager (35-416)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
  • get_leader_zmq_ack_url (58-64)
  • get_leader_zmq_pub_url (50-56)
lib/kvbm/src/lib.rs (3)
  • extract_distributed_runtime_from_obj (94-120)
  • get_current_tokio_handle (66-72)
  • get_current_cancel_token (74-79)
🪛 Ruff (0.14.1)
lib/kvbm/python/kvbm/utils.py

29-30: try-except-pass detected, consider logging the exception

(S110)


29-29: Do not catch blind exception: Exception

(BLE001)


42-43: try-except-pass detected, consider logging the exception

(S110)


42-42: Do not catch blind exception: Exception

(BLE001)


48-48: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


65-65: Do not catch blind exception: Exception

(BLE001)

lib/kvbm/python/kvbm/vllm_integration/connector_worker.py

17-17: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


19-19: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py

11-11: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


13-13: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

🪛 Shellcheck (0.11.0)
container/deps/kvbm/install_cuda13.sh

[warning] 48-48: NV_LIBNCCL_VERSION appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 49-49: NCCL_VERSION appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)


def is_dyn_runtime_enabled() -> bool:
    """
    Return True if DYN_RUNTIME_ENABLED_KVBM is set to '1' or 'true' (case-insensitive).
Contributor:

What is the use case for setting DYN_RUNTIME_ENABLED_KVBM to true?

Let's update the runbooks accordingly for those use cases.

@richardhuo-nv (Contributor Author), Oct 22, 2025:

This one is for when we would like to use DistributedRuntime in KVBM.
Ideally we would not need this at all and would simply fall back when DistributedRuntime is not available:

try:
    DistributedRuntime.detached()
except:
    # no DRT code...

But DistributedRuntime.detached() will crash the entire process when NATS and etcd are not available, and will not raise a catchable Python exception. So this env var gives people an option when they really want to try out DistributedRuntime.

I don't think we really have a use case that depends on DistributedRuntime. Adding this to the runbooks might confuse a lot of users if they don't have enough background on the distributed runtime. This is just about keeping DRT as an option and maintaining a soft connection between KVBM and dynamo-runtime in Rust.
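For illustration, a minimal sketch of the gating pattern described in this thread, assuming a helper shaped like is_dyn_runtime_enabled from this PR; the DistributedRuntime import path is an assumption:

import os


def is_dyn_runtime_enabled() -> bool:
    # Opt-in flag: DistributedRuntime is only created when explicitly requested,
    # because a failed DistributedRuntime.detached() can abort the whole process
    # rather than raise a catchable Python exception.
    return os.environ.get("DYN_RUNTIME_ENABLED_KVBM", "").strip().lower() in ("1", "true")


def maybe_create_distributed_runtime():
    """Return a detached DistributedRuntime when enabled, otherwise None."""
    if not is_dyn_runtime_enabled():
        return None
    # Imported lazily so the kvbm wheel also works without dynamo-runtime installed.
    from dynamo.runtime import DistributedRuntime  # assumed import path

    return DistributedRuntime.detached()

When this returns None, KVBM falls back to its own Tokio runtime as described in the PR overview.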

@ziqifan617 (Contributor):

> Changes in KVBM:
>
>   1. Removed the ETCD-based leader–worker synchronization and replaced it with a pure ZMQ messaging system. Barrier control is now achieved by having each worker manage its own allocation state to ensure blocks are not reallocated twice.
>   2. Introduced a dedicated PyO3 bindings binary for KVBM and removed the bindings code from dynamo-runtime.
>   3. Made the DistributedRuntime optional in KVBM. Since Tokio thread pools cannot be passed across bindings, we now reinitialize the Tokio runtime thread pool within KVBM. This is also required when no DistributedRuntime is present, to provide a Tokio runtime environment for the bindings to operate.

Seems to me these three changes could be separated into three individual PRs for easier review and validation?

I can understand if meeting the DDL makes it easier to combine them into a single bigger one.

@richardhuo-nv (Contributor Author) commented Oct 22, 2025

> Seems to me these three changes could be separated into three individual PRs for easier review and validation?

For the first one, I have a PR: #3202, but no one has really reviewed it until today. We can get that one merged first; can you give it a review?

For 2 and 3, we have to put them together for a dedicated wheel, since generating the PyO3 bindings is tied to generating a pip wheel: when using maturin and PyO3, you have to provide a pyproject.toml file for each binding, and you cannot have two bindings under one pyproject.toml. This is also coupled to how we pass the DistributedRuntime across the bindings and make it optional.

@richardhuo-nv (Contributor Author): /ok to test d5058da

@richardhuo-nv (Contributor Author): /ok to test a9e904c

@richardhuo-nv (Contributor Author): /ok to test 1784faa
