-
Notifications
You must be signed in to change notification settings - Fork 655
chore: KVBM pip wheel #3826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
chore: KVBM pip wheel #3826
Conversation
first first fix debug debug debug debug debug debug debug debug debug debug fix fix fix fix fix fix fix fix fox fox fix fix fix
fix
optional drt optional drt fix bindings remove ucx nixl plugin loading fix fmt
| envs: | ||
| - name: DYN_KVBM_CPU_CACHE_GB | ||
| value: "100" | ||
| - name: DYN_KVBM_BARRIER_ID_PREFIX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since DYN_KVBM_BARRIER_ID_PREFIX is removed, just want to make sure that with the new ZMQ sync impl, it would still work when there are multiple leaders in the system, e.g. more than 1 node using KVBM, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, leaders will not talk to ETCD anymore, they will just using the port in their own pod network space. So yeah, it still work.
WalkthroughThis pull request extracts KV cache block management functionality into a standalone Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Rationale: This refactoring involves substantial architectural changes (barrier→ZMQ synchronization, mandatory→optional runtime handling), heterogeneous edits across multiple languages (Rust/Python), new build infrastructure, and significant interdependencies between distributed components. While individual changes follow consistent patterns, the cumulative density of logic changes, new public APIs (zmq.rs handshake, distributed metadata types, PyO3 integration), and refactored runtime handling require careful, multi-faceted analysis. Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
components/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)
16-16: Update stale comment referencing barrier IDs.This comment still references "barrier id prefixes," but the implementation now uses ZMQ ports for leader–worker coordination. Update the comment to reflect the new synchronization method.
Apply this diff:
-# NOTE: use different barrier id prefixes for each prefill worker to avoid conflicts +# NOTE: the first worker is the leader; the second worker uses ZMQ ports to connectlib/kvbm/tests/test_kvbm_vllm_integration.py (1)
822-822: Mock patch target appears inconsistent with actual imports.The mock patch decorator still references
dynamo.llm.vllm_integration.kv_cache_manager.KvbmCacheManager, but line 30 imports fromkvbm.vllm_integration.kv_cache_manager. The mock target should match the actual import path.Apply this diff to fix the mock target:
-@patch("dynamo.llm.vllm_integration.kv_cache_manager.KvbmCacheManager") +@patch("kvbm.vllm_integration.kv_cache_manager.KvbmCacheManager") def test_kvbm_new_matched_tokens_edge_case(MockCacheManager):lib/kvbm/src/block_manager.rs (1)
155-179: Don’t panic with unimplemented!; return an error to Python.unimplemented! will abort the process under load tests. Return a typed PyErr instead.
- unimplemented!("Leader not provided"); + return Err(to_pyerr(anyhow::anyhow!( + "Leader not provided. Pass KvbmLeader or use BlockManagerBuilder with explicit device/host/disk layout." + )));lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
151-158: Don’t call tokio::task::block_in_place hereThis path is likely executed outside a Tokio runtime; block_in_place will panic. Use Handle::block_on directly.
- tokio::task::block_in_place(|| { - handle.block_on(async { - match leader_ready_rx.await { - Ok(_) => tracing::info!("KvConnectorLeader init complete."), - Err(_) => tracing::warn!("KvConnectorLeader init channel dropped"), - } - }); - }); + handle.block_on(async { + match leader_ready_rx.await { + Ok(_) => tracing::info!("KvConnectorLeader init complete."), + Err(_) => tracing::warn!("KvConnectorLeader init channel dropped"), + } + });
339-352: Avoid panics: replace assert! with error or debug_assert!assert! will crash production on transient ordering issues. Prefer debug_assert! or return an error/log.
- assert!( - inflight_requests.remove(request_id), - "request_id {request_id} not found in inflight_requests: " - ); + if !inflight_requests.remove(request_id) { + tracing::warn!("request_id {} not found in inflight_requests", request_id); + }Apply similarly to other assert! sites in this file.
Also applies to: 421-424, 337-341
lib/llm/src/block_manager/distributed/worker.rs (1)
420-446: Require shape length >= 4 for FullyContiguousshape[3..] on len==3 yields empty product (1), making inner_dim bogus.
- if shape.len() < 3 { + if shape.len() < 4 { return Err(anyhow::anyhow!(format!( "Unsupported kv cache layout. Got shape: {:?}", shape ))); }lib/kvbm/src/block_manager/vllm/connector/worker.rs (2)
140-145: Don’t abort the process on bad input; return an error instead.
assert_eq!will panic in release builds. Validate and return anyhow::Error to keep the interpreter alive.- assert_eq!( - kv_caches.len(), - raw_event_handles.len(), - "kv_caches and raw_event_handles must have the same length" - ); + if kv_caches.len() != raw_event_handles.len() { + return Err(anyhow::anyhow!( + "kv_caches ({}) and raw_event_handles ({}) must have the same length", + kv_caches.len(), + raw_event_handles.len() + )); + }
389-396: Replace panics with recoverable errors/logs.These
panic!s can crash the whole process. Prefer logging + early-return or an error.- panic!( - "request slot missing for {request_id}; however, it was present when added to the maybe finished offloading set" - ); + tracing::error!( + request_id, + "request slot missing though previously present in maybe_finished_offloading; treating as aborted" + ); + continue;- panic!( - "request slot missing for {request_id}; however, it was present when added to the maybe finished onboarding set" - ); + tracing::error!( + request_id, + "request slot missing though previously present in maybe_finished_onboarding; treating as aborted" + ); + continue;Also applies to: 423-426
lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (1)
329-332: Avoid crashing on missing request slots.Replace
panic!with log + continue to keep service alive.- panic!( - "request slot missing for {request_id}; however, it was present when added to the maybe finished offloading set" - ); + tracing::error!( + request_id, + "request slot missing although previously tracked for offloading; treating as aborted" + ); + continue;- panic!( - "request slot missing for {request_id}; however, it was present when added to the maybe finished onboarding set" - ); + tracing::error!( + request_id, + "request slot missing although previously tracked for onboarding; treating as aborted" + ); + continue;Also applies to: 361-364
lib/llm/src/block_manager/distributed/zmq.rs (1)
231-280: Memory leak:pending_messagesentries frombroadcast()are never removed.Entries live forever because
pull_workerdoesn’t remove them for pure-ACK messages. Remove the entry when all ACKs are received andwant_payload == false.@@ pub async fn broadcast( @@ - let pending_message = PendingMessage { + let pending_message = PendingMessage { // We start with the number of workers we're waiting for. remaining_workers: *self.num_workers, - completion_indicator: Some(completion_indicator), - want_payload: false, - payloads: None, + completion_indicator: Some(completion_indicator), + want_payload: false, + payloads: None, }; @@ async fn pull_worker( mut pull_socket: Pull, pending_messages: Arc<Mutex<HashMap<usize, PendingMessage>>>, cancel_token: CancellationToken, ) -> Result<()> { @@ - let mut map = pending_messages.lock().await; - - if let Some(pm) = map.get_mut(&id) { + let mut map = pending_messages.lock().await; + // Track whether we should remove after we drop the mutable borrow. + let mut should_remove = false; + if let Some(pm) = map.get_mut(&id) { @@ - if pm.remaining_workers == 0 - && let Some(tx) = pm.completion_indicator.take() { - let _ = tx.send(()); - } + if pm.remaining_workers == 0 { + if let Some(tx) = pm.completion_indicator.take() { + let _ = tx.send(()); + } + // For pure-ACK messages we can remove immediately here to avoid leaks. + if !pm.want_payload { + should_remove = true; + } + } } else { // Late reply for a round we've already collected/removed. tracing::debug!("Leader PULL: late/unknown id {}", id); } + if should_remove { + map.remove(&id); + }Also applies to: 344-396
🧹 Nitpick comments (30)
components/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)
26-27: Parameterize ZMQ ports to support runtime configuration in multi-instance deployments.The shell script currently sets ports to hardcoded values (56003, 56004), preventing override via environment variables without editing the script. While the KVBM implementation already supports environment variable configuration with validation (via
validated_port_from_env()inlib/kvbm/src/block_manager/distributed/utils.rs), the shell script doesn't expose this flexibility.Apply the suggested diff to allow runtime overrides:
-DYN_KVBM_LEADER_ZMQ_PUB_PORT=56003 \ -DYN_KVBM_LEADER_ZMQ_ACK_PORT=56004 \ +DYN_KVBM_LEADER_ZMQ_PUB_PORT=${DYN_KVBM_LEADER_ZMQ_PUB_PORT:-56003} \ +DYN_KVBM_LEADER_ZMQ_ACK_PORT=${DYN_KVBM_LEADER_ZMQ_ACK_PORT:-56004} \This preserves current behavior (56003/56004 defaults) while enabling users to override ports before script execution, resolving conflicts in containerized or multi-instance scenarios.
lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (1)
29-29: Consider removing commented import.The commented import on line 29 appears to be unused. Consider removing it rather than leaving commented code, as version control already preserves the history.
-# from kvbm.vllm_integration.kv_cache_utils import KvbmCacheBlocks from kvbm.vllm_integration.connector_leader import KvConnectorLeader from kvbm.vllm_integration.connector_worker import KvConnectorWorkerlib/kvbm/python/kvbm/vllm_integration/rust.py (1)
25-36: Use warnings and include the exception for easier diagnostics.Printing to stdout hides the root cause and is noisy. Prefer
warnings.warnand show the exception.-except ImportError: - print("Failed to import Dynamo KVBM. vLLM integration will not be available.") +import warnings +except ImportError as e: + warnings.warn(f"Failed to import KVBM; vLLM integration unavailable: {e}")container/build_kvbm_wheel.sh (3)
56-60: Harden docker build: pull latest base and disambiguate tag by arch.Prevents stale base images and tag collisions across architectures.
-docker build \ - ${BUILD_ARGS} \ - -t kvbm-wheel:tmp \ - -f container/Dockerfile.kvbm_wheel . +docker build \ + --pull \ + ${BUILD_ARGS} \ + -t kvbm-wheel:${ARCH}-tmp \ + -f container/Dockerfile.kvbm_wheel .
52-55: Also remove the temporary image in cleanup.Containers are removed; the image remains and can pile up.
-cid="" -cleanup() { [[ -n "${cid}" ]] && docker rm -v "$cid" >/dev/null 2>&1 || true; } +cid="" +cleanup() { + [[ -n "${cid}" ]] && docker rm -v "$cid" >/dev/null 2>&1 || true + docker image rm -f kvbm-wheel:${ARCH}-tmp >/dev/null 2>&1 || true +}
61-65: Fail fast if no wheel artifacts were produced.Avoids silently succeeding with an empty output directory.
mkdir -p "$OUTPUT_DIR" -docker cp "$cid":/opt/dynamo/dist/. "$OUTPUT_DIR"/ +docker cp "$cid":/opt/dynamo/dist/. "$OUTPUT_DIR"/ +if ! ls "$OUTPUT_DIR"/kvbm*.whl >/dev/null 2>&1; then + echo "No kvbm wheel found in $OUTPUT_DIR"; exit 1 +ficontainer/Dockerfile (1)
385-395: build.sh correctly passeskvbm_wheelbuild-context — primary concern verified and satisfied.The kvbm_wheel build-context is properly configured in build.sh (line 734):
BUILD_CONTEXT_ARG+=" --build-context kvbm_wheel=${KVBM_PIP_WHEEL_DIR}". The context is unconditionally created to satisfy Docker's build-context requirement, with an empty directory used when the wheel is not built.However, the suggested refactor for guarding missing files remains a reasonable defensive programming enhancement. When
ENABLE_KVBM=trueis passed to the Dockerfile but the wheel fails to build in build.sh, the glob expansionkvbm*.whlwill match nothing, causingcpto silently behave unexpectedly. Adding explicit file existence validation (as suggested in the refactor) would catch this scenario and fail fast with a clear error message.Recommendation: The build-context concern is resolved. Apply the suggested refactor for improved error handling and clarity:
RUN --mount=type=bind,from=kvbm_wheel,source=.,target=/mnt/kvbm_wheel,ro \ set -e; \ if [ "${ENABLE_KVBM}" = "true" ]; then \ mkdir -p /opt/dynamo/wheelhouse; \ - cp -v /mnt/kvbm_wheel/kvbm*.whl /opt/dynamo/wheelhouse/; \ + shopt -s nullglob; files=(/mnt/kvbm_wheel/kvbm*.whl); \ + if [ ${#files[@]} -eq 0 ]; then echo "No KVBM wheel in build-context kvbm_wheel"; exit 1; fi; \ + cp -v "${files[@]}" /opt/dynamo/wheelhouse/; \ uv pip install /opt/dynamo/wheelhouse/kvbm*.whl; \ filib/kvbm/python/kvbm/vllm_integration/connector_leader.py (4)
37-39: Validate DistributedRuntime import path.Elsewhere the type lives under dynamo._core.DistributedRuntime. Confirm that dynamo.runtime.DistributedRuntime is the intended public path, or import from dynamo._core to avoid ImportError when enabled.
Consider lazy-importing inside init to avoid import-time crashes when the env flag is toggled. Based on learnings.
71-74: Avoid print in library path; use logging.Replace print with logging to respect caller logging config.
- print(f"KvConnectorLeader initialized with engine_id: {engine_id}") + import logging + logging.getLogger(__name__).info( + "KvConnectorLeader initialized: engine_id=%s", engine_id + )
166-170: Double‑check scheduler token assertion fields.You set output.add_num_scheduled_tokens(scheduler_output.num_scheduled_tokens) but assert equality with scheduler_output.total_num_scheduled_tokens. Ensure these semantics match; otherwise the assertion can trip.
If mismatch, assert against num_scheduled_tokens or set output using total_num_scheduled_tokens for consistency.
195-217: Avoid parameter shadowing; rename local request.Rebinding request to KvbmRequest harms readability and future edits.
- request = KvbmRequest( + kvbm_req = KvbmRequest( request_id=request.request_id, lora_name=request.lora_request.lora_name() if request.lora_request else None, salt_hash=request.cache_salt, ) - - self._connector.create_slot(request, all_token_ids) + self._connector.create_slot(kvbm_req, all_token_ids)lib/kvbm/src/block_manager/distributed/utils.rs (1)
50-64: Optional: avoid repeated env reads and allocate fewer Strings.Cache host once and return ports as u16 to format at callsite; reduces tiny overhead.
-pub fn get_leader_zmq_pub_url() -> String { - format!( - "tcp://{}:{}", - get_leader_zmq_host(), - get_leader_zmq_pub_port() - ) -} +pub fn get_leader_zmq_pub_url() -> String { + let host = get_leader_zmq_host(); + format!("tcp://{}:{}", host, get_leader_zmq_pub_port()) +}(Same tweak for get_leader_zmq_ack_url.)
container/Dockerfile.kvbm_wheel (2)
152-153: BuildKit assumption: verify remote stage COPY works in CI.COPY --from=ghcr.io/astral-sh/uv:latest requires BuildKit. Ensure buildx/BuildKit is enabled in CI runners.
If not guaranteed, add an explicit first stage:
FROM ghcr.io/astral-sh/uv:latest as uv ... COPY --from=uv /uv /uvx /bin/
276-282: Wheel bloat and licensing risk from vendoring full Nixl tree.Copying /usr/local/nixl/lib64/* into kvbm/nixl-cu13 may bundle unused/ABI-variant .so’s and plugins. Prefer minimal set and let auditwheel/maturin handle delocation.
- Package only required shared libs and plugin subset; avoid static dev files.
- Consider separate nixl-cu13 wheel dependency once available (as noted in TODO) and remove manual vendoring.
Track wheel size in CI and fail if it exceeds a threshold (e.g., >200MB).
lib/kvbm/src/block_manager.rs (2)
47-73: Disk offload filter: make env parsing case‑insensitive and explicit.Current check only accepts "true"/"1". Consider normalize and support "TRUE/True/yes".
- let disable_filter = std::env::var("DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER") - .map(|v| v == "true" || v == "1") + let disable_filter = std::env::var("DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER") + .map(|v| { + let v = v.trim().to_ascii_lowercase(); + v == "true" || v == "1" || v == "yes" + }) .unwrap_or(false);
291-295: Stale error message.“runtime is always taken from leader” no longer holds with get_current_tokio_handle(). Update to reflect current behavior.
- anyhow::anyhow!("leader is required (runtime is always taken from leader)") + anyhow::anyhow!("leader is required to derive layouts; runtime comes from the global Tokio handle")lib/kvbm/src/lib.rs (1)
81-86: Duplicate to_pyerr across cratesto_pyerr here duplicates lib/bindings/python/rust/lib.rs. Prefer a shared helper to avoid divergence.
Consider moving to_pyerr into a common crate (e.g., dynamo-runtime or a small shared utils crate) and reuse.
lib/kvbm/Cargo.toml (2)
42-46: Pin 0.x dependencies to a minor version to avoid accidental breaking updates"0" allows any 0.x which can be breaking per semver. Recommend pinning.
-tokio-stream = { version = "0" } -tracing = { version = "0" } +tokio-stream = { version = "0.1" } +tracing = { version = "0.1" }Consider similar pins for futures/either/thiserror if needed.
25-28: Add rust-version and resolver to ensure reproducible buildsExplicit MSRV and Cargo resolver help CI/users.
[package] name = "kvbm-py3" version = "0.1.0" edition = "2024" +rust-version = "1.80" # or the required version for edition 2024 once verified @@ [lib] @@ crate-type = ["cdylib", "rlib"] + +[workspace] +resolver = "2"Adjust rust-version after verification.
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
545-562: Clarify unused drt parameterdrt is intentionally unused for leader. Keep for API stability but silence lint explicitly.
- let _ = &drt; // drt is currently un-used in leader + let _ = &drt; // API compatibility; leader path does not require drt + #[allow(unused_variables)] + let drt_for_compat = &drt;Alternatively, add #[allow(unused_variables)] on the function.
105-147: Use oneshot<()> for readiness signalThe payload is never used; use unit to reduce allocations.
- let (leader_ready_tx, leader_ready_rx) = oneshot::channel::<String>(); + let (leader_ready_tx, leader_ready_rx) = oneshot::channel::<()>(); @@ - if leader_ready_tx.send("finished".to_string()).is_err() { + if leader_ready_tx.send(()).is_err() { tracing::error!("main routine receiver dropped before result was sent"); }lib/llm/src/block_manager/distributed/worker.rs (2)
35-51: Relax atomic ordering for WorkerStateSeqCst is likely overkill; Acquire on read, Release on write suffices for a readiness flag.
- fn mark_ready(&self) { - self.ready_for_ping.store(true, Ordering::SeqCst); - } + fn mark_ready(&self) { + self.ready_for_ping.store(true, Ordering::Release); + } - fn is_ready(&self) -> bool { - self.ready_for_ping.load(Ordering::SeqCst) - } + fn is_ready(&self) -> bool { + self.ready_for_ping.load(Ordering::Acquire) + }
579-639: Non-blocking init: ensure background worker join errors are surfacedThe spawned worker_task error only logs; consider propagating via a channel so callers can fail fast.
Add an oneshot<Result<()>> to forward worker_task completion errors to the orchestrator and expose via builder or metrics.
lib/kvbm/src/block_manager/distributed/leader.rs (1)
72-82: Optional DistributedRuntime extraction: good, but add a hint in errorIf instance check fails, include repr/type to aid debugging.
- return Err(PyTypeError::new_err( - "expected dynamo._core.DistributedRuntime", - )); + return Err(PyTypeError::new_err(format!( + "expected dynamo._core.DistributedRuntime; got {}", + obj.get_type().name()? + )));lib/kvbm/src/block_manager/vllm/connector/worker.rs (1)
449-457: Makepy_drtoptional at the Python call site.Expose a default of
Noneto avoid forcing callers to pass an argument.- #[pyo3(signature = (py_drt, vllm_worker_id))] - pub fn new(py_drt: Option<PyObject>, vllm_worker_id: String) -> PyResult<Self> { + #[pyo3(signature = (py_drt=None, vllm_worker_id))] + pub fn new(py_drt: Option<PyObject>, vllm_worker_id: String) -> PyResult<Self> {lib/llm/src/block_manager/distributed/leader.rs (2)
27-36: Use GiB (1<<30) or document decimal GB explicitly.Current math uses 1,000,000,000 bytes/GB. If caller expects GiB, allocations will be off. Either switch to 1_073_741_824 or clarify env var units.
- ((num_blocks_config.cache_size_in_gb * 1_000_000_000.0) / bytes_per_block as f64) as usize + ((num_blocks_config.cache_size_in_gb * 1_073_741_824.0) / bytes_per_block as f64) as usize
127-131: Keep a cancel token to shut down background tasks on drop.
spawn_zmq_taskuses a token that isn’t retained. Store it in the struct and cancel inDropto prevent orphaned tasks.lib/kvbm/python/kvbm/_core.pyi (2)
112-151: Exposeblock_size()in the stub.Used by
KvbmCacheManager; missing here.class BlockManager: @@ ) -> None: """ Create a `BlockManager` object @@ """ ... + def block_size(self) -> int: ...
32-36: Doc nit: it’s “in the block,” not “in the list.”- Get the number of layers in the list + Get the number of layers in the blocklib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (1)
397-405: Makepy_drtoptional at the Python call site.- #[pyo3(signature = (py_drt, trtllm_rank))] - pub fn new(py_drt: Option<PyObject>, trtllm_rank: String) -> PyResult<Self> { + #[pyo3(signature = (py_drt=None, trtllm_rank))] + pub fn new(py_drt: Option<PyObject>, trtllm_rank: String) -> PyResult<Self> {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.locklib/bindings/python/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (61)
Cargo.toml(1 hunks)components/backends/vllm/deploy/disagg_kvbm_2p2d.yaml(0 hunks)components/backends/vllm/deploy/disagg_kvbm_tp2.yaml(0 hunks)components/backends/vllm/launch/disagg_kvbm_2p2d.sh(1 hunks)components/src/dynamo/vllm/args.py(2 hunks)container/Dockerfile(3 hunks)container/Dockerfile.kvbm_wheel(1 hunks)container/Dockerfile.trtllm(1 hunks)container/Dockerfile.vllm(1 hunks)container/build.sh(4 hunks)container/build_kvbm_wheel.sh(1 hunks)container/deps/kvbm/install_cuda13.sh(1 hunks)container/deps/trtllm/install_nixl.sh(1 hunks)docs/kvbm/trtllm-setup.md(1 hunks)docs/kvbm/vllm-setup.md(1 hunks)lib/bindings/python/Cargo.toml(0 hunks)lib/bindings/python/rust/lib.rs(2 hunks)lib/bindings/python/rust/llm.rs(0 hunks)lib/bindings/python/rust/llm/block_manager/distributed/utils.rs(0 hunks)lib/bindings/python/src/dynamo/llm/__init__.py(0 hunks)lib/kvbm/Cargo.toml(1 hunks)lib/kvbm/LICENSE(1 hunks)lib/kvbm/README.md(1 hunks)lib/kvbm/pyproject.toml(1 hunks)lib/kvbm/python/kvbm/__init__.py(1 hunks)lib/kvbm/python/kvbm/_core.pyi(1 hunks)lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py(1 hunks)lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py(1 hunks)lib/kvbm/python/kvbm/trtllm_integration/rust.py(1 hunks)lib/kvbm/python/kvbm/utils.py(1 hunks)lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py(1 hunks)lib/kvbm/python/kvbm/vllm_integration/connector/pd_connector.py(1 hunks)lib/kvbm/python/kvbm/vllm_integration/connector_leader.py(2 hunks)lib/kvbm/python/kvbm/vllm_integration/connector_worker.py(3 hunks)lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py(1 hunks)lib/kvbm/python/kvbm/vllm_integration/kv_cache_utils.py(1 hunks)lib/kvbm/python/kvbm/vllm_integration/rust.py(2 hunks)lib/kvbm/src/block_manager.rs(9 hunks)lib/kvbm/src/block_manager/distributed.rs(1 hunks)lib/kvbm/src/block_manager/distributed/leader.rs(3 hunks)lib/kvbm/src/block_manager/distributed/utils.rs(1 hunks)lib/kvbm/src/block_manager/distributed/worker.rs(6 hunks)lib/kvbm/src/block_manager/vllm.rs(1 hunks)lib/kvbm/src/block_manager/vllm/connector/leader.rs(3 hunks)lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs(2 hunks)lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs(5 hunks)lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs(3 hunks)lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs(7 hunks)lib/kvbm/src/block_manager/vllm/connector/worker.rs(6 hunks)lib/kvbm/src/lib.rs(1 hunks)lib/kvbm/tests/test_kvbm_vllm_integration.py(1 hunks)lib/llm/Cargo.toml(1 hunks)lib/llm/src/block_manager/distributed.rs(0 hunks)lib/llm/src/block_manager/distributed/leader.rs(5 hunks)lib/llm/src/block_manager/distributed/utils.rs(1 hunks)lib/llm/src/block_manager/distributed/worker.rs(10 hunks)lib/llm/src/block_manager/distributed/zmq.rs(10 hunks)lib/llm/src/block_manager/state/resources.rs(0 hunks)pyproject.toml(1 hunks)tests/kvbm_integration/README.md(2 hunks)tests/kvbm_integration/test_determinism_agg.py(2 hunks)
💤 Files with no reviewable changes (8)
- lib/bindings/python/rust/llm.rs
- lib/llm/src/block_manager/distributed.rs
- components/backends/vllm/deploy/disagg_kvbm_tp2.yaml
- lib/llm/src/block_manager/state/resources.rs
- lib/bindings/python/rust/llm/block_manager/distributed/utils.rs
- lib/bindings/python/Cargo.toml
- components/backends/vllm/deploy/disagg_kvbm_2p2d.yaml
- lib/bindings/python/src/dynamo/llm/init.py
🧰 Additional context used
🧬 Code graph analysis (29)
lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py (2)
lib/kvbm/python/kvbm/vllm_integration/kv_cache_utils.py (1)
KvbmCacheBlocks(20-64)lib/kvbm/python/kvbm/_core.pyi (3)
BlockManager(112-214)KvbmCacheManager(216-222)KvbmRequest(225-231)
lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (2)
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (1)
KvConnectorLeader(48-217)lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (1)
KvConnectorWorker(52-212)
lib/kvbm/python/kvbm/vllm_integration/rust.py (3)
lib/kvbm/src/lib.rs (1)
_core(22-30)lib/kvbm/src/block_manager/vllm.rs (1)
_vllm_integration(41-57)lib/kvbm/python/kvbm/_core.pyi (1)
BlockManager(112-214)
lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (2)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
new(89-168)new(543-562)lib/kvbm/src/lib.rs (1)
get_current_tokio_handle(66-72)
lib/bindings/python/rust/lib.rs (2)
lib/runtime/src/distributed.rs (1)
fmt(37-39)lib/bindings/python/src/dynamo/_core.pyi (1)
DistributedRuntime(35-65)
lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_leader.py (5)
lib/kvbm/python/kvbm/_core.pyi (1)
KvbmRequest(225-231)lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (1)
KvConnectorLeader(48-217)lib/kvbm/python/kvbm/utils.py (1)
is_dyn_runtime_enabled(73-85)lib/bindings/python/src/dynamo/_core.pyi (1)
DistributedRuntime(35-65)lib/bindings/python/rust/lib.rs (1)
detached(467-479)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
std(22-22)std(26-28)v(53-53)lib/kvbm/src/block_manager/vllm/connector/leader.rs (1)
val(617-617)
lib/kvbm/src/block_manager/vllm.rs (1)
lib/kvbm/python/kvbm/_core.pyi (1)
BlockManager(112-214)
lib/kvbm/src/block_manager/distributed.rs (1)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
get_leader_zmq_ack_url(58-64)get_leader_zmq_pub_url(50-56)
lib/llm/src/block_manager/distributed/leader.rs (4)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
new(74-107)std(22-22)std(26-28)lib/kvbm/src/block_manager/distributed/worker.rs (2)
new(64-93)new(147-214)lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)
new(89-168)new(543-562)lib/llm/src/block_manager/distributed/zmq.rs (3)
new_leader_sockets(45-67)new_with_handshake(87-229)std(418-418)
lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (3)
lib/kvbm/python/kvbm/utils.py (3)
is_cuda_13(13-45)is_dyn_runtime_enabled(73-85)set_cu13_nixl_plugin_path(51-70)lib/bindings/python/src/dynamo/_core.pyi (1)
DistributedRuntime(35-65)lib/bindings/python/rust/lib.rs (1)
detached(467-479)
lib/kvbm/python/kvbm/vllm_integration/connector_leader.py (3)
lib/kvbm/python/kvbm/utils.py (1)
is_dyn_runtime_enabled(73-85)lib/bindings/python/src/dynamo/_core.pyi (1)
DistributedRuntime(35-65)lib/bindings/python/rust/lib.rs (1)
detached(467-479)
lib/kvbm/python/kvbm/__init__.py (3)
lib/kvbm/src/lib.rs (1)
_core(22-30)lib/kvbm/python/kvbm/_core.pyi (1)
BlockManager(112-214)lib/llm/src/block_manager/distributed/worker.rs (3)
KvbmWorker(131-131)KvbmWorker(144-144)KvbmWorker(160-160)
lib/kvbm/python/kvbm/trtllm_integration/rust.py (2)
lib/kvbm/src/lib.rs (1)
_core(22-30)lib/kvbm/src/block_manager/vllm.rs (1)
_vllm_integration(41-57)
lib/llm/src/block_manager/distributed/worker.rs (4)
lib/kvbm/src/block_manager/distributed/leader.rs (1)
new(74-107)lib/llm/src/block_manager/distributed/zmq.rs (2)
handle(500-500)bincode(171-171)lib/kvbm/src/block_manager/distributed/worker.rs (2)
new(64-93)new(147-214)lib/llm/src/block_manager/distributed/transfer.rs (1)
handle(216-264)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (2)
lib/kvbm/src/lib.rs (2)
get_current_cancel_token(74-79)get_current_tokio_handle(66-72)lib/runtime/src/utils/tasks/critical.rs (1)
new_with_runtime(56-156)
lib/kvbm/python/kvbm/vllm_integration/connector/pd_connector.py (1)
lib/kvbm/python/kvbm/vllm_integration/connector/dynamo_connector.py (1)
DynamoConnector(42-127)
lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py (4)
lib/kvbm/python/kvbm/utils.py (3)
is_cuda_13(13-45)is_dyn_runtime_enabled(73-85)set_cu13_nixl_plugin_path(51-70)lib/kvbm/python/kvbm/vllm_integration/connector_worker.py (1)
KvConnectorWorker(52-212)lib/bindings/python/src/dynamo/_core.pyi (1)
DistributedRuntime(35-65)lib/bindings/python/rust/lib.rs (1)
detached(467-479)
lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (3)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (4)
kvbm_metrics_endpoint_enabled(609-613)parse_kvbm_metrics_port(615-634)new(89-168)new(543-562)lib/kvbm/src/lib.rs (1)
get_current_tokio_handle(66-72)lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (1)
new(88-187)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (4)
lib/kvbm/src/block_manager.rs (5)
leader(272-275)new(89-200)new(257-262)worker_id(264-267)page_size(268-271)lib/kvbm/src/lib.rs (1)
get_current_tokio_handle(66-72)lib/kvbm/src/block_manager/vllm/connector/leader/recorder.rs (1)
new(88-187)lib/kvbm/src/block_manager/vllm/connector/trtllm_leader.rs (2)
new(66-135)new(442-452)
lib/kvbm/src/block_manager/distributed/leader.rs (2)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
get_leader_zmq_ack_url(58-64)get_leader_zmq_pub_url(50-56)lib/kvbm/src/lib.rs (2)
extract_distributed_runtime_from_obj(94-120)get_current_tokio_handle(66-72)
container/build.sh (1)
container/run.sh (2)
missing_requirement(345-347)error(349-352)
lib/kvbm/src/block_manager/vllm/connector/trtllm_worker.rs (4)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
get_leader_zmq_ack_url(58-64)get_leader_zmq_pub_url(50-56)lib/bindings/python/rust/lib.rs (3)
to_pyerr(200-205)new(428-464)new(1105-1109)lib/kvbm/src/lib.rs (4)
to_pyerr(81-86)extract_distributed_runtime_from_obj(94-120)get_current_cancel_token(74-79)get_current_tokio_handle(66-72)lib/kvbm/src/block_manager/distributed/worker.rs (2)
new(64-93)new(147-214)
lib/kvbm/src/block_manager/vllm/connector/worker.rs (3)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
get_leader_zmq_ack_url(58-64)get_leader_zmq_pub_url(50-56)lib/bindings/python/rust/lib.rs (3)
to_pyerr(200-205)new(428-464)new(1105-1109)lib/kvbm/src/lib.rs (4)
to_pyerr(81-86)extract_distributed_runtime_from_obj(94-120)get_current_cancel_token(74-79)get_current_tokio_handle(66-72)
lib/llm/src/block_manager/distributed/zmq.rs (2)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
new(64-93)new(147-214)lib/llm/src/block_manager/distributed/leader.rs (1)
new(118-131)
lib/kvbm/src/lib.rs (3)
lib/kvbm/src/block_manager/distributed/leader.rs (3)
std(22-22)std(26-28)new(74-107)lib/bindings/python/rust/lib.rs (29)
m(145-145)m(146-146)m(147-147)m(148-148)m(149-149)m(150-150)m(151-151)m(152-152)m(153-153)m(154-154)m(155-155)m(156-156)m(157-157)m(158-158)m(159-159)m(160-160)new(428-464)new(1105-1109)to_pyerr(200-205)PyErr(237-237)PyErr(242-242)PyErr(266-266)PyErr(278-278)PyErr(502-502)PyErr(508-508)PyErr(519-519)PyErr(534-534)PyErr(614-614)PyErr(1074-1074)lib/kvbm/src/block_manager.rs (2)
new(89-200)new(257-262)
lib/kvbm/src/block_manager.rs (2)
lib/bindings/python/src/dynamo/_core.pyi (2)
DistributedRuntime(35-65)CancellationToken(67-78)lib/kvbm/src/lib.rs (1)
get_current_tokio_handle(66-72)
lib/kvbm/python/kvbm/_core.pyi (5)
lib/kvbm/src/block_manager/block_list.rs (1)
to_list(41-48)lib/kvbm/src/block_manager/block.rs (1)
to_list(68-75)lib/kvbm/src/block_manager.rs (3)
worker_id(264-267)page_size(268-271)block_size(202-204)lib/kvbm/src/block_manager/dlpack.rs (1)
dtype(46-65)lib/kvbm/python/kvbm/vllm_integration/kv_cache_manager.py (1)
KvbmCacheManager(35-416)
lib/kvbm/src/block_manager/distributed/worker.rs (2)
lib/kvbm/src/block_manager/distributed/utils.rs (2)
get_leader_zmq_ack_url(58-64)get_leader_zmq_pub_url(50-56)lib/kvbm/src/lib.rs (3)
extract_distributed_runtime_from_obj(94-120)get_current_tokio_handle(66-72)get_current_cancel_token(74-79)
🪛 Ruff (0.14.1)
lib/kvbm/python/kvbm/utils.py
29-30: try-except-pass detected, consider logging the exception
(S110)
29-29: Do not catch blind exception: Exception
(BLE001)
42-43: try-except-pass detected, consider logging the exception
(S110)
42-42: Do not catch blind exception: Exception
(BLE001)
48-48: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
65-65: Do not catch blind exception: Exception
(BLE001)
lib/kvbm/python/kvbm/vllm_integration/connector_worker.py
17-17: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
19-19: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
lib/kvbm/python/kvbm/trtllm_integration/connector/kvbm_connector_worker.py
11-11: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
13-13: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
🪛 Shellcheck (0.11.0)
container/deps/kvbm/install_cuda13.sh
[warning] 48-48: NV_LIBNCCL_VERSION appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 49-49: NCCL_VERSION appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
|
|
||
| def is_dyn_runtime_enabled() -> bool: | ||
| """ | ||
| Return True if DYN_RUNTIME_ENABLED_KVBM is set to '1' or 'true' (case-insensitive). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the use case to set DYN_RUNTIME_ENABLED_KVBM to true?
let's update the runbooks accordingly for those use cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is for when we would like to use DistributedRuntime in KVBM.
Ideally, we don't need this, and we should simply do a fallback when DistributedRuntime is not available
try:
DistributedRuntime.detached()
except:
# no DRT code...
But DistributedRuntime.detached() will crash the entire process when nats and etcd are not availiable, and will not raise a catch-able python exception.
So setting this ENV VAR to give an option when someone really want to try out DistributedRuntime.
I don't really feel we have a use case that depends on DistributedRuntime. Adding this to runbooks might confuse a lot of users if they don't have enough background of distributed runtime. This is just for having DRT as an option and keeping a soft connection between kvbm and dynamo-runtime in rust.
seems to me these three changes could be separated into 3 individual PRs for easier review and validation? I can understand if meeting the DDL makes it is easier to combine them into a single bigger one |
For the first one, I have a PR: #3202 For 2 and 3, we have to put them together for a dedicated wheel since generating the PyO3 bindings is somehow related to generating a pip wheel when using maturin and PyO3, you have to put up a pyproject.toml file for each binding and you cannot have two bindings for one pyproject.toml. This somehow bundled how we want to passing the DistributedRuntime across the bindings and make it optional. |
|
/ok to test d5058da |
|
/ok to test a9e904c |
|
/ok to test 1784faa |
Overview:
Having a dedicated KVBM pip wheel introduces several implications:
Details:
Changes in KVBM:
Changes in CI:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
kvbmworkspace package as a standalone KV cache block manager module.Bug Fixes
Chores