feat: KVBM V2 Initial Migration #3861
Conversation
Signed-off-by: jthomson04 <[email protected]>
Walkthrough

This pull request introduces comprehensive CUDA kernel infrastructure and a new v2 block manager subsystem for the LLM framework. It adds tensor transformation kernels (kvbm-kernels), PyTorch Python bindings, a modular memory abstraction layer supporting system/pinned/device/disk storage, physical layout representations (fully contiguous and layer-separated), a transport manager for handling block transfers across devices, and supporting transfer execution paths. Dependency versions are updated and bincode serialization is migrated to v2.
Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Rationale: This PR is substantial in scope with multiple interconnected subsystems spanning CUDA kernels, Rust memory abstractions, layout management, distributed transfer orchestration, and test infrastructure. Review complexity is driven by: (1) high logic density in layout calculations, NIXL integration, and transfer dispatching; (2) diverse file types and concerns (CUDA C++, Rust unsafe blocks, serialization, async/await, trait hierarchies); (3) significant interdependencies requiring understanding of memory allocation, device persistence, and transfer semantics across heterogeneous storage kinds; (4) non-trivial design patterns (typestate builders, trait objects, Arc-wrapped shared state); (5) intricate error handling across CUDA, NIXL, and file I/O boundaries. While many changes follow consistent patterns, the sheer breadth and interconnectedness necessitate careful, methodical review to ensure correctness in memory safety, type erasure, serialization round-trips, and transfer correctness.
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 68
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Cargo.toml (1)
41-69: Fix executable bit and add SPDX header to `lib/kvbm-kernels/Cargo.toml`.

The CI issues are confirmed:

- `lib/kvbm-kernels/Cargo.toml` has the executable bit set and must be removed.
- `lib/kvbm-kernels/Cargo.toml` is missing both SPDX-FileCopyrightText and SPDX-License-Identifier headers.

Fix with:

```bash
git update-index --chmod=-x lib/kvbm-kernels/Cargo.toml
# Then add SPDX headers matching other crate manifests in the workspace
```
🧹 Nitpick comments (83)
lib/runtime/src/config.rs (1)
400-411: LGTM! Clean boolean parser with proper error handling.

The function correctly delegates to `is_truthy` and `is_falsey`, and provides a clear error message. The implementation is straightforward and the error message accurately lists all accepted values.

Consider adding dedicated tests for `parse_bool` to explicitly verify error handling:

```rust
#[test]
fn test_parse_bool() {
    // Test truthy values
    assert_eq!(parse_bool("true").unwrap(), true);
    assert_eq!(parse_bool("1").unwrap(), true);

    // Test falsey values
    assert_eq!(parse_bool("false").unwrap(), false);
    assert_eq!(parse_bool("0").unwrap(), false);

    // Test invalid value
    let result = parse_bool("invalid");
    assert!(result.is_err());
    assert!(result.unwrap_err().to_string().contains("Invalid boolean value"));
}
```

lib/llm/src/block_manager/v2/physical/transfer/testing.rs (6)
214-220: Avoid potential panic: use HashMap::get() instead of indexing

Indexing HashMap with [] will panic if a key is missing. Safer to use get() and error early.
Apply:
```diff
-        for (src_id, _, dst_id) in &self.block_mapping {
-            let src_checksum = &source_checksums[src_id];
-            let dst_checksum = &dest_checksums[dst_id];
-
-            if src_checksum != dst_checksum {
-                mismatches.push((*src_id, *dst_id));
-            }
-        }
+        for (src_id, _, dst_id) in &self.block_mapping {
+            let (Some(src_checksum), Some(dst_checksum)) =
+                (source_checksums.get(src_id), dest_checksums.get(dst_id))
+            else {
+                return Err(anyhow!(
+                    "Missing checksum for src_id={} or dst_id={}",
+                    src_id,
+                    dst_id
+                ));
+            };
+            if src_checksum != dst_checksum {
+                mismatches.push((*src_id, *dst_id));
+            }
+        }
```
174-176: De-duplicate vector builds; reuse precomputed IDs

You already have src_ids; inter IDs are built twice. Compute once and reuse to reduce allocations and noise.
```diff
-    // Step 3: Transfer source → intermediate
-    let src_ids_intermediate: Vec<usize> =
-        self.block_mapping.iter().map(|(src, _, _)| *src).collect();
-    let inter_ids_from_src: Vec<usize> = self
-        .block_mapping
-        .iter()
-        .map(|(_, inter, _)| *inter)
-        .collect();
+    // Step 3: Transfer source → intermediate
+    // Reuse src_ids; precompute inter_ids once
+    let inter_ids: Vec<usize> = self
+        .block_mapping
+        .iter()
+        .map(|(_, inter, _)| *inter)
+        .collect();
     let notification = transfer_blocks(
         &self.source,
         &self.intermediate,
-        &src_ids_intermediate,
-        &inter_ids_from_src,
+        &src_ids,
+        &inter_ids,
         ctx,
     )?;
     notification.await?;

     // Step 4: Transfer intermediate → destination
-    let inter_ids_to_dst: Vec<usize> = self
-        .block_mapping
-        .iter()
-        .map(|(_, inter, _)| *inter)
-        .collect();
-    let dst_ids_from_inter: Vec<usize> =
-        self.block_mapping.iter().map(|(_, _, dst)| *dst).collect();
+    let dst_ids_from_inter: Vec<usize> =
+        self.block_mapping.iter().map(|(_, _, dst)| *dst).collect();
     let notification = transfer_blocks(
         &self.intermediate,
         &self.destination,
-        &inter_ids_to_dst,
+        &inter_ids,
         &dst_ids_from_inter,
         ctx,
     )?;
```
29-31: Expose intermediate IDs in the result for easier debugging

Result only returns (src, dst) mapping; when debugging failures, knowing the intermediate ID is valuable.
Minimal additive change (keep existing field for compatibility):
```diff
 pub struct RoundTripTestResult {
@@
-    /// Block ID mapping used (src_id, dst_id)
-    pub block_mapping: Vec<(usize, usize)>,
+    /// Block ID mapping used (src_id, dst_id)
+    pub block_mapping: Vec<(usize, usize)>,
+    /// Full triple mapping (src_id, intermediate_id, dst_id)
+    pub triplet_mapping: Vec<(usize, usize, usize)>,
@@
     let block_mapping: Vec<(usize, usize)> = self
         .block_mapping
         .iter()
         .map(|(src, _, dst)| (*src, *dst))
         .collect();
     Ok(RoundTripTestResult {
         source_checksums,
         dest_checksums,
         block_mapping,
+        triplet_mapping: self.block_mapping.clone(),
         success,
         mismatches,
     })
```
8-10: Doc nit: “host” → “local (non-remote)” to match code behavior

Source/destination are constrained to non-remote, not strictly host-only. Consider clarifying wording.
```diff
-//! 1. Source blocks (host) → Intermediate (device/disk/remote)
-//! 2. Intermediate → Destination blocks (host, different IDs)
+//! 1. Source blocks (local/host) → Intermediate (device/disk/remote)
+//! 2. Intermediate → Destination blocks (local/host, different IDs)
```
103-108: Locality checks: confirm semantics of is_remote vs “host-only” requirement

You reject only remote layouts. If the true requirement is “CPU-host memory only,” also guard against device/disk here via StorageKind. If non-remote (device/disk) is acceptable, update docs accordingly.
Option if host-only is intended:
```diff
-    if source.is_remote() {
+    if source.is_remote() || !matches!(source.location(), StorageKind::System | StorageKind::Pinned) {
         return Err(anyhow!("Source layout must be local"));
     }
-    if destination.is_remote() {
+    if destination.is_remote() || !matches!(destination.location(), StorageKind::System | StorageKind::Pinned) {
         return Err(anyhow!("Destination layout must be local"));
     }
```
52-57: Simplify success report message

Since numerator == denominator on success, a single count reads cleaner.

```diff
-        format!(
-            "Round-trip test PASSED: {}/{} blocks verified successfully",
-            self.num_blocks(),
-            self.num_blocks()
-        )
+        format!("Round-trip test PASSED: {} blocks verified successfully", self.num_blocks())
```

lib/llm/src/block_manager/v2/physical/transfer/tests/local_transfers.rs (3)
106-107: Remove redundant import.
`TransferOptions` is already imported at the top; this inner `use` is unnecessary.

```diff
-    use crate::block_manager::v2::physical::transfer::TransferOptions;
-
```
195-236: Optional: skip GDS test when backend or device isn’t available.

You already guard on `allow_gds`. Consider additionally short‑circuiting when no CUDA device is present to avoid CI flakes.
282-309: Optional: reduce flakiness for large-count test.

Either gate by env (e.g., KVBM_STRESS=1) or mark `#[ignore]` by default to avoid OOM/timeout on CI without GPUs. A sketch of the env gate follows.
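A hedged sketch of the env gate, assuming the existing test body stays as-is (`KVBM_STRESS` is an illustrative variable name, not an established convention in this repo):

```rust
#[test]
fn test_large_block_count() {
    // Opt-in stress test: skip unless explicitly requested.
    if std::env::var("KVBM_STRESS").as_deref() != Ok("1") {
        eprintln!("skipping stress test; set KVBM_STRESS=1 to run");
        return;
    }
    // ... existing large-count transfer body ...
}
```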
lib/llm/src/block_manager/v2/physical/transfer/notifications/nixl_events.rs (3)

32-55: Duplicate slow-transfer checker; prefer single source of truth

This re-implements check_and_warn_slow_transfer, which already exists in notifications/mod.rs. Please expose the shared helper and reuse it here to avoid divergence.
Minimal change:
- In lib/llm/src/block_manager/v2/physical/transfer/notifications/mod.rs, make the helper pub(super).
- Import and use it here; remove the local copy.
```diff
-/// Helper function to check if a transfer should be warned about and log the warning.
-/// Returns the new last_warned_at time if a warning was issued.
-fn check_and_warn_slow_transfer( ... ) -> Option<Instant> { ... }
+use super::check_and_warn_slow_transfer;
```
112-151: Use a set for O(1) membership checks; reduce per-tick overhead

completed is a Vec and is scanned with contains() for every outstanding UUID, producing O(N²) behavior. Use HashSet.
```diff
-    let mut completed = Vec::new();
+    use std::collections::HashSet;
+    let mut completed = HashSet::new();
@@
-    if outstanding.contains_key(&notif_uuid) {
-        completed.push(notif_uuid);
+    if outstanding.contains_key(&notif_uuid) {
+        completed.insert(notif_uuid);
@@
     for (uuid, transfer) in outstanding.iter_mut() {
         if !completed.contains(uuid) {
@@
     for uuid in completed {
```
64-71: Poll interval likely too aggressive for production

interval(Duration::from_millis(1)) will wake the task 1000x/sec even when idle. Consider a configurable/backoff strategy (e.g., 10–50 ms, or exponential up to a cap) to reduce CPU churn under load; see the sketch below.
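A hedged sketch of one backoff shape (the constants are illustrative, not tuned):

```rust
use std::time::Duration;

/// Doubles the poll period while idle, up to a cap; resets on progress.
struct PollBackoff {
    current: Duration,
    min: Duration,
    max: Duration,
}

impl PollBackoff {
    fn new() -> Self {
        let min = Duration::from_millis(1);
        Self { current: min, min, max: Duration::from_millis(50) }
    }

    /// Call once per poll round; returns how long to sleep before the next one.
    fn next_delay(&mut self, made_progress: bool) -> Duration {
        self.current = if made_progress {
            self.min
        } else {
            (self.current * 2).min(self.max)
        };
        self.current
    }
}
```

The loop would then `tokio::time::sleep(backoff.next_delay(progress)).await` instead of ticking a fixed `interval`.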
lib/llm/src/block_manager/v2/physical/transfer/checksum.rs (1)
145-149: Avoid narrowing cast of file descriptor; use RawFd

fd is u64 but File::from_raw_fd expects RawFd (i32 on Unix). Casting risks truncation if higher bits are set. Prefer RawFd throughout or ensure safe conversion here.

```diff
-    let mut file = ManuallyDrop::new(unsafe { File::from_raw_fd(fd as i32) });
+    use std::os::fd::RawFd;
+    let raw: RawFd = fd as RawFd;
+    let mut file = ManuallyDrop::new(unsafe { File::from_raw_fd(raw) });
```
101-117: Use zip(strict=True) to catch length mismatches

Apply strict=True per B905 hints.

```diff
-    for produced, expected in zip(outputs, universals):
+    for produced, expected in zip(outputs, universals, strict=True):
@@
-    for produced_set, expected_set in zip(blocks, expected_blocks):
-        for produced, expected in zip(produced_set, expected_set):
+    for produced_set, expected_set in zip(blocks, expected_blocks, strict=True):
+        for produced, expected in zip(produced_set, expected_set, strict=True):
```
157-176: Add zip(strict=True) in operational checks

```diff
-    for operational, ref_blocks in zip(operationals, reference_blocks):
+    for operational, ref_blocks in zip(operationals, reference_blocks, strict=True):
@@
-    for produced_set, expected_set in zip(blocks, reference_blocks):
-        for produced, expected in zip(produced_set, expected_set):
+    for produced_set, expected_set in zip(blocks, reference_blocks, strict=True):
+        for produced, expected in zip(produced_set, expected_set, strict=True):
```
36-48: Silence unused unpacked vars (RUF059)

nh/nt/hd are not used; unpack with underscores.

```diff
-    nh, nl, no, nt, hd = universal.shape
+    _nh, nl, no, _nt, _hd = universal.shape
```
262-269: Empty-batch tests require no CUDA context creation in bindings

Current bindings create a CUDA context before checking for empty inputs. Either:
- Skip these tests when CUDA is unavailable, or
- Reorder bindings to early-return on empty input before context creation (preferred; see binding review).
If you keep the tests, add the skip for CPU-only:
```diff
 def test_empty_batch_noop():
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA required for these tests")
```

lib/kvbm-kernels/src/python.rs (1)
806-812: Module name vs. tests

The module is exported as cuda_tensor_kernels. Tests currently import dynamo_kvbm_kernels. Align tests (preferred), or rename the #[pymodule] here to match packaging.
lib/llm/src/block_manager.rs (1)
23-23: Consider feature-gating the v2 surface.

If v2 is still evolving or adds build time for users who don’t need it, gate `pub mod v2;` behind a cargo feature (e.g., `kvbm-v2`). This keeps the public surface lean by default; a minimal sketch follows.
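A minimal sketch of the gating, assuming a feature named `kvbm-v2` (the name is illustrative):

```rust
// In lib/llm/Cargo.toml (assumed manifest location):
//
//   [features]
//   kvbm-v2 = []
//
// In lib/llm/src/block_manager.rs:
#[cfg(feature = "kvbm-v2")]
pub mod v2;
```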
lib/llm/src/block_manager/v2/memory/tests.rs (2)

68-107: Gate CUDA-dependent tests to avoid CI flakes.

These require a CUDA-capable runner. Gate them behind a feature (or ignore by default).

```diff
-mod cuda_tests {
+#[cfg(all(feature = "cuda", not(miri)))]
+mod cuda_tests {
```

Optionally mark each test:

```diff
-    fn test_pinned_storage() {
+    #[cfg_attr(not(feature = "cuda-tests"), ignore = "requires CUDA-enabled runner")]
+    fn test_pinned_storage() {
```
121-127: Skip or feature-gate NIXL registration test.

This hits real NIXL paths and will fail without a running agent/environment.

```diff
-    #[test]
-    fn test_nixl_registration() {
+    #[test]
+    #[cfg_attr(not(feature = "nixl-tests"), ignore = "requires NIXL setup")]
+    fn test_nixl_registration() {
```

lib/llm/src/block_manager/v2/physical/manager/remote.rs (1)
22-57: API parity looks solid; consider dropping allow(dead_code).

RemoteLayout mirrors LocalLayout accessors and semantics. When this type gets referenced, remove #[allow(dead_code)] to let the compiler catch drift early. Otherwise LGTM.
lib/llm/src/block_manager/v2/physical/transfer/tests/cuda_kernel_tests.rs (3)
6-7: Avoid super re-export coupling; import NixlAgent from the crate path.

This makes the tests robust to module re-orgs.

```diff
-    use super::{NixlAgent, create_test_agent};
+    use super::create_test_agent;
+    use crate::block_manager::v2::physical::transfer::nixl_agent::NixlAgent;
```
12-33: Deduplicate layout helpers; reuse shared test utilities.

create_fc_layout/create_lw_layout here duplicate tests/mod.rs. Prefer `use super::{create_fc_layout, create_lw_layout}` and remove the local copies to keep one source of truth.

Also applies to: 35-56
58-96: Gate/skip GPU‑dependent tests for non‑CUDA environments.

These unwrap()s will fail on CPU-only CI. Consider `#[cfg(feature = "cuda")]` at module level or mark with `#[ignore]` and document how to run them (or add a runtime early-return when CUDA is unavailable).

Also applies to: 98-134, 136-172, 174-210
lib/llm/src/block_manager/v2/physical/transfer/notifications/cuda_event.rs (1)
36-87: Gate/skip CUDA‑dependent test.

Mark the test module or test with a CUDA feature/ignore to avoid failures on CPU‑only CI.
lib/llm/src/block_manager/v2/memory/torch.rs (2)
4-21: Consider explicit Cpu variant for clarity.

Using `Cpu` instead of `Other("cpu")` simplifies matching and avoids stringly-typed code. Keep `Other` for extension if needed.
23-29: Prefer usize for data_ptr to match pointer width.

Returning `usize` aligns with Rust pointer-sized integers and avoids casts on non-64-bit targets. Convert to u64 only at FFI boundaries.

lib/llm/src/block_manager/v2/physical/transfer/notifications/notification.rs (2)
47-56: Safer field projection in `poll`

Project to the field explicitly before pinning; this avoids future misuse if a non‑`Unpin` field is added.

```diff
 fn poll(
     mut self: std::pin::Pin<&mut Self>,
     cx: &mut std::task::Context<'_>,
 ) -> std::task::Poll<Self::Output> {
-    use std::pin::Pin;
-    Pin::new(&mut self.status).poll(cx).map(|result| {
+    let this = &mut *self;
+    std::pin::Pin::new(&mut this.status).poll(cx).map(|result| {
         result
             .map_err(|_| anyhow::anyhow!("Transfer handler dropped before completion"))
             .and_then(|r| r)
     })
 }
```
18-28: Add a constructor that returns the sender + handle

Many call sites will need to drive completion/errors. Provide a pair constructor.

```diff
 impl TransferCompleteNotification {
+    /// Create a pending notification and its sender.
+    pub fn pair() -> (tokio::sync::oneshot::Sender<anyhow::Result<()>>, Self) {
+        let (tx, rx) = oneshot::channel();
+        (tx, Self { status: rx })
+    }
```
14-22: Feature wiring for PyO3

Optional: If this crate will be consumed as a Python extension frequently, consider enabling `abi3` for wider compatibility, or document that `python-bindings` requires a matching Python at build time.

Example:

```toml
pyo3 = { version = "0.26", optional = true, features = ["extension-module", "abi3-py39"] }
```

lib/llm/src/block_manager/v2/physical/layout/tests.rs (2)
144-148: Brittle JSON version assertion

Hard-coding `"version":1` will fail on the next version bump. Compare against `LayoutDescriptor::CURRENT_VERSION`.

```diff
-    assert!(json.contains("\"version\":1"));
+    assert!(json.contains(&format!("\"version\":{}", LayoutDescriptor::CURRENT_VERSION)));
```
104-131: Agent dependency in unit tests

Creating a real `NixlAgent` can make unit tests environment‑dependent. Consider gating with a feature flag or using an in‑crate test helper that skips if backends aren’t available.

- Option A: Wrap with `#[cfg(feature = "nixl-tests")]`.
- Option B: Use a helper that attempts `require_backends(..., &[])` and `cargo:warning!`/test skip on failure. Based on relevant snippets in transfer tests.
321-361: Minor: keep descriptor construction coherent

You build `required_size` but initially set `memory_descriptors` to empty, then fix later. Consider constructing the supported case directly to reduce branches in the test.

No functional change; purely to simplify the test.
lib/llm/src/block_manager/v2/physical/manager/local.rs (1)
24-59: `#[allow(dead_code)]` likely unnecessary

Public methods are used by tests and will be referenced by callers; the blanket `#[allow(dead_code)]` can be removed to avoid masking unused code elsewhere.

```diff
-#[allow(dead_code)]
 impl LocalLayout {
```

lib/llm/src/block_manager/v2/physical/manager/handle.rs (1)
23-26: Add range checks to protect reserved bits

Validate inputs so future use of reserved bits can rely on invariants.

```diff
 pub fn new(worker_id: u64, layout_id: u16) -> Self {
-    let handle = (worker_id as u128) | ((layout_id as u128) << 64);
+    // Bits 80..=127 are reserved and must remain zero.
+    let handle = (worker_id as u128) | ((layout_id as u128) << 64);
+    debug_assert_eq!(handle >> 80, 0, "reserved bits must be zero");
     Self(handle)
 }
```

lib/llm/src/block_manager/v2/physical/transfer/preferences.rs (2)
16-27: Future‑proof the enum

Consider marking `NativeVsNixlPolicy` as `#[non_exhaustive]` to allow adding policies without breaking downstream consumers.
34-40: Make TransferPreferences Copy

All fields are `Copy`; deriving `Copy` simplifies usage in hot paths.

```diff
-#[derive(Debug, Clone, Serialize, Deserialize)]
+#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
 pub struct TransferPreferences {
```
23-31: Derive Debug for easier troubleshooting

Debug on config structs improves logging/traceability during failures.

```diff
-#[derive(Clone, Default, Builder)]
+#[derive(Clone, Default, Builder, Debug)]
 pub struct TransferOptions {
```
56-63: Validate layer_range upfront (or at dispatch)

Range bounds are not validated here. Ensure `options.layer_range` is checked against `num_layers` in the executor before use to avoid late failures.

Would you like me to wire a bounds check in `executor::execute_transfer` right after `validate_block_transfer`? A sketch follows.
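A hedged sketch of the upfront check; the helper name and the `Option<Range<usize>>` shape are assumptions about how `options.layer_range` is represented:

```rust
use std::ops::Range;

/// Rejects empty or out-of-bounds layer ranges before any transfer is dispatched.
fn validate_layer_range(
    layer_range: Option<&Range<usize>>,
    num_layers: usize,
) -> anyhow::Result<()> {
    if let Some(range) = layer_range {
        anyhow::ensure!(
            range.start < range.end && range.end <= num_layers,
            "layer_range {:?} out of bounds for {} layers",
            range,
            num_layers
        );
    }
    Ok(())
}
```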
lib/llm/src/block_manager/v2/physical/transfer/fill.rs (2)

107-113: FD typing: prefer RawFd/OwnedFd over u64 cast

`fd as i32` can truncate on unusual platforms. Keep file-descriptor types as `RawFd`/`OwnedFd` end-to-end to avoid casts and lifetime confusion (and to drop ManuallyDrop).
200-272: Tests exercise only System path

Device and Disk branches aren’t tested; the above bugs slipped through.

Add unit tests for (see the scaffold after this list):
- Device: H2D fill validates contents via D2H memcpy.
- Disk: write/read using the same FD with aligned sizes.
I can scaffold these tests with feature gates for CUDA/NIXL.lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (3)
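A hedged scaffold for the Device case; `create_device_layout`, `fill_blocks`, and `read_back_to_host` are hypothetical helpers, not functions this PR defines:

```rust
#[test]
#[cfg_attr(not(feature = "cuda"), ignore = "requires a CUDA device")]
fn test_fill_device_roundtrip() -> anyhow::Result<()> {
    let layout = create_device_layout(/* num_blocks */ 4)?; // hypothetical helper
    fill_blocks(&layout, 0xAB)?;                            // H2D fill under test
    let host_copy = read_back_to_host(&layout)?;            // D2H copy to verify
    assert!(host_copy.iter().all(|&b| b == 0xAB));
    Ok(())
}
```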
91-103: Minor: avoidOk(...?)pattern
Ok(cuda::execute_cuda_transfer(...)?)is noisy. Just return the inner call.- | TransferStrategy::CudaBlockingD2H => Ok(cuda::execute_cuda_transfer( + | TransferStrategy::CudaBlockingD2H => cuda::execute_cuda_transfer( src, dst, src_block_ids, dst_block_ids, layer_range, strategy, ctx, - )?), + ),
272-301: Dead code: TransferNotification is unused

This local notification type isn’t referenced; keep only `TransferCompleteNotification` to reduce API surface and imports.

```diff
-    pub struct TransferNotification {
-        status: Arc<AtomicBool>,
-    }
-    ...
-    impl TransferNotification { ... }
```

Also drop the `Arc`/`AtomicBool`/`Ordering` imports.
182-273: Constructing TransferCompleteNotification via struct literal may break visibility

If `TransferCompleteNotification` fields are private, `TransferCompleteNotification { status: rx }` won’t compile. Prefer a constructor, e.g., `TransferCompleteNotification::from_oneshot(rx)`.

```diff
-    Ok(TransferCompleteNotification { status: rx })
+    Ok(TransferCompleteNotification::from_oneshot(rx))
```

If such a constructor doesn’t exist, I can add one in the notifications module.
lib/kvbm-kernels/src/tensor_kernels.rs (1)
112-140: Optional: early null/zero fast-path guards in wrappers

Consider short-circuiting obvious no-op cases (e.g., `num_blocks == 0` or `nl == 0 || no == 0`) before launching. Minor perf/defensive coding improvement; the CUDA side already tolerates empty work, but avoiding the FFI call is cleaner. A minimal sketch follows.
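A minimal sketch of the guard, using the review's shorthand dimension names (`nl`, `no`) as assumptions:

```rust
/// Returns false when there is no work, so the wrapper can skip the FFI launch.
fn should_launch(num_blocks: usize, nl: usize, no: usize) -> bool {
    num_blocks != 0 && nl != 0 && no != 0
}
```

A wrapper would early-return `Ok(())` when this is false, before touching the CUDA context.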
248-281: Prefer||over bitwise|for booleans; and consider TwoHop instead of earlyErr
- Use
||for readability and short-circuiting.- Returning
Errfor Disk participation and for Device with RDMA disabled prevents staging. Consider returning a TwoHop plan instead to match the conservative policy.- if matches!(src, StorageKind::Disk(_)) | matches!(dst, StorageKind::Disk(_)) { - return Err(anyhow::anyhow!( - "Neither local nor remote disk transfers are supported over NIXL at this time." - )); - } + if matches!(src, StorageKind::Disk(_)) || matches!(dst, StorageKind::Disk(_)) { + return Ok(TransferPlan::TwoHop { + first: TransferStrategy::NixlWrite, // src→host + bounce_location: StorageKind::Pinned, + second: TransferStrategy::NixlWrite, // host→remote + }); + }- if !capabilities.allow_gpu_rdma - && (matches!(src, StorageKind::Device(_)) || matches!(dst, StorageKind::Device(_))) - { - return Err(anyhow::anyhow!( - "GPU RDMA is disabled - this transfer requires GPU RDMA." - )); - } + if !capabilities.allow_gpu_rdma + && (matches!(src, StorageKind::Device(_)) || matches!(dst, StorageKind::Device(_))) + { + return Ok(TransferPlan::TwoHop { + first: TransferStrategy::CudaAsyncD2H, + bounce_location: StorageKind::Pinned, + second: TransferStrategy::NixlWrite, + }); + }
283-506: Add tests that exercise `select_strategy` (not only `select_direct_strategy`)

Current tests validate `select_direct_strategy` paths but don’t cover `select_strategy` behavior for remote. Add cases for Device→Remote with/without RDMA and Disk→Remote to lock intended semantics; one such case is sketched below.
lib/llm/src/block_manager/v2/memory/system.rs (2)

33-47: Use runtime page size (sysconf) instead of hard-coded 4096

Avoid assumptions; query `_SC_PAGESIZE` and fall back to 4096 if unavailable.
_SC_PAGESIZEand fall back to 4096 if unavailable.- // We need 4KB alignment here for NIXL disk transfers to work. + // We need page alignment here for NIXL disk transfers (O_DIRECT). // The O_DIRECT flag is required for GDS. // However, a limitation of this flag is that all operations involving disk // (both read and write) must be page-aligned. // Pinned memory is already page-aligned, so we only need to align system memory. - // TODO(jthomson04): Is page size always 4KB? + // Query the OS page size to avoid assuming 4KB. - // SAFETY: malloc returns suitably aligned memory or null on failure. - let result = unsafe { libc::posix_memalign(&mut ptr, 4096, len) }; + // SAFETY: posix_memalign returns aligned memory or non-zero error code. + let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) }; + let align = if page_size > 0 { page_size as usize } else { 4096 }; + let result = unsafe { libc::posix_memalign(&mut ptr, align, len) };
40-43: Comment nit: refer to `posix_memalign`, not `malloc`

The safety comment says “malloc returns…” but you’re using `posix_memalign`.

lib/llm/src/block_manager/v2/physical/layout/config.rs (2)
82-92: Doc fix: function signature isn’t `Option<usize>`

Tweak wording to match the signature.

```diff
-/// Validation function for Option<usize> to check if it's Some(power_of_2).
+/// Validation function to ensure `alignment` is a power of two.
 pub fn validate_power_of_2(alignment: usize) -> Result<(), ValidationError> {
```
94-101: Clarify error message for dtype width bounds

Eight bytes are allowed; update message accordingly.

```diff
-    "dtype_width_bytes_must_be_power_of_two_and_less_than_8_bytes",
+    "dtype_width_bytes_must_be_power_of_two_and_at_most_8_bytes",
```

lib/llm/src/block_manager/v2/memory/disk.rs (1)
204-368: Remove commented-out dead code.

This large block of commented-out code (165 lines) for an alternative memmapped implementation should be removed. If this implementation is needed for future reference, it can be retrieved from git history. Dead code increases maintenance burden and file size.
lib/llm/src/block_manager/v2/physical/layout/physical.rs (2)
159-169: Prefer constructor over struct literal for MemoryDescriptor

Use MemoryDescriptor::new(...) to avoid brittle field ordering and match style elsewhere.

```diff
-            .map(|region| MemoryDescriptor {
-                addr: region.addr(),
-                size: region.size(),
-            })
+            .map(|region| MemoryDescriptor::new(region.addr(), region.size()))
```
119-123: Avoid leaking Arc in accessor

Expose &dyn Layout instead of &Arc to reduce coupling. Keep an additional method if callers truly need Arc.

```diff
-    pub fn layout(&self) -> &Arc<dyn Layout> {
-        &self.layout
-    }
+    pub fn layout(&self) -> &dyn Layout {
+        &*self.layout
+    }
```

lib/llm/src/block_manager/v2/memory/device.rs (2)
35-37: Document safety for Send/Sync

Add a brief comment explaining why DeviceStorage is safe to Send/Sync (opaque device pointer, no aliasing host access; all frees rebind context).

87-95: Consider storing pointer as usize or NonZeroU64

This avoids repeated casts and encodes the non-null guarantee. Not mandatory; a small sketch follows.
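A small sketch of the newtype idea (the type and method names are illustrative):

```rust
use std::num::NonZeroU64;

/// Encodes the non-null invariant in the type instead of ad-hoc casts.
#[derive(Clone, Copy, Debug)]
struct DevicePtr(NonZeroU64);

impl DevicePtr {
    fn new(raw: u64) -> Option<Self> {
        NonZeroU64::new(raw).map(Self)
    }

    fn as_raw(self) -> u64 {
        self.0.get()
    }
}
```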
lib/llm/src/block_manager/v2/physical/transfer/notifications/mod.rs (1)
80-83: Reduce CPU overhead: make polling interval configurable or adaptive

A fixed 1 ms tick can spin hot under load. Expose a Duration parameter or implement backoff (e.g., 1–10 ms when pending, longer when idle).
Example signature:
```rust
pub async fn process_polling_notifications<C: CompletionChecker>(
    mut rx: mpsc::Receiver<RegisterPollingNotification<C>>,
    poll_every: Duration,
) { /* ... */ }
```

Or compute an adaptive interval based on outstanding size and last completion.
lib/llm/src/block_manager/v2/physical/manager/metadata.rs (2)
188-196: Remove duplicate assert in test

Line 193 repeats the same assert!(!packed.is_empty()).

```diff
-    assert!(!packed.is_empty());
-    assert!(!packed.is_empty());
+    assert!(!packed.is_empty());
```
51-59: Future-proof the envelope with a version field

Consider adding a version to RdmaLayoutDescriptors to allow backward-compatible changes without breaking decode; a sketch follows.
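A hedged sketch of the envelope; the real `RdmaLayoutDescriptors` fields differ, and the bincode v2 call shape is an assumption based on that crate's serde integration:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct VersionedDescriptors {
    /// Bump when the wire format changes; decoders reject unknown versions.
    version: u32,
    payload: Vec<u8>, // placeholder for the packed descriptors
}

fn decode(bytes: &[u8]) -> anyhow::Result<VersionedDescriptors> {
    let (env, _len): (VersionedDescriptors, usize) =
        bincode::serde::decode_from_slice(bytes, bincode::config::standard())?;
    anyhow::ensure!(env.version == 1, "unsupported descriptor version {}", env.version);
    Ok(env)
}
```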
lib/llm/src/block_manager/v2/physical/transfer/executor/cuda.rs (2)
186-223: Asynchronous memcpy with pageable host memory may block

htod/dtoh async only stays async with pinned host memory; otherwise, CUDA may serialize. Ensure host-side regions are pinned or document that async semantics may degrade.

Would you like me to add a StorageKind::Pinned check here and switch to blocking copies otherwise? A sketch follows the note below.
Also applies to: 226-262
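A hedged sketch of the check; the `StorageKind` variants follow this PR's usage, while `CopyMode` is illustrative:

```rust
/// Decide how to issue the memcpy based on the host side's storage kind.
enum CopyMode {
    /// cudaMemcpyAsync stays genuinely asynchronous only with pinned memory.
    Async,
    /// Pageable host memory can silently serialize; prefer an explicit blocking copy.
    Blocking,
}

fn host_copy_mode(host: &StorageKind) -> anyhow::Result<CopyMode> {
    match host {
        StorageKind::Pinned => Ok(CopyMode::Async),
        StorageKind::System => Ok(CopyMode::Blocking),
        other => anyhow::bail!("expected host memory, got {other:?}"),
    }
}
```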
64-69: D2D on H2D stream is a bit odd

Using the H2D stream for D2D is functional but non-obvious. Prefer a dedicated D2D stream or a “transfer” stream for all directions for clarity.
lib/llm/src/block_manager/v2/physical/transfer/validation.rs (1)
142-151: Prefer trait method over config() for num_blocks

Accessing config() ties to a concrete layout; use the trait surface to reduce coupling.

```diff
-    let max_blocks = layout.layout().config().num_blocks;
+    let max_blocks = layout.layout().num_blocks();
```

Based on learnings
lib/llm/src/block_manager/v2/physical/manager/mod.rs (1)
63-66: Builder doc mismatch: returning TransferConfigBuilder doesn’t produce TransportManager

Docs show .build()? on TransportManager::builder(), but this returns TransferConfigBuilder. Ensure the builder exposes a build_manager() -> Result or similar, or fix the docs.

I can wire TransferConfigBuilder::build_manager(self) -> Result that calls TransportManager::from_context; a sketch follows.
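A hedged sketch of that wiring; it assumes `build()` yields the `TransferContext` consumed by `TransportManager::from_context`:

```rust
impl TransferConfigBuilder {
    /// One-stop constructor so `TransportManager::builder()...build_manager()?` works as documented.
    pub fn build_manager(self) -> anyhow::Result<TransportManager> {
        let ctx = self.build()?; // assumed to produce a TransferContext
        TransportManager::from_context(ctx)
    }
}
```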
lib/llm/src/block_manager/v2/physical/layout/serialize.rs (1)
90-99: Consider validating version on (de)serialization callers

LayoutDescriptor::from_* doesn’t check version; ensure callers verify descriptor.version == CURRENT_VERSION before reconstructing layouts, as indicated in physical.rs. If not, add a guard here; a sketch follows.
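A hedged sketch of the guard, assuming the `version` field and `CURRENT_VERSION` constant named in this review:

```rust
impl LayoutDescriptor {
    /// Call from `from_*` before reconstructing a layout.
    pub fn validate_version(&self) -> anyhow::Result<()> {
        anyhow::ensure!(
            self.version == Self::CURRENT_VERSION,
            "unsupported layout descriptor version {} (expected {})",
            self.version,
            Self::CURRENT_VERSION
        );
        Ok(())
    }
}
```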
lib/llm/src/block_manager/v2/physical/layout/mod.rs (1)
24-39: Commented-out registration code: either gate or remove

The commented exports linger. Either add a feature-gate or remove to keep the public surface crisp.
lib/llm/src/block_manager/v2/physical/transfer/nixl_agent/mod.rs (4)
66-77: Use structured logging instead of eprintln!

Library code shouldn’t print to stderr. Use tracing::warn! (consistent with other modules) for failures.

```diff
-        eprintln!(
-            "✗ Failed to create {} backend: {}. Operations requiring this backend will fail.",
-            backend_upper, e
-        );
+        tracing::warn!(
+            backend = %backend_upper,
+            error = %e,
+            "Failed to create backend; operations requiring this backend will fail"
+        );
...
-        eprintln!(
-            "✗ No {} plugin found. Operations requiring this backend will fail.",
-            backend_upper
-        );
+        tracing::warn!(
+            backend = %backend_upper,
+            "No plugin found; operations requiring this backend will fail"
+        );
```

And similarly for lines 127–135.
Also applies to: 127-135
59-60: Normalize with ASCII uppercasing to avoid locale surprises

Plugin names are ASCII; prefer to_ascii_uppercase(). Also avoid allocating where possible by normalizing once.

```diff
-    let backend_upper = backend.to_uppercase();
+    let backend_upper = backend.to_ascii_uppercase();
...
-    let backend_upper = backend.to_uppercase();
+    let backend_upper = backend.to_ascii_uppercase();
...
-    self.available_backends.contains(&backend.to_uppercase())
+    self.available_backends.contains(&backend.to_ascii_uppercase())
```
81-84: Clarify behavior when backends is empty

new_with_backends() errors if none succeed; build() in TransferContext hits an empty branch. Either:
- Special‑case empty to create an agent without backends, or
- Expose a NixlAgent::new(name) that doesn’t initialize backends and call that from the empty branch.
Suggest minimal change: treat empty list as “no init” and do not error.
```diff
 pub fn new_with_backends(name: &str, backends: &[&str]) -> Result<Self> {
     let agent = RawNixlAgent::new(name)?;
     let mut available_backends = HashSet::new();
+    if backends.is_empty() {
+        return Ok(Self { agent, available_backends });
+    }
+
     for backend in backends { ... }
     if available_backends.is_empty() {
         anyhow::bail!("Failed to initialize any NIXL backends from {:?}", backends);
     }
```

Then the TransferContext empty branch remains valid.
Also applies to: 109-115
236-245: Tests assume UCX; make them resilient or conditional

test_require_backend will hard‑fail on hosts without UCX. Consider skipping when unavailable or gating by env/feature to avoid flaky CI.

```diff
-    fn test_require_backend() {
-        let agent = NixlAgent::new_with_backends("test", &["UCX"]).expect("Need UCX for test");
+    fn test_require_backend() {
+        let agent = match NixlAgent::new_with_backends("test", &["UCX"]) {
+            Ok(a) => a,
+            Err(_) => return, // skip if UCX not available
+        };
```
lib/llm/src/block_manager/v2/memory/registered.rs (3)
7-8: Avoid type name collision; alias raw agent as RawNixlAgentThis module’s NixlAgent alias can be confused with the public wrapper type. Rename locally for clarity.
-use nixl_sys::{Agent as NixlAgent, MemType, OptArgs, RegistrationHandle}; +use nixl_sys::{Agent as RawNixlAgent, MemType, OptArgs, RegistrationHandle}; ... -pub fn register_with_nixl<S>( +pub fn register_with_nixl<S>( storage: S, - agent: &NixlAgent, + agent: &RawNixlAgent, opt: Option<&OptArgs>, ) -> std::result::Result<NixlRegistered<S>, S>Note: call sites using the wrapper NixlAgent still work via Deref coercion.
Also applies to: 168-175
99-101: Downcasting: forward as_any to inner storageEnable downcasting to the original concrete storage type, not the wrapper.
- fn as_any(&self) -> &dyn Any { - self - } + fn as_any(&self) -> &dyn Any { + self.storage.as_any() + }
187-194: Return error details on registration failure

Current API loses the error cause. Consider returning (S, NixlError) to aid debugging, or wrap in StorageError.

```diff
 match agent.register_memory(&descriptor, opt) {
-    Ok(handle) => Ok(NixlRegistered { ... }),
-    Err(_) => Err(storage),
+    Ok(handle) => Ok(NixlRegistered { storage, handle: Some(handle), agent_name: agent.name().to_string() }),
+    Err(e) => Err(storage), // consider Err((storage, e)) in a follow-up API tweak
 }
```
162-169: Minor: guard against addr overflowaddr = base.addr() + offset can overflow usize. Although bounds are checked against base.size(), add checked_add in debug to be safe.
fn addr(&self) -> usize { - self.base.addr() + self.offset + self.base.addr().checked_add(self.offset) + .expect("address overflow in OffsetMemoryRegion::addr") }
180-184: QoL: make MemoryDescriptor Copy + accessors inline

Lightweight descriptor benefits from Copy.

```diff
-#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
 pub struct MemoryDescriptor {
     pub addr: usize,
     pub size: usize,
 }
```

(Already derives Eq; ensure Copy is present in the derive list as shown.)
Also applies to: 186-200
lib/llm/src/block_manager/v2/physical/transfer/context.rs (2)
109-115: Empty backend branch: prefer a clear constructor

If no backends are configured, call NixlAgent::new_default(name) or (after adopting the earlier suggestion) NixlAgent::new(name) to avoid passing an empty list to new_with_backends().

```diff
-    let nixl_agent = if backend_names.is_empty() {
-        // No backends configured - create agent without backends
-        NixlAgent::new_with_backends(&agent_name, &[])?
-    } else {
+    let nixl_agent = if backend_names.is_empty() {
+        NixlAgent::new_default(&agent_name)?
+    } else {
         // Create agent with requested backends
         NixlAgent::new_with_backends(&agent_name, &backend_names)?
     };
```
239-253: Dropped notifications on full channels

try_send will silently drop when buffers fill. If losing notifications is unacceptable, switch to send(). If best-effort is intended, at least log drops at debug level.

```diff
-    let _ = self.tx_nixl_status.try_send(notification);
+    if let Err(e) = self.tx_nixl_status.try_send(notification) {
+        tracing::debug!(error = %e, "Dropped NIXL status notification (channel full)");
+    }
```

Repeat for CUDA and NIXL event paths.
Also applies to: 319-323, 338-342, 363-366
lib/kvbm-kernels/cuda/tensor_kernels.cu (2)
201-205: Remove unnecessary const_cast for readability.
`block_chunks` already has type `T* const*`. The cast adds noise.

```diff
-    T* const* block_base = const_cast<T* const*>(block_chunks + block_idx * block_stride);
+    T* const* block_base = block_chunks + block_idx * block_stride;
```
283-289: Guard against size_t overflow when computing total_per_block.
`nh*nl*no*nt*hd` can overflow on large shapes. Add a checked multiply (host-side) and return `cudaErrorInvalidValue` on overflow.

You can implement a tiny helper using `__builtin_mul_overflow` (GCC/Clang) or a conservative max-bound check before launch. Please confirm target compilers support it across your CI matrix. If helpful, I can draft a portable helper.

Also applies to: 344-350
lib/llm/src/block_manager/v2/physical/layout/builder.rs (2)
549-559: Confirm NIXL semantics for File descriptors (offset handling).
`derive_descriptors` skips `addr` adjustment for `MemType::File`. If NIXL expects per-view offsets encoded in `addr`, this will make all views point to offset 0. If offset is conveyed elsewhere, all good; please confirm.

If offsets must be applied, remove the `if mem_type != MemType::File` guard and always adjust `addr`, or add a dedicated `file_offset` field if NIXL supports it.
224-231: `numa_aware` flag currently unused.

The `allocate_pinned(numa_aware: bool)` parameter is ignored in allocation. Either wire it to a NUMA policy or document that it’s reserved.
- Pass the flag into the pinned allocator (e.g., pick allocation node or advise OS).
- If not planned soon, consider removing the parameter to avoid API confusion.
Also applies to: 406-410
lib/llm/src/block_manager/v2/physical/transfer/tests/mod.rs (1)
101-113: Docstring mismatch for `create_test_agent`.

Comment says it “attempts to initialize UCX, GDS, POSIX,” but it calls `require_backends(name, &[])` (no backends). Update comment or call `new_with_backends`.
lib/bindings/python/Cargo.toml
(comment on an outdated diff)

```diff
 dlpark = { version = "0.5", features = ["pyo3", "half"], optional = true }
-cudarc = { version = "0.16.2", features = ["cuda-12020"], optional = true }
+cudarc = { workspace = true, optional = true }
```
Manifest parse error: cannot inherit cudarc from workspace here.
This crate is its own workspace (intentionally excluded from the root), so workspace = true has no workspace.dependencies to inherit from. Pin cudarc directly (or add a local [workspace.dependencies]).
Apply one of the following:
- Simple fix (recommended):
```diff
-cudarc = { workspace = true, optional = true }
+cudarc = { version = "0.17.1", features = ["cuda-12020"], optional = true }
```

- Alternative (keep inheritance but define locally):

```toml
[workspace.dependencies]
cudarc = { version = "0.17.1", features = ["cuda-12020"] }
```

Pipeline error reference: “error inheriting 'cudarc' from workspace root manifest… 'workspace.dependencies' was not defined.” (This crate’s workspace is empty).
🤖 Prompt for AI Agents
In lib/bindings/python/Cargo.toml around line 76: the dependency entry uses
`workspace = true` but this crate is in its own (empty) workspace so it cannot
inherit `cudarc`; replace the inheritance by pinning `cudarc` directly (remove
`workspace = true` and specify a concrete version and any needed features, e.g.
version and features used elsewhere), or alternatively add a local
`[workspace.dependencies]` in this crate's workspace root and declare `cudarc`
there with the required version and features so the dependency can be inherited.
lib/kvbm-kernels/build.rs:

```rust
fn main() {
    println!("cargo:rerun-if-changed=cuda/tensor_kernels.cu");

    let mut build = cc::Build::new();
    build
        .cuda(true)
        .file("cuda/tensor_kernels.cu")
        .flag("-std=c++17")
        .flag("-Xcompiler")
        .flag("-fPIC");

    if let Ok(arch_list) = std::env::var("CUDA_ARCHS") {
        for arch in arch_list.split(',') {
            let arch = arch.trim();
            if arch.is_empty() {
                continue;
            }
            build.flag(format!("-gencode=arch=compute_{arch},code=sm_{arch}"));
        }
    } else {
        // Default to Ampere (SM 80) and Hopper (SM 90) support.
        build.flag("-gencode=arch=compute_80,code=sm_80");
        build.flag("-gencode=arch=compute_90,code=sm_90");
    }

    build.compile("tensor_kernels");

    println!("cargo:rustc-link-lib=cudart");
}
```
Compile error: cc::Build::flag expects &str, not String.
- Use `&format!(...)` for `-gencode` flags. Current code won’t compile.
- Add `rerun-if-env-changed=CUDA_ARCHS`.
- Add SPDX header (pipeline blocker).
```diff
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
 fn main() {
     println!("cargo:rerun-if-changed=cuda/tensor_kernels.cu");
+    println!("cargo:rerun-if-env-changed=CUDA_ARCHS");
@@
-            build.flag(format!("-gencode=arch=compute_{arch},code=sm_{arch}"));
+            build.flag(&format!("-gencode=arch=compute_{arch},code=sm_{arch}"));
         }
     } else {
         // Default to Ampere (SM 80) and Hopper (SM 90) support.
         build.flag("-gencode=arch=compute_80,code=sm_80");
         build.flag("-gencode=arch=compute_90,code=sm_90");
+        // Optional: include forward-compat PTX (commented; enable if needed)
+        // build.flag("-gencode=arch=compute_90,code=compute_90");
     }
```
}🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: Invalid/Missing Header: lib/kvbm-kernels/build.rs
🤖 Prompt for AI Agents
lib/kvbm-kernels/build.rs lines 1-29: the build script fails to compile because
cc::Build::flag expects &str but you pass a String from format!, you also need
to ensure Cargo rebuilds when CUDA_ARCHS changes and include the SPDX header;
fix by adding an SPDX header comment at the top of the file, change
build.flag(format!(...)) to build.flag(&format!(...)) (or assign to a &str) for
each -gencode entry, add a println!("cargo:rerun-if-env-changed=CUDA_ARCHS");
near the top (after the existing rerun-if-changed line) so the build reruns when
that env var changes, and keep the rest of the logic unchanged.
lib/kvbm-kernels/Cargo.toml:

```diff
@@ -0,0 +1,24 @@
+[package]
```
File marked executable without shebang (pipeline failure)
Clear the executable bit on this file.
Run:

```bash
git update-index --chmod=-x lib/kvbm-kernels/Cargo.toml
```

I can add this to the PR if you prefer.
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: Invalid/Missing Header: lib/kvbm-kernels/Cargo.toml
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3861/merge) by jthomson04.
[error] 1-1: check-executables-have-shebangs: Cargo.toml is marked executable but has no (or invalid) shebang!
🤖 Prompt for AI Agents
lib/kvbm-kernels/Cargo.toml lines 1-1: the file has the executable bit set
causing CI to fail; clear the executable permission on
lib/kvbm-kernels/Cargo.toml (use your Git client to unset the executable bit,
e.g. run git update-index --chmod=-x lib/kvbm-kernels/Cargo.toml), then stage
and commit the permission change and push the update.
Missing SPDX header (pipeline failure)
Add SPDX header comments at the top (TOML comments use #). This will clear the copyright check.
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 [package]
 name = "dynamo-kvbm-kernels"
```
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: Invalid/Missing Header: lib/kvbm-kernels/Cargo.toml
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3861/merge) by jthomson04.
[error] 1-1: check-executables-have-shebangs: Cargo.toml is marked executable but has no (or invalid) shebang!
🤖 Prompt for AI Agents
In lib/kvbm-kernels/Cargo.toml around lines 1 to 1, the file is missing the SPDX
header required by the pipeline; add one or more TOML comments at the very top
using '#' such as '# SPDX-License-Identifier: <LICENSE>' (replace <LICENSE> with
the project's license identifier, e.g. Apache-2.0 or MIT) and optionally a
copyright comment line, ensuring the SPDX comment appears before the [package]
section so the copyright check passes.
| [build-system] | ||
| requires = ["maturin>=1.5,<2"] | ||
| build-backend = "maturin" | ||
|
|
||
| [project] | ||
| name = "cuda-tensor-kernels" | ||
| version = "0.1.0" | ||
| description = "CUDA tensor permutation and packing kernels with Python bindings" | ||
| requires-python = ">=3.8" | ||
| readme = "README.md" | ||
| authors = [{ name = "Ryan", email = "[email protected]" }] | ||
| license = { text = "Proprietary" } | ||
| classifiers = [ | ||
| "Programming Language :: Rust", | ||
| "Programming Language :: Python", | ||
| "Programming Language :: Python :: 3", | ||
| "Programming Language :: Python :: 3 :: Only", | ||
| "Operating System :: POSIX :: Linux", | ||
| ] |
🧩 Analysis chain
Missing SPDX header and likely license mismatch.
- Add SPDX header (pipeline blocker).
- Consider aligning license with workspace (Apache-2.0 appears used elsewhere).
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 [build-system]
 requires = ["maturin>=1.5,<2"]
 build-backend = "maturin"
@@
 authors = [{ name = "Ryan", email = "[email protected]" }]
-license = { text = "Proprietary" }
+license = { text = "Apache-2.0" }
 classifiers = [
```

If the workspace uses a different license, replace the value accordingly.
🏁 Script executed:
#!/bin/bash
# Check for SPDX headers in pyproject.toml files
echo "=== Searching for SPDX headers in .toml files ==="
fd -e toml -x grep -l "SPDX-" {} \;
echo ""
echo "=== All pyproject.toml files in repo ==="
fd -n "pyproject.toml" -x cat {} \; -x echo "---FILE: {}---" \;
echo ""
echo "=== Check for LICENSE file ==="
fd -n "LICENSE*" -type f
echo ""
echo "=== Check root for license indicators ==="
ls -la | grep -iE "license|copyright|spdx"
echo ""
echo "=== Check for workspace configuration ==="
fd -n "workspace" -e toml
echo ""
echo "=== Check for pre-commit or CI config mentioning headers ==="
fd -e yaml -e yml -e toml | xargs grep -l "SPDX\|header" 2>/dev/null | head -20

Length of output: 2572
🏁 Script executed:
#!/bin/bash
# Check for SPDX headers in pyproject.toml files
echo "=== Searching for SPDX headers in .toml files ==="
fd -e toml -x grep -l "SPDX-" {} \;
echo ""
echo "=== All pyproject.toml files in repo ==="
fd -n "pyproject.toml"
echo ""
echo "=== Check for LICENSE file ==="
fd -n "LICENSE*" -type f
echo ""
echo "=== Check root for license indicators ==="
ls -la | grep -iE "license|copyright|spdx"
echo ""
echo "=== Check for pre-commit or CI config mentioning headers ==="
fd -e yaml -e yml | xargs grep -l "SPDX\|header\|copyright" 2>/dev/null | head -20

Length of output: 2386
🏁 Script executed:
#!/bin/bash
# Read workspace LICENSE file
echo "=== Workspace LICENSE content (first 50 lines) ==="
head -50 LICENSE
echo ""
echo "=== Example SPDX header from ./pyproject.toml ==="
head -10 ./pyproject.toml
echo ""
echo "=== Example SPDX header from ./benchmarks/pyproject.toml ==="
head -10 ./benchmarks/pyproject.toml
echo ""
echo "=== Check .pre-commit-config or similar ==="
fd -e yaml -e yml | xargs grep -l "reuse\|SPDX\|copyright-header" 2>/dev/null | head -10
echo ""
echo "=== Check for REUSE compliance tooling ==="
fd "LICENSES" -type d
echo ""
echo "=== Grep for Apache-2.0 references in workspace ==="
rg "Apache-2.0" -c --type toml | head -20

Length of output: 4903
🏁 Script executed:
#!/bin/bash
# Check if specific file exists
echo "=== Check if lib/kvbm-kernels/pyproject.toml exists ==="
test -f "lib/kvbm-kernels/pyproject.toml" && echo "EXISTS" || echo "DOES NOT EXIST"
echo ""
echo "=== List files in lib/kvbm-kernels/ if it exists ==="
ls -la lib/kvbm-kernels/ 2>/dev/null || echo "Directory does not exist"
echo ""
echo "=== Search for pre-commit config ==="
find . -name ".pre-commit-config.yaml" -o -name "pre-commit-config.yaml" 2>/dev/null | head -5
echo ""
echo "=== Search for pyproject.toml with tool.ruff or linting config ==="
rg "tool.ruff|tool.black|header" pyproject.toml -A 5 | head -50
echo ""
echo "=== Check GitHub Actions workflow for SPDX/header checks ==="
find . -path "./.git" -prune -o -name "*.yml" -o -name "*.yaml" | xargs grep -l "reuse\|SPDX.*header\|copyright.*check" 2>/dev/null | head -10
echo ""
echo "=== Search for any documentation about licensing requirements ==="
fd -e md | xargs grep -l "SPDX\|Apache-2.0\|license" 2>/dev/null | head -10

Length of output: 2007
🏁 Script executed:
#!/bin/bash
# Check copyright checks workflow
echo "=== GitHub Actions copyright-checks.yml ==="
cat ./.github/workflows/copyright-checks.yml | head -100
echo ""
echo "=== Pre-commit config ==="
cat ./.pre-commit-config.yaml | head -50
echo ""
echo "=== lib/kvbm-kernels/Cargo.toml (for SPDX format reference) ==="
head -10 lib/kvbm-kernels/Cargo.toml
echo ""
echo "=== lib/kvbm-kernels/pyproject.toml (full file, to verify current state) ==="
cat lib/kvbm-kernels/pyproject.toml

Length of output: 3562
🏁 Script executed:
#!/bin/bash
# Check the copyright check script to see what files it validates
echo "=== copyright-check.ps1 (first 150 lines) ==="
head -150 ./.github/workflows/copyright-check.ps1
echo ""
echo "=== Check if script mentions .toml or pyproject ==="
grep -i "toml\|pyproject" ./.github/workflows/copyright-check.ps1

Length of output: 5537
Add missing SPDX header and correct license to match workspace standard.
The copyright-checks workflow enforces SPDX headers on all .toml files as a pipeline blocker. The workspace uses Apache-2.0 (confirmed in LICENSE file), but this file declares "Proprietary" license, conflicting with all other .toml files in the repository.
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 [build-system]
 requires = ["maturin>=1.5,<2"]
 build-backend = "maturin"
@@
 authors = [{ name = "Ryan", email = "[email protected]" }]
-license = { text = "Proprietary" }
+license = { text = "Apache-2.0" }
 classifiers = [
```
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: Invalid/Missing Header: lib/kvbm-kernels/pyproject.toml
🤖 Prompt for AI Agents
In lib/kvbm-kernels/pyproject.toml lines 1-19: this file is missing the SPDX
header and incorrectly declares "Proprietary"; add the standard SPDX header at
the top of the file (with SPDX-FileCopyrightText and SPDX-License-Identifier
entries matching the workspace pattern) and change the license metadata to the
workspace standard Apache-2.0 (update the license field to the Apache-2.0 form
used across other .toml files in the repo and adjust any license-related
metadata if present so it matches the repository convention).
| """ | ||
| PyTorch-based regression tests for the CUDA tensor packing kernels. | ||
| The goal is to mirror how an ML engineer would use the library, so the tests | ||
| act as both verification and documentation. | ||
| """ |
Fix CI: add SPDX header at top of file
Pre-commit/CI flags a missing header. Place SPDX lines before the docstring.
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 """
 PyTorch-based regression tests for the CUDA tensor packing kernels.
```
🧰 Tools
🪛 GitHub Actions: Copyright Checks
[error] 1-1: Invalid/Missing Header: lib/kvbm-kernels/python/tests/test_tensor_kernels.py
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3861/merge) by jthomson04.
[warning] 1-1: pre-commit: 1 file reformatted by Black; 1 file reformatted, 307 files left unchanged. See log for details.
🤖 Prompt for AI Agents
In lib/kvbm-kernels/python/tests/test_tensor_kernels.py around lines 1 to 6, the
file is missing the required SPDX header; add the SPDX license identifier and
copyright lines at the very top of the file before the module docstring (e.g.,
SPDX-License-Identifier and Copyright owner lines used project-wide) so the
header appears as the first lines, then keep the existing docstring.
lib/llm/src/block_manager/v2/physical/transfer/notifications/nixl_status.rs:

```rust
impl CompletionChecker for NixlStatusChecker {
    fn is_complete(&self) -> Result<bool> {
        // get_xfer_status returns XferStatus enum:
        // - XferStatus::Success means transfer is complete
        // - XferStatus::InProgress means still pending
        match self.agent.get_xfer_status(&self.xfer_req) {
            Ok(status) => Ok(status.is_success()),
            Err(e) => Err(anyhow!("NIXL transfer status check failed: {}", e)),
        }
    }
}
```
Do not suppress terminal failures; map “in progress” to Ok(false) and failures to Err.
Current logic returns Ok(false) for any non-success status, hiding failures/cancellations. Surface terminal errors so callers can stop polling and react.
Apply this diff to make status handling explicit (adjust method names to your API if different):
```diff
 impl CompletionChecker for NixlStatusChecker {
     fn is_complete(&self) -> Result<bool> {
-        match self.agent.get_xfer_status(&self.xfer_req) {
-            Ok(status) => Ok(status.is_success()),
-            Err(e) => Err(anyhow!("NIXL transfer status check failed: {}", e)),
-        }
+        match self.agent.get_xfer_status(&self.xfer_req) {
+            Ok(status) => {
+                if status.is_success() {
+                    Ok(true)
+                } else if status.is_in_progress() {
+                    Ok(false)
+                } else {
+                    Err(anyhow!("NIXL transfer failed: {:?}", status))
+                }
+            }
+            Err(e) => Err(anyhow!("NIXL transfer status check failed: {e}")),
+        }
     }
 }
```

Committable suggestion skipped: line range outside the PR's diff.
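For context, the caller-side contract this change enables is a three-way poll loop; a minimal sketch, assuming an async caller and the 1 ms cadence used elsewhere in this module (the helper name is hypothetical):

```rust
// Sketch only: poll until success, keep waiting while in progress,
// and stop immediately once the checker surfaces a terminal failure.
async fn wait_for_completion(checker: &dyn CompletionChecker) -> Result<()> {
    loop {
        match checker.is_complete() {
            Ok(true) => return Ok(()), // transfer finished
            Ok(false) => tokio::time::sleep(std::time::Duration::from_millis(1)).await,
            Err(e) => return Err(e),   // failed/cancelled: no point polling further
        }
    }
}
```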
```rust
pub(crate) fn select_strategy(
    src: &PhysicalLayout,
    dst: &PhysicalLayout,
    ctx: &TransferContext,
) -> anyhow::Result<TransferPlan> {
    let is_src_local = src.nixl_metadata().agent_name() == ctx.nixl_agent().name();
    let is_dst_local = dst.nixl_metadata().agent_name() == ctx.nixl_agent().name();

    if !is_src_local && !is_dst_local {
        return Err(anyhow::anyhow!(
            "Both src and dst are remote - this is not supported."
        ));
    }

    if is_src_local && is_dst_local {
        return Ok(select_direct_strategy(
            src.location(),
            dst.location(),
            false,
            ctx.capabilities(),
        ));
    }

    select_remote_strategy_v2(
        src.location(),
        is_src_local,
        dst.location(),
        is_dst_local,
        ctx.capabilities(),
    )
}
```
Remote transfer policy diverges from documented Conservative Default Policy

select_strategy delegates remote paths to select_remote_strategy_v2, which returns errors for cases the documented policy handles via TwoHop staging:

- Disk transfers (any local-remote disk pair)
- Device ↔ Remote without RDMA enabled

The documented policy (lines 124–140) explicitly stages these through host; select_direct_strategy implements this correctly via TwoHop, but select_remote_strategy_v2 breaks that contract by returning errors instead.

Required fixes:

- Either restore the TwoHop fallback in `select_remote_strategy_v2` for parity with documented behavior (see the sketch after this list), or update docs/tests to reflect the stricter v2 policy and audit all callers of `execute_transfer`.
- Line 258: fix the boolean operator, using `||` instead of `|`: `if matches!(src, StorageKind::Disk(_)) || matches!(dst, StorageKind::Disk(_)) {`
- Add unit tests directly for `select_strategy` to prevent future policy drift (currently only `select_direct_strategy` is tested).
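If parity is the chosen direction, a minimal sketch of the staging fallback, reusing the `allow_gpu_rdma` capability flag quoted above and assuming `TransferPlan::TwoHop` carries a `staging` field (the field shape is an assumption):

```rust
// Hypothetical helper: route host-stageable remote cases to TwoHop instead
// of erroring. Adjust the TwoHop variant fields to the real enum.
fn stage_via_host(
    src: StorageKind,
    dst: StorageKind,
    caps: &TransferCapabilities,
) -> Option<TransferPlan> {
    let disk = matches!(src, StorageKind::Disk(_)) || matches!(dst, StorageKind::Disk(_));
    let device = matches!(src, StorageKind::Device(_)) || matches!(dst, StorageKind::Device(_));
    if disk || (device && !caps.allow_gpu_rdma) {
        // Conservative policy: bounce through pinned host memory.
        return Some(TransferPlan::TwoHop { staging: StorageKind::Pinned });
    }
    None // fall through to the direct remote strategy
}
```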
```rust
let (src_layout, _src_mem) = create_test_layout(4);
let (inter_layout, _inter_mem) = create_test_layout(4);
let (dst_layout, _dst_mem) = create_test_layout(4);
```
Also applies to: 293-296, 319-321
Tests reference undefined create_test_layout; compilation will fail

The function create_test_layout(...) is called repeatedly in this file (lines 259–360) but is not defined anywhere in the codebase. Exhaustive search confirms that while similar layout helpers exist in tests/mod.rs (create_fc_layout, create_lw_layout, create_layout), none are named create_test_layout.

Add a local helper function in testing.rs that creates a test layout and returns the (layout, memory) tuple the tests expect, or import and adapt an existing helper from tests/mod.rs; a sketch of such a helper follows.
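A minimal sketch, assuming the v2 `LayoutConfig` builder fields shown elsewhere in this PR, a `SystemStorage::new(bytes)` allocator, and a `FullyContiguous` constructor taking the config plus backing memory (all three signatures are assumptions to check against the real API):

```rust
// Hypothetical helper for testing.rs; every constructor here is assumed.
// It returns the (layout, backing memory) tuple the call sites destructure.
fn create_test_layout(num_blocks: usize) -> (Arc<dyn Layout>, SystemStorage) {
    let config = LayoutConfig::builder()
        .num_blocks(num_blocks)
        .num_layers(2)
        .outer_dim(2)
        .page_size(16)
        .inner_dim(128)
        .alignment(1)
        .dtype_width_bytes(2)
        .build()
        .unwrap();
    let memory = SystemStorage::new(config.required_bytes()).unwrap();
    let layout: Arc<dyn Layout> = Arc::new(FullyContiguous::new(config, &memory).unwrap());
    (layout, memory)
}
```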
```rust
let source = PhysicalLayout::new_local(src_layout, StorageKind::System);
let intermediate = PhysicalLayout::new_local(inter_layout, StorageKind::Pinned);
let destination = PhysicalLayout::new_local(dst_layout, StorageKind::System);
```
Also applies to: 297-300, 323-326
Calls to PhysicalLayout constructors missing required parameters

The current signatures require:

- `new_local(layout: Arc<dyn Layout>, location: StorageKind, nixl_metadata: NixlMetadata)`: three parameters
- `new_remote()`: currently commented out, unavailable

Issues found:

- Lines 263–265, 297–299, 323–325: `new_local` called with only 2 arguments (missing `nixl_metadata`)
- Lines 343, 360: `new_remote` called but the function is commented out and unavailable

All six `new_local` calls and both `new_remote` calls will fail to compile. The test file must be updated to provide `NixlMetadata` for local layouts, or refactored to avoid remote layouts if `new_remote` remains unavailable; a sketch of the mechanical fix follows.
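A test-local shim keeps the change mechanical; `NixlMetadata::new("test-agent")` is an assumed constructor, so substitute whatever the type actually exposes:

```rust
// Hypothetical test helper; the constructor is assumed. The point is to
// thread the third argument through every new_local call site, e.g.:
//   PhysicalLayout::new_local(src_layout, StorageKind::System, test_nixl_metadata())
fn test_nixl_metadata() -> NixlMetadata {
    NixlMetadata::new("test-agent")
}
```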
```rust
use super::*;
use crate::block_manager::v2::physical::transfer::{BounceBufferSpec, TransferCapabilities, TransferOptions};
use crate::block_manager::v2::physical::transfer::executor::execute_transfer;
use anyhow::Result;
use rstest::rstest;
```
Missing import: Arc.
Arc is used but not imported; tests won’t compile.
```diff
 use super::*;
 use crate::block_manager::v2::physical::transfer::{BounceBufferSpec, TransferCapabilities, TransferOptions};
 use crate::block_manager::v2::physical::transfer::executor::execute_transfer;
 use anyhow::Result;
 use rstest::rstest;
+use std::sync::Arc;
```
Where can readers go to understand the differences and motivations behind the KVBM V2 upgrade from V1?
Signed-off-by: jthomson04 <[email protected]>
Actionable comments posted: 34
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Cargo.toml (1)
1-130: Remove executable bit from Cargo.toml.

Pipeline reports: "Cargo.toml marked executable but has no (or invalid) shebang!" Manifest files should not be executable.
Run this command to fix:
```bash
#!/bin/bash
# Remove executable permission from Cargo.toml
chmod -x Cargo.toml
git add Cargo.toml
```
♻️ Duplicate comments (47)
lib/kvbm-kernels/README.md (3)
46-60: Add language tag to code fence.Fenced code block should specify
textor appropriate language identifier per MD040.Apply this diff:
-``` +```text . ├── Cargo.toml # Rust lib/bin targets + PyO3 feature
131-134: Fix typo: "specificed" → "specified".Correct the spelling error for clarity.
Apply this diff:
-All tensors must be CUDA accessible by the specificed device and match the expected +All tensors must be CUDA accessible by the specified device and match the expected
146-149: Align Docker image name with run.sh.Lines 147-148 use inconsistent image names. Use
kvbm-kernelto match run.sh (line 5).Apply this diff:
# Or build manually -docker build -t kvbm-kernel -docker run --rm --gpus all kvbm-kernels +docker build -t kvbm-kernel . +docker run --rm --gpus all kvbm-kernellib/kvbm-kernels/run.sh (1)
1-15: Fix SPDX header, Bash flags, and arg forwarding (pipeline blocker).Three critical issues remain from previous review:
- Missing SPDX header (pipeline failure)
- Missing safe Bash flags (
-u -o pipefail)"$@"not forwarded due to missing line continuation on line 14Apply this diff:
#!/bin/bash +# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 -set -e +set -euo pipefail IMAGE_NAME="kvbm-kernel" echo "Building Docker image..." docker build -t "$IMAGE_NAME" . echo "" echo "Running container with GPU support..." docker run --rm \ --gpus all \ - "$IMAGE_NAME" - "$@" + "$IMAGE_NAME" \ + "$@"lib/llm/src/block_manager/v2.rs (1)
1-2: Add SPDX header (pipeline blocker).File is missing the required SPDX header used consistently across the codebase.
Apply this diff:
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + pub mod memory; pub mod physical;lib/llm/src/block_manager/v2/physical/transfer/executor/memcpy.rs (3)
6-10: Fix incorrect imports (critical).Imports reference non-existent paths.
PhysicalLayoutis inlayout::, nottransfer::, andTransferCompleteNotificationis intransfer::notifications::notification, nottransfer::context.Apply this diff:
-use crate::block_manager::v2::physical::transfer::PhysicalLayout; -use crate::block_manager::v2::physical::transfer::context::TransferCompleteNotification; -use anyhow::Result; +use crate::block_manager::v2::physical::layout::PhysicalLayout; +use crate::block_manager::v2::physical::transfer::notifications::notification::TransferCompleteNotification; +use anyhow::{bail, ensure, Result};
11-27: Correct documentation and add missing parameter description.Documentation mentions non-existent
block_pairsparameter and lacks safety/error conditions.Apply this diff:
/// Execute a memcpy transfer between host memory locations. /// /// This executor handles transfers between System and Pinned memory using -/// standard CPU memcpy operations. The transfer is synchronous and blocking. +/// standard CPU memcpy operations. Device memory is not supported. +/// The transfer is synchronous and blocking. /// /// # Arguments /// * `src` - Source physical layout /// * `dst` - Destination physical layout -/// * `block_pairs` - Pairs of (src_block_id, dst_block_id) to transfer +/// * `src_block_ids` - Source block indices +/// * `dst_block_ids` - Destination block indices (same length as src_block_ids) /// * `layer_range` - Optional range of layers to transfer (None = all layers) +/// +/// # Errors +/// Returns error if layouts are incompatible, block ID slices have different lengths, +/// layer range is invalid, or if either layout is not host-visible (System/Pinned).
28-80: Add critical safety validations: host-memory guard, bounds checks, length validation.Missing multiple critical checks:
- No host-memory validation (memcpy on device pointers is undefined behavior)
- No length check for
src_block_idsvsdst_block_ids- No bounds validation for
layer_rangeApply this diff:
// Validate layouts have compatible structure let src_layout = src.layout(); let dst_layout = dst.layout(); + // Enforce host-visible memory (System or Pinned) for memcpy path + use crate::block_manager::v2::memory::StorageKind; + let (src_loc, dst_loc) = (src.location(), dst.location()); + let is_host = |k: StorageKind| matches!(k, StorageKind::System | StorageKind::Pinned); + ensure!( + is_host(src_loc) && is_host(dst_loc), + "memcpy executor supports only host-visible memory (System/Pinned): src={src_loc:?}, dst={dst_loc:?}" + ); + if src_layout.num_layers() != dst_layout.num_layers() { - return Err(anyhow::anyhow!( + bail!( "Layouts have incompatible layer counts: src={}, dst={}", src_layout.num_layers(), dst_layout.num_layers() - )); + ); } if src_layout.outer_dim() != dst_layout.outer_dim() { - return Err(anyhow::anyhow!( + bail!( "Layouts have incompatible outer dimensions: src={}, dst={}", src_layout.outer_dim(), dst_layout.outer_dim() - )); + ); } // Determine layer range let layers = layer_range.unwrap_or(0..src_layout.num_layers()); + ensure!( + layers.start < layers.end && layers.end <= src_layout.num_layers(), + "Invalid layer range {:?} for {} layers", + layers, + src_layout.num_layers() + ); // Perform synchronous copies + ensure!( + src_block_ids.len() == dst_block_ids.len(), + "src_block_ids and dst_block_ids must have the same length (got {} vs {})", + src_block_ids.len(), + dst_block_ids.len() + ); for (&src_block_id, &dst_block_id) in src_block_ids.iter().zip(dst_block_ids.iter()) { for layer_id in layers.clone() { for outer_id in 0..src_layout.outer_dim() { // Get source and destination memory regions let src_region = src.memory_region(src_block_id, layer_id, outer_id)?; let dst_region = dst.memory_region(dst_block_id, layer_id, outer_id)?; // Validate sizes match - if src_region.size() != dst_region.size() { - return Err(anyhow::anyhow!( - "Memory region size mismatch at block=({},{}), layer={}, outer={}: src={}, dst={}", - src_block_id, - dst_block_id, - layer_id, - outer_id, - src_region.size(), - dst_region.size() - )); - } + ensure!( + src_region.size() == dst_region.size(), + "Memory region size mismatch at block=({},{}), layer={}, outer={}: src={}, dst={}", + src_block_id, + dst_block_id, + layer_id, + outer_id, + src_region.size(), + dst_region.size() + ); // Perform memcpy unsafe {lib/llm/src/block_manager/v2/physical/transfer/checksum.rs (1)
130-141: Fix UB and check cudaMemcpy return code in Device→Host pathDestination uses as_ptr() cast to mut and ignores rc. Make the Vec mutable, use as_mut_ptr(), and propagate errors.
- StorageKind::Device(_) => { - let system_region: Vec<u8> = vec![0; region.size()]; - unsafe { - cudaMemcpy( - system_region.as_ptr() as *mut std::ffi::c_void, - region.addr() as *const std::ffi::c_void, - region.size(), - cudaMemcpyKind::cudaMemcpyDeviceToHost, - ); - } - hasher.update(system_region.as_slice()); - } + StorageKind::Device(_) => { + let mut system_region: Vec<u8> = vec![0; region.size()]; + let rc = unsafe { + cudaMemcpy( + system_region.as_mut_ptr() as *mut std::ffi::c_void, + region.addr() as *const std::ffi::c_void, + region.size(), + cudaMemcpyKind::cudaMemcpyDeviceToHost, + ) + }; + if rc != 0 { + return Err(anyhow!("cudaMemcpy D→H failed with error code {}", rc)); + } + hasher.update(system_region.as_slice()); + }Also ensure the active CUDA context matches the device backing region.addr() before memcpy. Based on learnings.
lib/kvbm-kernels/pyproject.toml (1)
1-19: Add SPDX header and align license with workspace to fix CI failureMissing SPDX header and “Proprietary” license conflict with repo policy; CI is failing.
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + [build-system] requires = ["maturin>=1.5,<2"] build-backend = "maturin" @@ authors = [{ name = "Ryan", email = "[email protected]" }] -license = { text = "Proprietary" } +license = { text = "Apache-2.0" } classifiers = [Optional: add a Trove classifier for license ("License :: OSI Approved :: Apache Software License").
lib/kvbm-kernels/build.rs (1)
1-29: Three critical issues remain unresolved from previous review.
- Missing SPDX header (pipeline blocker on line 1)
- Compile error on line 18:
cc::Build::flagexpects&str, butformat!(...)returnsString- Missing rebuild trigger: Need
println!("cargo:rerun-if-env-changed=CUDA_ARCHS");Apply this diff:
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + fn main() { println!("cargo:rerun-if-changed=cuda/tensor_kernels.cu"); + println!("cargo:rerun-if-env-changed=CUDA_ARCHS"); let mut build = cc::Build::new(); build @@ -15,7 +18,7 @@ if arch.is_empty() { continue; } - build.flag(format!("-gencode=arch=compute_{arch},code=sm_{arch}")); + build.flag(&format!("-gencode=arch=compute_{arch},code=sm_{arch}")); } } else { // Default to Ampere (SM 80) and Hopper (SM 90) support.lib/llm/src/block_manager/v2/physical/transfer/notifications/nixl_status.rs (1)
23-33: Terminal failures are hidden; distinguish in-progress from failed states.Current logic returns
Ok(false)for any non-success status, which hides failures and cancellations. Callers cannot distinguish between "still pending" and "permanently failed."Apply this diff to surface terminal errors:
impl CompletionChecker for NixlStatusChecker { fn is_complete(&self) -> Result<bool> { - // get_xfer_status returns XferStatus enum: - // - XferStatus::Success means transfer is complete - // - XferStatus::InProgress means still pending match self.agent.get_xfer_status(&self.xfer_req) { - Ok(status) => Ok(status.is_success()), - Err(e) => Err(anyhow!("NIXL transfer status check failed: {}", e)), + Ok(status) => { + if status.is_success() { + Ok(true) + } else if status.is_in_progress() { + Ok(false) + } else { + Err(anyhow!("NIXL transfer failed: {:?}", status)) + } + } + Err(e) => Err(anyhow!("NIXL transfer status check failed: {e}")), } } }lib/bindings/python/Cargo.toml (1)
76-76: Cannot inheritcudarcfrom workspace; this crate uses an empty workspace.This crate is intentionally excluded from the root workspace (lines 4-6), so
workspace = truefails to resolvecudarc. Pipeline confirms: "error inheriting 'cudarc'... 'workspace.dependencies' was not defined."Apply this diff to pin
cudarcdirectly:-cudarc = { workspace = true, optional = true } +cudarc = { version = "0.17.1", features = ["cuda-12020"], optional = true }lib/llm/src/block_manager/v2/physical/layout/integration_tests.rs (2)
80-83: Type mismatch: build a true V2LayoutConfig, not v1.
create_v2_config()currently returns the v1 type by callingcreate_v1_config(), causing a type error. Construct aV2LayoutConfigexplicitly.Apply this diff:
-/// Create v2 layout configuration (equivalent to v1) -fn create_v2_config() -> V2LayoutConfig { - create_v1_config() -} +/// Create v2 layout configuration (equivalent to v1) +fn create_v2_config() -> V2LayoutConfig { + V2LayoutConfig::builder() + .num_blocks(NUM_BLOCKS) + .num_layers(NUM_LAYERS) + .outer_dim(OUTER_DIM) + .page_size(PAGE_SIZE) + .inner_dim(INNER_DIM) + .alignment(1) + .dtype_width_bytes(DTYPE_WIDTH_BYTES) + .build() + .unwrap() +}
107-123: Remove size assertions; v1/v2 size semantics differ by design.Lines 9-12 document that v1 returns
layer_stridewhile v2 returnsouter_stridefor size. These tests should only compare addresses.Remove size assertions here and at lines 178-185, 237-244, 355-357, 395-396:
assert_eq!( v1_region.addr(), v2_region.addr, "Address mismatch at block={}, layer={}, outer={}", block_id, layer_id, outer_id ); - assert_eq!( - v1_region.size(), - v2_region.size, - "Size mismatch at block={}, layer={}, outer={}", - block_id, - layer_id, - outer_id - );lib/llm/src/block_manager/v2/physical/transfer/tests/local_transfers.rs (1)
11-17: Missing import: Arc.
Arcis used on lines 117 and 259 but not imported. Tests won't compile.Apply this diff:
use super::*; use crate::block_manager::v2::physical::transfer::executor::execute_transfer; use crate::block_manager::v2::physical::transfer::{ BounceBufferSpec, TransferCapabilities, TransferOptions, }; use anyhow::Result; use rstest::rstest; +use std::sync::Arc;lib/kvbm-kernels/src/lib.rs (1)
1-1: Add SPDX header to resolve pipeline failure.The copyright check is failing. Add the SPDX header at the top of the file.
Apply this diff:
+// SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + pub mod tensor_kernels;lib/kvbm-kernels/Cargo.toml (2)
1-1: Add SPDX header to resolve copyright check failure.The copyright check is failing. Add SPDX header comments at the top using
#for TOML.Apply this diff:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + [package]
1-1: Clear the executable bit to resolve pipeline failure.The file is marked executable without a shebang, causing the pre-merge validation to fail.
Run:
git update-index --chmod=-x lib/kvbm-kernels/Cargo.tomllib/llm/src/block_manager/v2/physical/layout/validation.rs (5)
54-67: Doc/behavior mismatch on “flexible at inner boundary”.Loop enforces non‑increasing across all dims. Allow one relaxation at the innermost boundary as documented.
Apply:
- let mut prev_stride = usize::MAX; - for (i, ¤t_stride) in stride.iter().enumerate() { - if current_stride > prev_stride { - return Err(anyhow!( - "Tensor strides must be monotonically decreasing (until inner dimension). \ - Got stride: {:?} at position {}", - stride, - i - )); - } - prev_stride = current_stride; - } + for (i, w) in stride.windows(2).enumerate() { + let (a, b) = (w[0], w[1]); + if b > a && i + 2 < stride.len() { + return Err(anyhow!( + "Strides must be non-increasing (except possibly at innermost). {:?} at window {}..{}", + stride, + i, + i + 2 + )); + } + }
69-79: Heuristic can flip format across tensors; detect inconsistency.Track first detected format and error if a later tensor disagrees.
Apply:
- let mut format = TensorFormat::Unknown; + let mut detected: Option<TensorFormat> = None; @@ - if stride[0] < stride[1] { - format = TensorFormat::HND; - } else if stride[0] > stride[1] { - format = TensorFormat::NHD; - } + let current = if stride[0] < stride[1] { + Some(TensorFormat::HND) + } else if stride[0] > stride[1] { + Some(TensorFormat::NHD) + } else { + None + }; + if let Some(cur) = current { + if let Some(prev) = detected { + if prev != cur { + return Err(anyhow!("Inconsistent tensor format across inputs: {:?} vs {:?}", prev, cur)); + } + } else { + detected = Some(cur); + } + } @@ - Ok(format) + Ok(detected.unwrap_or(TensorFormat::Unknown))
69-79: Comment says “shape and stride,” but only stride is used.Either incorporate a shape tie‑breaker heuristic or update the comment.
112-115: Unchecked dimension product can overflow.Use checked_mul and return an error (or panic if that’s acceptable).
Apply:
-pub fn determine_compressed_shape(shape: &[usize]) -> usize { - shape.iter().product() -} +pub fn determine_compressed_shape(shape: &[usize]) -> usize { + shape + .iter() + .try_fold(1usize, |acc, &d| acc.checked_mul(d)) + .expect("compressed shape overflowed usize") +}If panics are undesirable, change signature to Result.
47-52: Enforce stride/shape rank parity before heuristics.Apply:
- if stride.len() < 2 { + if stride.len() != shape.len() { + return Err(anyhow!( + "Stride/shape rank mismatch: stride {:?} ({}D) vs shape {:?} ({}D)", + stride, + stride.len(), + shape, + shape.len() + )); + } + if stride.len() < 2 { return Err(anyhow!( "Tensor must have at least 2 dimensions, got stride: {:?}", stride )); }lib/llm/src/block_manager/v2/physical/transfer/nixl_agent/config.rs (1)
6-8: Docs imply key/value params via underscores, code rejects underscores.Align docs and behavior: either document “boolean toggles only” or implement param parsing (while allowing backend names like GDS_MT).
Suggested doc tweak:
-//! environment variables with the pattern: `DYN_KVBM_NIXL_BACKEND_<backend>_<key>=<value>` +//! environment variables with the pattern: `DYN_KVBM_NIXL_BACKEND_<backend>=<bool>` +//! (custom params not yet supported; backend names are case-insensitive, underscores allowed)Or loosen parsing by splitting only on the first underscore after the backend prefix and treating the rest as the backend name until '='.
lib/kvbm-kernels/python/tests/test_tensor_kernels.py (3)
250-260: Guard CPU-only runs; this test calls .cuda() unconditionally.Apply:
def test_non_cuda_tensor_error(): @@ - device = torch.device("cpu") + if not torch.cuda.is_available(): + pytest.skip("CUDA required for this test") + device = torch.device("cpu")
1-6: Add SPDX header to satisfy CI.Apply:
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 """ PyTorch-based regression tests for the CUDA tensor packing kernels.
13-14: Import module name mismatch — correct import tocuda_tensor_kernels.The PyO3 module defined in lib/kvbm-kernels/src/python.rs (line 807) is named
cuda_tensor_kernels, but the test file importsdynamo_kvbm_kernels. This will cause an ImportError at runtime.Apply:
-import dynamo_kvbm_kernels as ctk +import cuda_tensor_kernels as ctklib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (2)
13-15: Verify import paths for StorageKind and TransferCompleteNotification.Based on past review comments and code structure,
StorageKindshould be imported fromcrate::block_manager::v2::memoryrather thancrate::block_manager::v2::physical::transfer. Similarly, verify thatTransferCompleteNotificationis correctly imported from thenotificationsmodule.Run the following script to verify the correct module locations:
#!/bin/bash # Find where StorageKind and TransferCompleteNotification are defined echo "=== StorageKind definition ===" rg -n "pub enum StorageKind|pub struct StorageKind" lib/llm/src/block_manager/v2/ echo "=== TransferCompleteNotification definition ===" rg -n "pub struct TransferCompleteNotification" lib/llm/src/block_manager/v2/
36-77: Validate layer_range bounds before dispatch.The function should validate that
options.layer_range, if provided, fits within both source and destination layouts'num_layers()to fail fast with a clear error message rather than allowing invalid ranges to propagate to individual executors.Apply this diff:
// Validate block IDs validate_block_transfer(src_block_ids, dst_block_ids, None, src, dst, None)?; + // Validate optional layer range against both layouts + if let Some(ref r) = options.layer_range { + let src_layers = src.layout().num_layers(); + let dst_layers = dst.layout().num_layers(); + if r.start > r.end || r.end > src_layers || r.end > dst_layers { + return Err(anyhow::anyhow!( + "Invalid layer range {:?} for src_layers={}, dst_layers={}", + r, src_layers, dst_layers + )); + } + } + // Select transfer plan based on locations and capabilities let plan = select_strategy(src, dst, ctx)?;lib/llm/src/block_manager/v2/physical/transfer/fill.rs (3)
89-95: Check cudaMemcpy return code.The return value of
cudaMemcpyis currently ignored, which can hide CUDA failures and lead to silent data corruption.Apply this diff to check the return code:
unsafe { - cudaMemcpy( + let rc = cudaMemcpy( region.addr() as *mut std::ffi::c_void, system_region.as_ptr() as *const std::ffi::c_void, region.size(), cudaMemcpyKind::cudaMemcpyHostToDevice, ); + if rc != 0 { + return Err(anyhow!("cudaMemcpy(H2D) failed with code {}", rc as i32)); + } }
98-113: Add alignment validation before O_DIRECT disk writes.With O_DIRECT enabled (as noted in disk.rs), all disk operations must be page-aligned. While the buffer is correctly 4KB-aligned via
avec!,region.addr()(file seek offset at line 109) andregion.size()(write length at line 110) are not validated against the required alignment. This can cause writes to fail with EINVAL at runtime.Add validation that
region.addr()andregion.size()are multiples of the required alignment (e.g., 4096 orconfig.alignment) before performing file operations.
152-158: Bug: fill_layers writes raw host bytes to Device/Disk addresses.
fill_layerscallsfill_memory_regionunconditionally without checkinglayout.location(). ForDevicestorage, this attempts to write from CPU to GPU virtual addresses (UB). ForDisk, it treats file offsets as memory pointers. This function should mirror the location-aware handling used infill_blocks(lines 69-114).Apply this diff to add location-aware handling:
for layer_id in layer_range.clone() { for outer_id in 0..outer_dim { let region = layout.memory_region(block_id, layer_id, outer_id)?; - fill_memory_region(region.addr(), region.size(), block_id, layer_id, pattern)?; + match layout.location() { + StorageKind::System | StorageKind::Pinned => { + fill_memory_region(region.addr(), region.size(), block_id, layer_id, pattern)?; + } + StorageKind::Device(_) => { + let mut system_region = vec![0u8; region.size()]; + fill_memory_region(system_region.as_mut_ptr() as usize, system_region.len(), block_id, layer_id, pattern)?; + unsafe { + let rc = cudaMemcpy( + region.addr() as *mut std::ffi::c_void, + system_region.as_ptr() as *const std::ffi::c_void, + region.size(), + cudaMemcpyKind::cudaMemcpyHostToDevice, + ); + if rc != 0 { + return Err(anyhow!("cudaMemcpy(H2D) failed with code {}", rc as i32)); + } + } + } + StorageKind::Disk(fd) => { + let mut system_region: AVec<u8, _> = avec![[4096]| 0; region.size()]; + fill_memory_region(system_region.as_ptr() as usize, system_region.len(), block_id, layer_id, pattern)?; + let mut file = ManuallyDrop::new(unsafe { File::from_raw_fd(fd as i32) }); + file.seek(std::io::SeekFrom::Start(region.addr() as u64))?; + file.write_all(&system_region)?; + file.sync_all()?; + } + } } }lib/llm/src/block_manager/v2/physical/transfer/notifications/nixl_events.rs (1)
156-187: Shutdown loop can spin forever if notifications are lost.After the channel closes, the loop continues polling outstanding transfers until all complete. If a NIXL notification is lost or a transfer is abandoned, this task will never exit. Consider adding a maximum wait timeout per transfer (e.g., 5 minutes), sending
Erron expiry, and logging the failure.lib/llm/src/block_manager/v2/memory/system.rs (1)
112-124: Prevent overflow in bounds check.The expression
offset + sizeat line 114 can overflow for large inputs, wrapping around and bypassing the bounds check. Usechecked_addorsaturating_addto prevent this.Apply this diff:
fn memset(&mut self, value: u8, offset: usize, size: usize) -> Result<()> { - if offset + size > self.len { + if offset.checked_add(size).map_or(true, |end| end > self.len) { return Err(StorageError::OperationFailed( "memset: offset + size > storage size".into(), )); } unsafe { let ptr = self.ptr.as_ptr().add(offset); std::ptr::write_bytes(ptr, value, size); } Ok(()) }lib/llm/src/block_manager/v2/physical/transfer/testing.rs (3)
259-266:create_test_layoutis undefined; tests won’t compileTwo fixes:
- Import an existing helper (e.g., from
tests/mod.rs:create_layout/create_fc_layout) and adapt the return type.- Or define a local helper that builds a small
LayoutConfig+ backing memory and returns aPhysicalLayout-compatibleArc<dyn Layout>.I can generate a local helper once you confirm the expected return tuple shape.
263-266: Constructor arity mismatch:PhysicalLayout::new_localneedsNixlMetadataAdd the third parameter. If
NixlMetadata: Default, passNixlMetadata::default().- let source = PhysicalLayout::new_local(src_layout, StorageKind::System); + let source = PhysicalLayout::new_local(src_layout, StorageKind::System, NixlMetadata::default());Repeat for
intermediateanddestination.If
Defaultisn’t implemented, point me to the constructor and I’ll wire it up.Also applies to: 297-300, 323-325
342-344:new_remotelikely unavailable; tests should not depend on itThese tests assert error paths but will fail to compile if
new_remoteis disabled/commented out.Options:
- Gate these tests with
#[cfg(feature = "remote-layouts")].- Or refactor to simulate “remote” via a local layout and assert
RoundTripTest::newrejects it based onis_remote()toggled by test-only API.I can draft a gated version with feature flags.
Also applies to: 357-360
lib/kvbm-kernels/src/tensor_kernels.rs (5)
1-1: Add SPDX header to satisfy CI+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 + //! Safe-ish wrappers around the CUDA block/universal packing kernels.
87-101: FFI constness mismatch:operational_ptrs_deviceshould be*const *mut c_voidC prototype uses
void* const*(pointer-to-const-pointer-to mutable). Update extern.fn launch_operational_copy( block_ptrs_host: *const *const c_void, block_ptrs_device: *const *const c_void, operational_ptrs_host: *const *mut c_void, - operational_ptrs_device: *const *const c_void, + operational_ptrs_device: *const *mut c_void, num_blocks: usize,
182-196: Propagate FFI fix into wrapper signaturepub unsafe fn operational_copy( block_ptrs_host: *const *const c_void, block_ptrs_device: *const *const c_void, operational_ptrs_host: *const *mut c_void, - operational_ptrs_device: *const *const c_void, + operational_ptrs_device: *const *mut c_void,
427-439: Adjust test cast to*const *mut c_void- let operational_ptrs_device_ptr = - operational_ptrs_device_raw as usize as *const *const c_void; + let operational_ptrs_device_ptr = + operational_ptrs_device_raw as usize as *const *mut c_void;
481-487: Adjust the remaining casts to match corrected FFI- let operational_ptrs_device_const = - operational_ptrs_device_raw as usize as *const *const c_void; + let operational_ptrs_device_const = + operational_ptrs_device_raw as usize as *const *mut c_void; @@ - block_ptrs_device_raw as usize as *const *const c_void, + block_ptrs_device_raw as usize as *const *const c_void,Note: only the operational table changes to
*const *mut c_void; block table stays*const *const c_void.Also applies to: 488-503
lib/llm/src/block_manager/v2/physical/transfer/strategy.rs (2)
81-111: Remote policy diverges from documented “Conservative Default Policy”; add tests for select_strategyselect_strategy routes mixed-locality flows to select_remote_strategy_v2, which returns errors for Device↔Remote without RDMA and any Disk involvement. The docs (Lines 124–141) stage via TwoHop instead. Either restore TwoHop fallback in v2 for parity, or update docs/tests and harden all execute callers to handle errors. Also add unit tests covering select_strategy directly to prevent drift.
Run to locate affected call sites and missing tests:
#!/bin/bash rg -nP '\bselect_strategy\s*\(' -C2 rg -nP '\bselect_remote_strategy_v2\b' -C2 fd -e rs lib/llm/src | rg -n "select_strategy|select_remote_strategy_v2"
258-270: Use short‑circuit OR and keep capability checks consistent
- Use || instead of | to avoid needless RHS eval and match style elsewhere (Line 265 uses ||).
- Prefer capabilities.allows_device_remote_direct() for consistency with select_remote_strategy.
Patch:
- if matches!(src, StorageKind::Disk(_)) | matches!(dst, StorageKind::Disk(_)) { + if matches!(src, StorageKind::Disk(_)) || matches!(dst, StorageKind::Disk(_)) { return Err(anyhow::anyhow!( "Neither local nor remote disk transfers are supported over NIXL at this time." )); } - if !capabilities.allow_gpu_rdma + if !capabilities.allows_device_remote_direct() && (matches!(src, StorageKind::Device(_)) || matches!(dst, StorageKind::Device(_))) {lib/kvbm-kernels/src/python.rs (2)
1-1: Fix CI: add SPDX header above crate attribute to avoid shebang false-positivePre-commit treats the first line as a shebang; add SPDX header comments first. This also satisfies the copyright check.
Apply:
+// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +// SPDX-License-Identifier: Apache-2.0 #![allow(unsafe_op_in_unsafe_fn)]
224-229: Create/bind CUDA context only after empty-input early-returnEmpty batches should be a no-op without initializing CUDA. Move context creation/bind/stream after computing items and the early return in all four functions.
Suggested patches (repeat pattern per function):
@@ unsafe fn block_to_universal(... ) -> PyResult<()> { - let ctx = get_context()?; - ctx.bind_to_thread() - .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; - let stream = ctx.default_stream(); - let layout_enum = parse_layout(layout)?; + let layout_enum = parse_layout(layout)?; @@ - if universal_items.is_empty() { + if universal_items.is_empty() { return Ok(()); } + let ctx = get_context()?; + ctx.bind_to_thread() + .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; + let stream = ctx.default_stream();@@ unsafe fn universal_to_block(... ) -> PyResult<()> { - let ctx = get_context()?; - ctx.bind_to_thread() - .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; - let stream = ctx.default_stream(); - let layout_enum = parse_layout(layout)?; + let layout_enum = parse_layout(layout)?; @@ - if universal_items.is_empty() { + if universal_items.is_empty() { return Ok(()); } + let ctx = get_context()?; + ctx.bind_to_thread() + .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; + let stream = ctx.default_stream();@@ unsafe fn block_to_operational(... ) -> PyResult<()> { - let ctx = get_context()?; - ctx.bind_to_thread() - .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; - let stream = ctx.default_stream(); - - let backend = parse_backend(backend)?; + let backend = parse_backend(backend)?; @@ - if operational_items.is_empty() { + if operational_items.is_empty() { return Ok(()); } + let ctx = get_context()?; + ctx.bind_to_thread() + .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; + let stream = ctx.default_stream();@@ unsafe fn operational_to_block(... ) -> PyResult<()> { - let ctx = get_context()?; - ctx.bind_to_thread() - .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; - let stream = ctx.default_stream(); - - let backend = parse_backend(backend)?; + let backend = parse_backend(backend)?; @@ - if operational_items.is_empty() { + if operational_items.is_empty() { return Ok(()); } + let ctx = get_context()?; + ctx.bind_to_thread() + .map_err(|e| PyRuntimeError::new_err(format!("Failed to bind context: {:?}", e)))?; + let stream = ctx.default_stream();Also applies to: 367-372, 513-517, 666-670
🧹 Nitpick comments (30)
lib/kvbm-kernels/README.md (2)
29-35: Add blank line before table for markdown linting.Markdown tables should be surrounded by blank lines per MD058.
Apply this diff:
### Layout Cheat Sheet + | Term | Logical Shape | Stored As | Notes |
158-164: Add blank line after table for markdown linting.Per MD058, tables should be surrounded by blank lines.
Apply this diff:
| Kernels take unexpected time | Verify that `CUDA_ARCHS` matches your GPU to avoid JIT at runtime | + - `backend="auto"` defaults to the fused kernel, then `cudaMemcpyBatchAsync`, then `cudaMemcpyAsync`. Override if you want to benchmark a specific path.lib/llm/src/block_manager/v2/physical/transfer/checksum.rs (1)
142-149: Avoid mutating shared fd offset; use pread-style read_atseek + read_exact changes the file offset of a shared fd, risking races. Use FileExt::read_at loop to read the region without moving the offset.
- let mut file = ManuallyDrop::new(unsafe { File::from_raw_fd(fd as i32) }); - file.seek(std::io::SeekFrom::Start(region.addr() as u64))?; - file.read_exact(&mut system_region)?; + use std::os::unix::fs::FileExt; + let file = ManuallyDrop::new(unsafe { File::from_raw_fd(fd as i32) }); + let mut off = 0usize; + while off < system_region.len() { + let n = file.read_at(&mut system_region[off..], region.addr() as u64 + off as u64)?; + if n == 0 { + return Err(anyhow!("short read from disk-backed region")); + } + off += n; + }If disk accesses are concurrent elsewhere, this prevents offset interference. Please confirm the fd is shared across threads; if not, this change is still safer.
lib/llm/src/block_manager/v2/physical/transfer/capabilities.rs (1)
142-147: Probe cost and side effects; document and guard usage

with_gds_if_supported performs a live transfer and caches the result once. Document that it may allocate GPU/Disk resources, avoid calling it on latency-sensitive paths, and consider an env override to skip probing; a sketch of such an override follows.
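One possible shape for the override; the variable name is hypothetical, and `parse_bool` is the runtime helper this PR adds:

```rust
// Sketch: short-circuit the live GDS probe when explicitly disabled.
// DYN_KVBM_SKIP_GDS_PROBE is a hypothetical variable name.
pub fn with_gds_if_supported(mut self) -> Self {
    if let Ok(v) = std::env::var("DYN_KVBM_SKIP_GDS_PROBE") {
        if parse_bool(&v).unwrap_or(false) {
            self.allow_gds = false; // assumed field; skip the probe entirely
            return self;
        }
    }
    self.probe_gds() // assumed name for the existing cached live-transfer probe
}
```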
lib/runtime/src/config.rs (1)
400-411: Use local Result alias for consistencyPrefer the module’s Result alias over anyhow::Result for a uniform public surface.
-pub fn parse_bool(val: &str) -> anyhow::Result<bool> { +pub fn parse_bool(val: &str) -> Result<bool> {Optional: add unit tests for parse_bool(true/false/invalid) alongside is_truthy/is_falsey tests.
lib/llm/src/block_manager/v2/physical/transfer/tests/cuda_kernel_tests.rs (1)
12-56: Consider reusing existing test helpers.The helper functions
create_fc_layoutandcreate_lw_layoutduplicate logic fromlib/llm/src/block_manager/v2/physical/transfer/tests/mod.rs(lines 114-149). Consider importing and reusing those helpers instead to reduce duplication and ensure consistency.lib/llm/src/block_manager/v2/physical/manager/handle.rs (1)
43-46: Guard reserved bits; add TryFrom/u128 and constants for clarity.Currently from_u128 accepts nonzero reserved bits silently. Add explicit masks/shift constants and a TryFrom that errors when bits 80..127 are set; keep from_u128 for trusted inputs.
Additions outside the shown lines:
impl LayoutHandle { pub const WORKER_MASK: u128 = 0xFFFF_FFFF_FFFF_FFFF; pub const LAYOUT_MASK: u128 = 0xFFFF; pub const LAYOUT_SHIFT: u32 = 64; pub const RESERVED_MASK: u128 = !((Self::WORKER_MASK) | ((Self::LAYOUT_MASK as u128) << Self::LAYOUT_SHIFT)); } impl std::convert::TryFrom<u128> for LayoutHandle { type Error = &'static str; fn try_from(value: u128) -> Result<Self, Self::Error> { if value & LayoutHandle::RESERVED_MASK != 0 { return Err("reserved bits (80..127) must be zero"); } Ok(LayoutHandle(value)) } }Optionally mark the newtype as repr(transparent) for FFI:
#[repr(transparent)] pub struct LayoutHandle(u128);lib/llm/src/block_manager/v2/physical/layout/tests.rs (1)
146-151: Avoid hard-coding version “1” in JSON assertions.Use CURRENT_VERSION to prevent future breakage.
Apply:
- let json = serialized.to_json().expect("failed to serialize to JSON"); - assert!(json.contains("\"version\":1")); + let json = serialized.to_json().expect("failed to serialize to JSON"); + let parsed = LayoutDescriptor::from_json(&json).expect("json roundtrip"); + assert_eq!(parsed.version, LayoutDescriptor::CURRENT_VERSION); assert!(json.contains("\"num_blocks\":10"));lib/llm/src/block_manager/v2/physical/manager/remote.rs (1)
89-102: Use CURRENT_VERSION in tests; drop allow(dead_code) if no longer needed.Prevents future drift; remove #[allow(dead_code)] once used in manager.
Apply:
- LayoutDescriptor { - version: 1, + LayoutDescriptor { + version: LayoutDescriptor::CURRENT_VERSION,lib/llm/src/block_manager/v2/physical/transfer/options.rs (1)
23-48: Solid options surface; consider Debug derive.Add Debug for easier logging/inspection during transfers.
Apply:
-#[derive(Clone, Default, Builder)] +#[derive(Debug, Clone, Default, Builder)] pub struct TransferOptions {lib/kvbm-kernels/python/tests/test_tensor_kernels.py (1)
102-118: Use zip(strict=True) to catch length mismatches early.Tightens tests and silences linters.
Apply:
- for produced, expected in zip(outputs, universals): + for produced, expected in zip(outputs, universals, strict=True): @@ - for produced_set, expected_set in zip(blocks, expected_blocks): - for produced, expected in zip(produced_set, expected_set): + for produced_set, expected_set in zip(blocks, expected_blocks, strict=True): + for produced, expected in zip(produced_set, expected_set, strict=True): @@ - for produced_set, expected_set in zip(blocks, reference_blocks): - for produced, expected in zip(produced_set, expected_set): + for produced_set, expected_set in zip(blocks, reference_blocks, strict=True): + for produced, expected in zip(produced_set, expected_set, strict=True):Also applies to: 158-176
lib/llm/src/block_manager/v2/physical/transfer/fill.rs (1)
80-87: Declaresystem_regionas mutable.The vector
system_regionis passed tofill_memory_region, which writes to the memory. Although the current code works becausefill_memory_regiontakes ausizeaddress, declaring the vector asmutmakes the intent clearer and is more idiomatic.Apply this diff:
- let system_region: Vec<u8> = vec![0; region.size()]; + let mut system_region: Vec<u8> = vec![0; region.size()];lib/llm/src/block_manager/v2/physical/transfer/notifications/nixl_events.rs (1)
64-64: Consider adjusting the polling interval.

The 1 ms interval may be appropriate for low-latency requirements, but it will consume significant CPU when transfers are outstanding. Consider making it configurable or increasing it slightly (e.g., 10 ms) if latency requirements permit; a sketch follows.
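A sketch of the tunable cadence, keeping 1 ms as the default; the env variable name is hypothetical:

```rust
// Sketch: read the poll interval once instead of hard-coding 1 ms.
// DYN_KVBM_NIXL_POLL_US is a hypothetical variable name.
fn poll_interval() -> Duration {
    std::env::var("DYN_KVBM_NIXL_POLL_US")
        .ok()
        .and_then(|v| v.parse::<u64>().ok())
        .map(Duration::from_micros)
        .unwrap_or_else(|| Duration::from_millis(1))
}

let mut ticker = tokio::time::interval(poll_interval());
```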
lib/llm/src/block_manager/v2/physical/transfer/executor/mod.rs (1)
277-303: Naming confusion: TransferNotification vs TransferCompleteNotification.The module defines both
TransferNotification(line 277) and usesTransferCompleteNotification(imported at line 14). These similar names may cause confusion. Consider renamingTransferNotificationto something more distinct or consolidating if they serve the same purpose.lib/llm/src/block_manager/v2/physical/transfer/notifications/mod.rs (1)
81-81: Consider adjusting the polling interval.The 1ms interval will consume significant CPU resources when transfers are outstanding. Consider making this configurable or increasing it (e.g., 10ms) if latency requirements permit, or document why such aggressive polling is necessary.
lib/llm/src/block_manager/v2/physical/layout/fully_contiguous.rs (2)
162-165: UseLayoutConfig::required_bytes()for the single allocation sizeAvoid recomputing and potential overflow.
- vec![self.block_stride * self.config.num_blocks] + vec![self.config.required_bytes()]
137-140: Consider exposing a mut slice of owned regions for registration symmetry
memory_arc_mut() diverges from LayerSeparateLayout::memory_regions_mut(). Prefer a consistent API that returns &mut [OwnedMemoryRegion] for NIXL registration; a sketch follows.
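A symmetric accessor could look like this, assuming FullyContiguous keeps its single allocation in one OwnedMemoryRegion field:

```rust
// Sketch: mirror LayerSeparateLayout so NIXL registration can iterate
// &mut [OwnedMemoryRegion] uniformly across both layout types.
pub fn memory_regions_mut(&mut self) -> &mut [OwnedMemoryRegion] {
    std::slice::from_mut(&mut self.region) // assumed single-region field
}
```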
61-73: Propagatemkostemperrors precisely; consider usingO_CLOEXECas wellUse
O_CLOEXECto avoid fd leaks across exec; current code omits it.- nix::libc::mkostemp( + nix::libc::mkostemp( template_bytes.as_mut_ptr() as *mut c_char, - nix::libc::O_RDWR | nix::libc::O_DIRECT, + nix::libc::O_RDWR | nix::libc::O_DIRECT | nix::libc::O_CLOEXEC, )
85-103: Opening specific path: addO_CLOEXEC; mapopenerror vianixorio::ErrorMinor robustness improvement.
- nix::libc::open( + nix::libc::open( path_cstr.as_ptr(), - nix::libc::O_CREAT | nix::libc::O_RDWR | nix::libc::O_DIRECT, + nix::libc::O_CREAT | nix::libc::O_RDWR | nix::libc::O_DIRECT | nix::libc::O_CLOEXEC, 0o644, )lib/llm/src/block_manager/v2/memory/pinned.rs (1)
38-66: Safer allocation preconditions and docs
- Docs mention
device_idbut ctor doesn’t accept one; either add a parameter or adjust docs.- Replace
assert!s withdebug_assert!s or fail withStorageErrorto avoid aborts in release.- /// * `device_id` - CUDA device to associate with the allocation + /// Associated with device 0 (update if multi-device association is needed). @@ - assert!(!ptr.is_null(), "Failed to allocate pinned memory"); - assert!(ptr.is_aligned(), "Pinned memory is not aligned"); - assert!(len < isize::MAX as usize); + if ptr.is_null() { + return Err(StorageError::AllocationFailed("cudaHostAlloc returned null".into())); + } + debug_assert!(ptr.is_aligned(), "Pinned memory is not aligned"); + debug_assert!(len < isize::MAX as usize);lib/llm/src/block_manager/v2/physical/manager/metadata.rs (1)
51-59: Add explicit versioning to the wire type.
RdmaLayoutDescriptorslacks a version field. Introduceversion: u32now to allow evolution without breaking decode.Apply this diff:
@@ #[derive(Debug, Encode, Decode)] pub struct RdmaLayoutDescriptors { + /// Serialization format version + pub version: u32, /// Worker identification pub worker_address: WorkerAddress, @@ - let inner = RdmaLayoutDescriptors { + let inner = RdmaLayoutDescriptors { + version: 1, worker_address, nixl_metadata, layouts, };Update tests to set/verify
version.lib/llm/src/block_manager/v2/physical/transfer/strategy.rs (1)
124-141: Add unit tests for select_strategy (not only select_direct_strategy)Current tests cover select_direct_strategy. Add select_strategy tests for:
- Local Host ↔ Remote (direct)
- Local Device → Remote without RDMA (TwoHop)
- Remote → Local Device without RDMA (TwoHop)
- Disk ↔ Remote (TwoHop)
This will catch policy drift regressions flagged above.Example scaffolding (place in existing tests module):
#[test] fn select_strategy_device_remote_without_rdma_twohop() { let caps = TransferCapabilities::default(); // RDMA off let ctx = TransferContext::for_tests().with_capabilities(caps); let src = local_device_layout(); let dst = remote_host_layout(); let plan = select_strategy(&src, &dst, &ctx).unwrap(); assert!(matches!(plan, TransferPlan::TwoHop{..})); }Also applies to: 283-506
lib/llm/src/block_manager/v2/physical/transfer/validation.rs (2)
183-191: if-let chain requires Rust 1.70+; consider a simple nested if for broader MSRV. The pattern if let Some(bounce_ids) = bounce_block_ids && bounce_ids.len() != src_block_ids.len() needs let_chains. If MSRV < 1.70, rewrite to a nested if. Alternative:

if let Some(bounce_ids) = bounce_block_ids {
    if bounce_ids.len() != src_block_ids.len() {
        return Err(BlockValidationError::LengthMismatch {
            src_len: src_block_ids.len(),
            dst_len: dst_block_ids.len(),
            bounce_len: Some(bounce_ids.len()),
        });
    }
}
166-213: Redundant cfg block inside a debug-only function. The inner #[cfg(debug_assertions)] block is inside a function already gated by the same cfg; you can drop the inner cfg for clarity.
lib/llm/src/block_manager/v2/memory/registered.rs (1)
166-195: Propagate registration error details while preserving storage. register_with_nixl discards the underlying error. Consider returning a typed error that includes the original error and the storage for recovery.
Sketch:

pub enum RegisterError<S> {
    Nixl { storage: S, source: nixl_sys::Error },
}

pub fn register_with_nixl<S>(
    storage: S,
    agent: &NixlAgent,
    opt: Option<&OptArgs>,
) -> Result<NixlRegistered<S>, RegisterError<S>>
where
    S: MemoryRegion + NixlCompatible,
{
    // ...
    match agent.register_memory(&descriptor, opt) {
        Ok(handle) => Ok(NixlRegistered {
            storage,
            handle: Some(handle),
            agent_name: agent.name().to_string(),
        }),
        Err(e) => Err(RegisterError::Nixl { storage, source: e }),
    }
}
197-206: Optional: call validation before building descriptorsInvoke validate_block_transfer to catch length/uniqueness/range issues early.
Example:
use super::validation::validate_block_transfer; validate_block_transfer( src_block_ids, dst_block_ids, None, src, dst, None, )?;Also applies to: 261-287
lib/kvbm-kernels/cuda/tensor_kernels.cu (2)
201-203: Remove the unnecessary const_cast in the device kernel. block_chunks already has type T* const*; the const_cast adds no value and can be dropped:

- T* const* block_base = const_cast<T* const*>(block_chunks + block_idx * block_stride);
+ T* const* block_base = block_chunks + block_idx * block_stride;

Keeps intent clear and avoids superfluous casts.

103-115: Grid sizing cap is safe but conservative. Clamping gridDim.x to 65,535 is portable but can increase per-thread stride on large transfers. Consider 2D grid decomposition for very large totals to improve occupancy on newer architectures.
224-231:numa_awareis accepted but unused. Either plumb it or drop it.
allocate_pinned(...)exposesnuma_aware, butallocate_pinned_entryignores it. This is misleading API.Options:
- Wire
numa_awareintoPinnedStorage::new(...)or an allocator option.- Or remove the parameter and simplify the API.
I can provide a follow-up diff once the intended behavior is confirmed.
Also applies to: 406-410
lib/llm/src/block_manager/v2/physical/transfer/tests/mod.rs (1)
391-416: Cache guard: consider std::sync::Mutex unless parking_lot is required. For tests, std::sync::Mutex is sufficient and avoids an extra dependency. If you keep parking_lot, ensure it is listed in [dev-dependencies].
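A minimal sketch of the std-only guard, assuming the cache is keyed the same way it is today:

use std::sync::{Mutex, OnceLock};

// Process-wide guard serializing tests that touch the shared cache.
static CACHE_GUARD: OnceLock<Mutex<()>> = OnceLock::new();

fn cache_guard() -> std::sync::MutexGuard<'static, ()> {
    // Recover from a poisoned lock so one failed test doesn't wedge the rest.
    CACHE_GUARD
        .get_or_init(|| Mutex::new(()))
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

// In each test: let _guard = cache_guard();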
if (!block_ptrs_host || !operational_ptrs_host || !operational_ptrs_device) {
    return cudaErrorInvalidValue;
}
Fix backend-specific validation and Auto fallback; current logic can silently no-op.
- Lines 547-550: Validation incorrectly requires operational_ptrs_device even for memcpy backends, causing cudaErrorInvalidValue when memcpy would suffice.
- Lines 570-574: Kernel path returns cudaSuccess when device pointer tables are missing, preventing Auto from falling back to memcpy.
Apply these minimal fixes:
@@
- if (!block_ptrs_host || !operational_ptrs_host || !operational_ptrs_device) {
- return cudaErrorInvalidValue;
- }
+ // Validate only what every path needs; backend-specific checks happen later.
+ if (!block_ptrs_host || !operational_ptrs_host) {
+ return cudaErrorInvalidValue;
+ }
@@
- if (!block_ptrs_device) {
- return cudaSuccess;
- }
+ // Both device-side pointer tables must be present for the kernel backend.
+ if (!block_ptrs_device || !operational_ptrs_device) {
+ return cudaErrorInvalidValue;
+ }

With this, Auto will properly try kernel first and then fall back to batch/async memcpy when device tables are unavailable.
Also applies to: 570-574, 668-677
🤖 Prompt for AI Agents
In lib/kvbm-kernels/cuda/tensor_kernels.cu around lines 547-550, 570-574 and
668-677: the validation currently requires operational_ptrs_device even when the
memcpy backend would be used and the kernel path returns cudaSuccess when device
pointer tables are missing, preventing Auto from falling back to memcpy. Fix by
making the initial validation accept nullptr operational_ptrs_device for
memcpy/backends that don't need device tables (i.e., only require
block_ptrs_host and operational_ptrs_host for memcpy paths), and in the kernel
execution path change the early-success return to return a non-success CUDA
error (e.g., cudaErrorInvalidValue) when required device pointer tables are
absent so Auto can detect failure and fall back to batch/async memcpy; apply
equivalent condition/return fixes for the checks at 668-677 as well.
if !file_path.exists() {
    std::fs::create_dir_all(file_path.parent().unwrap()).unwrap();
}
Avoid unwrap() on parent(); handle rootless paths and map errors
- if !file_path.exists() {
-     std::fs::create_dir_all(file_path.parent().unwrap()).unwrap();
- }
+ if !file_path.exists() {
+     if let Some(parent) = file_path.parent() {
+         std::fs::create_dir_all(parent).map_err(|e| {
+             StorageError::AllocationFailed(format!("create_dir_all failed: {e}"))
+         })?;
+     }
+ }
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/disk.rs around lines 47 to 49, the code
calls file_path.parent().unwrap() and then unwraps the result of create_dir_all;
instead handle the case where parent() is None (rootless or path without a
parent) and propagate or map filesystem errors rather than panicking. Change to
check if let Some(parent) = file_path.parent() and only call
std::fs::create_dir_all(parent), returning or mapping any io::Error from
create_dir_all into the function's Result (or logging and returning an
appropriate error) so there are no unwrap() calls and rootless paths are handled
gracefully.
impl Drop for DiskStorage {
    fn drop(&mut self) {
        let _ = self.unlink();
    }
}
File descriptor leak: Drop unlinks but never closes the fd
fd is never closed; this leaks descriptors under load. Close before unlinking.
impl Drop for DiskStorage {
fn drop(&mut self) {
- let _ = self.unlink();
+ // Close fd first, then best-effort unlink.
+ #[cfg(unix)]
+ unsafe {
+ let _ = nix::libc::close(self.fd as i32);
+ }
+ let _ = self.unlink();
}
}

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/disk.rs around lines 146 to 150, the Drop
impl currently calls unlink() but never closes the underlying file descriptor,
leaking fds under load; modify Drop to first close the file descriptor (e.g.,
take ownership of the fd/File and call close()/drop it or call the existing
close method) and only after successful (or attempted) close call unlink();
ensure you take the fd out (Option::take or similar) to avoid double-close and
handle/ignore errors from close/unlink consistently (log or ignore) so resources
are released deterministically.
#[cfg(not(unix))]
{
    // On non-Unix systems, we can't get the file descriptor easily
    // Return device_id as 0 - registration will fail on these systems
    (
        self.mmap.as_ptr(),
        self.mmap.len(),
        nixl_sys::MemType::File,
        0,
    )
}
Non-Unix branch references self.mmap, which doesn’t exist here
This won’t compile for non-Unix targets. Return a null pointer and fd 0.
- #[cfg(not(unix))]
- {
- // On non-Unix systems, we can't get the file descriptor easily
- // Return device_id as 0 - registration will fail on these systems
- (
- self.mmap.as_ptr(),
- self.mmap.len(),
- nixl_sys::MemType::File,
- 0,
- )
- }
+ #[cfg(not(unix))]
+ {
+ (std::ptr::null(), self.size, nixl_sys::MemType::File, 0)
+ }🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/disk.rs around lines 184 to 195, the
non-Unix cfg branch incorrectly references self.mmap (which doesn't exist)
causing compile errors on non-Unix targets; change the returned tuple to use a
null pointer (std::ptr::null()) for the pointer, the length to 0 (or appropriate
zeroed usize), the MemType to nixl_sys::MemType::File, and device id/file
descriptor to 0 so the branch compiles and signals registration will fail on
non-Unix systems.
impl actions::Slice for MemoryDescriptor {
    fn as_slice(&self) -> Result<&[u8]> {
        Ok(unsafe { std::slice::from_raw_parts(self.addr as *const u8, self.size) })
    }
🧩 Analysis chain
Unsound: returning a slice to arbitrary raw address is UB-prone
MemoryDescriptor::as_slice builds &[u8] from an arbitrary (addr, size) without provenance/lifetime guarantees. This is undefined behavior if memory isn’t valid, mapped, and immutable for the borrow. It also fabricates a borrow tied to &self.
Prefer one of:
- Remove this impl entirely; provide an explicit unsafe fn as_slice_unchecked(&self) -> &'static [u8] on MemoryDescriptor for tightly-scoped internal use only.
- Or change the trait usage to operate on concrete MemoryRegion types that can validate provenance (e.g., SystemStorage, PinnedStorage) and implement Slice there.
- Or return a boxed copy (Vec<u8>) via a safe read path when possible.
If you want, I can draft a patch once we confirm where actions::Slice on MemoryDescriptor is used.
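A minimal sketch of the first option, with the caller's obligations spelled out; as_slice_unchecked is a hypothetical name, not an existing method:

impl MemoryDescriptor {
    /// # Safety
    /// The caller must guarantee that `addr..addr + size` is a single live,
    /// initialized, readable allocation that is never mutated while the
    /// returned slice is in use, and that it outlives the 'static borrow.
    pub unsafe fn as_slice_unchecked(&self) -> &'static [u8] {
        std::slice::from_raw_parts(self.addr as *const u8, self.size)
    }
}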
Script to find call sites:
🏁 Script executed:
#!/bin/bash
rg -nP --type=rust 'MemoryDescriptor\s*[:>{]' -C2
rg -nP --type=rust 'as_slice\s*\(\)' -C2

Length of output: 47283
Based on my analysis of the script output and the Rust documentation I reviewed:
The std::slice::from_raw_parts safety contract requires that data must be valid for reads and properly aligned, and the entire memory range must be contained within a single allocated object. Additionally, there is no guarantee that the given pointer is valid for the specified length elements, nor whether the inferred lifetime is suitable for the returned slice.
The review comment correctly identifies that MemoryDescriptor::as_slice() violates these safety requirements by:
- Creating a slice from arbitrary (addr, size) without validation that the memory is actually allocated or valid
- Tying the slice lifetime to &self, which has no relationship to the actual memory's lifetime
- Not ensuring the memory belongs to a single allocation or is properly aligned
- Providing no documentation of the unsafe preconditions required for callers
The implementation is indeed unsound. The usage in fill.rs and checksum.rs may work in practice if the memory happens to be valid at those call sites, but the trait implementation itself is a soundness violation that could lead to UB.
Unsound: returning a slice to arbitrary raw address violates std::slice::from_raw_parts safety contract
MemoryDescriptor::as_slice builds &[u8] from an arbitrary (addr, size) without provenance/lifetime guarantees. The safety contract requires that data be valid for reads and properly aligned, and the entire memory range must be contained within a single allocated object. This implementation cannot uphold these requirements, yet exposes a safe method that uses unsafe. The lifetime is tied to &self, unrelated to the actual memory's existence.
Remove this impl actions::Slice for MemoryDescriptor entirely, or provide unsafe fn as_slice_unchecked(&self) -> &'static [u8] for tightly-scoped internal use with explicit SAFETY comments documenting the caller's obligations.
🏁 Script executed:
#!/bin/bash
# Find the actions::Slice trait definition
rg -nP --type=rust 'trait\s+Slice' -C5

Length of output: 1897
🏁 Script executed:
#!/bin/bash
# Look at the context around as_slice() calls in fill.rs
rg -nP --type=rust -A5 -B5 'as_slice\(\)' lib/llm/src/block_manager/v2/physical/transfer/fill.rs

Length of output: 1975
Unsound: safe trait method violates documented Safety contract by exposing arbitrary-memory slice
The Slice trait (defined at lib/llm/src/block_manager/v2/memory/actions.rs lines 25+) documents safety preconditions: "The caller must ensure the memory region is valid and initialized." However, impl Slice for MemoryDescriptor exposes a safe method that uses unsafe { std::slice::from_raw_parts() } without any validation capability. MemoryDescriptor is just (addr: usize, size: usize)—it has no means to guarantee memory validity, alignment, or lifetime. This allows callers to construct arbitrary slices without verifying preconditions, violating Rust's safety guarantee.
Remove impl actions::Slice for MemoryDescriptor entirely. If a slice is needed, provide an explicit unsafe fn as_slice_unchecked(&self) with documented SAFETY requirements for internal use only, or validate memory provenance at the point where MemoryDescriptor is created.
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/mod.rs around lines 202 to 205, the safe
impl of actions::Slice for MemoryDescriptor unsafely constructs a slice from an
arbitrary addr/size, violating the trait's documented safety contract; remove
this impl entirely and replace it with either an explicit unsafe fn
as_slice_unchecked(&self) with a SAFETY comment outlining that the caller must
guarantee valid, initialized, properly aligned memory and lifetime, or perform
validation when MemoryDescriptor instances are created so a safe as_slice can be
provided; ensure the public API no longer exposes a safe as_slice that can be
called without upholding the documented invariants.
#[test]
fn test_require_backend() {
    let agent = NixlAgent::new_with_backends("test", &["UCX"]).expect("Need UCX for test");

    // Should succeed for available backend
    assert!(agent.require_backend("UCX").is_ok());

    // Should fail for unavailable backend
    assert!(agent.require_backend("GDS_MT").is_err());
}
Brittle test: hard‑requires UCX backend; mark ignored or gate by feature
test_require_backend will fail in environments without UCX. Align with neighboring tests by marking it #[ignore] or gating behind a cargo feature/env flag.
Example:
- #[test]
- fn test_require_backend() {
+ #[test]
+ #[ignore] // Requires UCX backend present in the environment
+ fn test_require_backend() {
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/nixl_agent/mod.rs around lines
236-245, the test test_require_backend hard-depends on the UCX backend and will
fail in environments without UCX; update the test to either mark it #[ignore] so
it’s skipped by default or gate it behind a Cargo feature or runtime env check
(e.g., wrap with #[cfg(feature = "ucx_tests")] or check an env var at test start
and early-return with eprintln!); ensure the chosen approach matches neighboring
tests’ style and add a short comment explaining why the test is ignored/gated.
while !outstanding.is_empty() {
    check_interval.tick().await;

    let mut completed = Vec::new();

    for (uuid, transfer) in outstanding.iter() {
        match transfer.checker.is_complete() {
            Ok(true) => {
                completed.push((*uuid, Ok(())));
            }
            Ok(false) => {
                // Still pending
            }
            Err(e) => {
                warn!(
                    uuid = %uuid,
                    error = %e,
                    "Transfer status check failed during shutdown"
                );
                completed.push((*uuid, Err(e)));
            }
        }
    }

    for (uuid, result) in completed {
        if let Some(transfer) = outstanding.remove(&uuid) {
            let _ = transfer.done.send(result);
        }
    }
}
Shutdown drain loop can spin indefinitely.
After the channel closes, the loop continues polling until all outstanding transfers complete. If a transfer checker never returns true (due to a bug, lost event, or abandoned transfer), this task will never exit. Consider adding a maximum wait time or timeout per transfer to prevent indefinite spinning.
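A minimal sketch of a bounded drain, assuming a hypothetical SHUTDOWN_DRAIN_TIMEOUT constant; the anyhow error is a stand-in for whatever error type the oneshot actually carries. Transfers still pending at the deadline are failed instead of polled forever:

let deadline = tokio::time::Instant::now() + SHUTDOWN_DRAIN_TIMEOUT;
while !outstanding.is_empty() {
    check_interval.tick().await;
    if tokio::time::Instant::now() >= deadline {
        // Fail everything still pending so waiters are signaled and the task can exit.
        for (uuid, transfer) in outstanding.drain() {
            warn!(uuid = %uuid, "Transfer did not complete before shutdown deadline");
            let _ = transfer.done.send(Err(anyhow::anyhow!("shutdown drain timed out")));
        }
        break;
    }
    // ... existing completion polling ...
}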
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/notifications/mod.rs around
lines 146 to 175, the shutdown drain loop can spin forever if a transfer.checker
never returns true; add a timeout mechanism (either per-transfer deadline
timestamp recorded when shutdown starts or an overall max-wait deadline) and
enforce it inside the loop: record start/deadline for each outstanding transfer,
on each iteration check elapsed time and if a transfer has exceeded its deadline
treat it as failed (send an Err with a timeout error and remove it from
outstanding); use tokio::time::Instant/Duration (or tokio::time::timeout applied
to the checker call) to implement the timeout and ensure the transfer.done
channel is signaled with an appropriate error so the task can exit even if
checkers never report completion.
// Disk → Remote - always stage through host
Disk(_) => TransferPlan::TwoHop {
    first: NixlWrite,
    bounce_location: Pinned,
    second: NixlWrite,
},
}
Disk → Remote bounce plan: first hop should read from Disk, not write
For staged Disk → Remote, the first hop should mirror Disk → Host (read from Disk to Pinned). Using NixlWrite for the first hop is inconsistent with your own Disk→Host path (Lines 173–176) and Disk↔Disk TwoHop (Lines 179–183).
Apply:
- Disk(_) => TransferPlan::TwoHop {
- first: NixlWrite,
+ Disk(_) => TransferPlan::TwoHop {
+ first: NixlReadFlipped,
bounce_location: Pinned,
second: NixlWrite,
},
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/strategy.rs around lines 239
to 245, the Disk→Remote TwoHop plan incorrectly uses NixlWrite for the first
hop; change the first hop to read from Disk (use NixlRead) so the plan becomes
first: NixlRead, bounce_location: Pinned, second: NixlWrite to match the
Disk→Host and Disk↔Disk TwoHop behavior.
// Helper to create a minimal transfer context for testing
// In real tests with CUDA/NIXL, this would be properly constructed
fn create_test_context() -> TransferContext {
    // For now, we'll skip these tests if CUDA is not available
    // In the future, we can mock TransferContext or use conditional compilation
    todo!("Create test context - requires CUDA/NIXL setup")
}
Missing test context stub blocks compilation
create_test_context() is todo!(). Either gate tests behind cfg or provide a minimal mock.
Add #[cfg(feature = "transfer-tests")] around async tests or implement a simple mock TransferContext builder used only for unit tests.
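A minimal sketch of the cfg approach; the transfer-tests feature name is the one suggested above and would need a matching entry in Cargo.toml:

// In the test module:
#[cfg(all(test, feature = "transfer-tests"))]
mod transfer_tests {
    use super::*;

    #[tokio::test]
    async fn round_trip_transfer() {
        let ctx = create_test_context(); // only compiled when the feature is on
        // ... exercise transfers against ctx ...
    }
}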
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/testing.rs around lines
247-253, create_test_context() currently calls todo!(), which breaks
compilation; either (A) wrap the async tests that call create_test_context()
with #[cfg(feature = "transfer-tests")] (and add a corresponding dev-feature in
Cargo.toml) so the function can remain unimplemented for non-test builds, or (B)
implement a minimal test-only mock TransferContext returned by
create_test_context() (behind #[cfg(test)] or #[cfg(feature =
"transfer-tests")]) that constructs whatever minimal fields TransferContext
requires for the tests to run without CUDA/NIXL; choose one approach and apply
cfg attributes consistently to the tests and the stub builder so non-CUDA builds
compile.
/// Create a test agent with optimal backends for testing.
///
/// Attempts to initialize UCX, GDS, and POSIX backends. Falls back gracefully
/// if some backends are unavailable (e.g., GDS on non-DGX machines).
pub fn create_test_agent(name: &str) -> NixlAgent {
    NixlAgent::require_backends(name, &[]).expect("Failed to require backends")
}
Doc/impl mismatch: backends.
Comment says “Attempts to initialize UCX, GDS, and POSIX,” but code uses require_backends(name, &[]). Either update the doc or call new_with_backends(name, &["UCX","GDS_MT","POSIX"]) and handle partial availability.
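A minimal sketch of the progressive fallback, assuming NixlAgent::new_with_backends fails when any listed backend is unavailable:

pub fn create_test_agent(name: &str) -> NixlAgent {
    // Try the richest backend set first, then degrade.
    for backends in [
        &["UCX", "GDS_MT", "POSIX"][..],
        &["UCX", "POSIX"][..],
        &["POSIX"][..],
    ] {
        if let Ok(agent) = NixlAgent::new_with_backends(name, backends) {
            return agent;
        }
    }
    NixlAgent::require_backends(name, &[]).expect("Failed to create agent with any backend set")
}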
🤖 Prompt for AI Agents
lib/llm/src/block_manager/v2/physical/transfer/tests/mod.rs around lines
101-107: the doc claims we "Attempts to initialize UCX, GDS, and POSIX" but the
code calls require_backends(name, &[]); change the implementation to actually
attempt those backends by calling NixlAgent::new_with_backends(name,
&["UCX","GDS_MT","POSIX"]) and handle partial availability by catching errors,
logging which backends failed, and falling back to progressively smaller subsets
(or finally to require_backends(name, &[])) so the function matches the doc and
gracefully degrades when some backends are unavailable.
Actionable comments posted: 27
| "lib/bindings/c", | ||
| "lib/bindings/python/codegen", | ||
| "lib/engines/*", | ||
| "lib/kvbm-kernels" |
🧩 Analysis chain
Verify CI/toolchain alignment for new member and CUDA feature; fix executable bit on lib/kvbm-kernels/Cargo.toml.
- Ensure builders have CUDA Toolkit matching cudarc feature "cuda-12020" (CUDA 12.2.x). Otherwise, builds will fail on CI/workers. Also confirm NVCC availability for cc CUDA mode.
- Pipeline flagged lib/kvbm-kernels/Cargo.toml as executable and auto-formatted by clang-format; drop the executable bit.
🏁 Script executed:
#!/usr/bin/env bash
# Read-only checks
echo "== CUDA version =="
nvcc --version 2>/dev/null | sed -n '1,4p' || echo "nvcc not found"
echo
echo "== cudarc feature in workspace =="
rg -nP '\bcudarc\b.*cuda-12020' -n -C1 Cargo.toml lib/**/Cargo.toml
echo
echo "== File mode for lib/kvbm-kernels/Cargo.toml =="
stat --format '%A %n' lib/kvbm-kernels/Cargo.toml 2>/dev/null || ls -l lib/kvbm-kernels/Cargo.toml
echo
echo "== Pre-commit status (clang-format) =="
git status --porcelain | grep 'lib/kvbm-kernels/Cargo.toml' || true
echo
echo "== Suggested fix (manual) =="
echo "Run: git update-index --chmod=-x lib/kvbm-kernels/Cargo.toml && pre-commit run -a"Length of output: 473
Fix executable bit on lib/kvbm-kernels/Cargo.toml and verify CI CUDA environment.
The file mode check confirms lib/kvbm-kernels/Cargo.toml has the executable bit incorrectly set (-rwxr-xr-x). Remove it:
git update-index --chmod=-x lib/kvbm-kernels/Cargo.toml

Additionally, ensure CI/build workers have CUDA Toolkit 12.2.x installed to match the cudarc feature "cuda-12020" declared in Cargo.toml line 68. Builds will fail on CI if the CUDA version does not align.
🤖 Prompt for AI Agents
Cargo.toml (lib/kvbm-kernels) around line 15: the file has its executable bit
set incorrectly and CI may mismatch CUDA version; remove the executable
permission from lib/kvbm-kernels/Cargo.toml (clear the x bit in the index so the
file is non-executable) and commit that change, then verify CI/build workers
have CUDA Toolkit 12.2.x to match the cudarc feature "cuda-12020" declared at
line 68 so builds use the compatible CUDA version.
tokenizer = tokenizer_factory(arg_map["model"])
default_sampling_params = SamplingParams()
default_sampling_params._setup(tokenizer)
default_sampling_params.end_id = tokenizer.eos_token_id
🧩 Analysis chain
Good move away from private API, but verify completeness.
Replacing the private _setup method call with direct attribute assignment improves maintainability by avoiding internal API dependencies. However, please verify that the old _setup(tokenizer) method only set the end_id attribute and didn't perform any additional initialization that might now be missing.
Run the following script to locate the _setup method implementation and verify what it does:
🏁 Script executed:
#!/bin/bash
# Description: Find the _setup method implementation in SamplingParams to verify completeness
# Search for _setup method definition in Python files
echo "=== Searching for _setup method definition ==="
rg -n -A 10 "def _setup" --type=py
# Search for other calls to _setup on SamplingParams to see usage patterns
echo -e "\n=== Searching for other _setup usage patterns ==="
rg -n -B 2 -A 2 "SamplingParams.*_setup|sampling_params.*_setup" --type=py

Length of output: 97
🏁 Script executed:
#!/bin/bash
# Find where SamplingParams is imported from
echo "=== Searching for SamplingParams import in main.py ==="
rg -n "SamplingParams|from.*import.*Sampling|import.*Sampling" components/src/dynamo/trtllm/main.py --type=py -A 2 -B 2
# Show context around line 272 to understand the full change
echo -e "\n=== Context around line 272 (20 lines before and after) ==="
sed -n '250,290p' components/src/dynamo/trtllm/main.py
# Search for any _setup calls in the entire codebase to see if the pattern exists elsewhere
echo -e "\n=== All _setup method calls (not just definitions) ==="
rg -n "\._setup\s*\(" --type=py -B 1 -A 1Length of output: 328
---
🏁 Script executed:
#!/bin/bash
# Find main.py files in the repository
echo "=== Searching for main.py files ==="
fd "main\.py" --type f
# Search for any dynamo or trtllm related directories
echo -e "\n=== Looking for dynamo/trtllm structure ==="
fd "dynamo|trtllm" --type d | head -20
# Search for SamplingParams to locate relevant files
echo -e "\n=== Files containing SamplingParams ==="
rg -l "SamplingParams" --type=py
Length of output: 1108
---
🏁 Script executed:
#!/bin/bash
# Display context around line 272
echo "=== Content around line 272 (260-285) ==="
sed -n '260,285p' components/src/dynamo/trtllm/main.py
# Show imports at top of file
echo -e "\n=== Imports section (first 50 lines) ==="
head -50 components/src/dynamo/trtllm/main.py | grep -A 50 "^import\|^from"
# Search for SamplingParams source
echo -e "\n=== SamplingParams import ==="
rg -n "SamplingParams" components/src/dynamo/trtllm/main.py -B 5 -A 2 | head -30
# Check if there are any references to _setup in handler_base.py since it also has SamplingParams
echo -e "\n=== Check handler_base.py for _setup usage ==="
rg -n "_setup" components/src/dynamo/trtllm/request_handlers/handler_base.py -B 2 -A 2
Length of output: 3774
---
🌐 Web query:
TensorRT-LLM SamplingParams initialization eos_token_id
💡 Result:
Use the SamplingParams "end_id" (not "eos_token_id"). You can either set it explicitly or let TensorRT‑LLM fill it from the LLM's tokenizer:
-
Explicit:
sampling_params = SamplingParams(end_id=tokenizer.eos_token_id, pad_id=tokenizer.pad_token_id). [1][2] -
Implicit (leave end_id None): TensorRT‑LLM will call sampling_params._setup(tokenizer) and set end_id = tokenizer.eos_token_id when a tokenizer is available (otherwise it raises). [1][3]
References:
[1] TensorRT‑LLM Python module showing _setup/end_id behavior.
[2] Runtime docs showing SamplingConfig(end_id=...) usage.
[3] Examples converting GenerationConfig → SamplingParams (uses end_id=config.eos_token_id).
---
🏁 Script executed:
#!/bin/bash
# Check if pad_id is set anywhere in the file
echo "=== Searching for pad_id in main.py ==="
rg -n "pad_id" components/src/dynamo/trtllm/main.py -B 2 -A 2
# Check if pad_id is used elsewhere in SamplingParams usage
echo -e "\n=== Searching for pad_id in handler_base.py ==="
rg -n "pad_id" components/src/dynamo/trtllm/request_handlers/handler_base.py -B 2 -A 2
# Check for SamplingParams usage patterns in vllm handlers for comparison
echo -e "\n=== Searching for pad_id in vllm handlers ==="
rg -n "pad_id" components/src/dynamo/vllm/handlers.py -B 2 -A 2
# Look for any other SamplingParams initializations
echo -e "\n=== All SamplingParams() creations in trtllm module ==="
rg -n "SamplingParams()" components/src/dynamo/trtllm/ -B 1 -A 3
Length of output: 2161
Refactoring appears incomplete—verify pad_id is being set correctly.
The old _setup(tokenizer) method would set both end_id and pad_id, but the new code only explicitly sets end_id on line 272. The explicit initialization pattern is SamplingParams(end_id=tokenizer.eos_token_id, pad_id=tokenizer.pad_token_id), yet pad_id is not assigned anywhere in your code.
Verify that omitting pad_id won't affect model behavior, or add:
default_sampling_params.pad_id = tokenizer.pad_token_id

🤖 Prompt for AI Agents
In components/src/dynamo/trtllm/main.py at line 272, the refactor sets
default_sampling_params.end_id = tokenizer.eos_token_id but never sets pad_id;
restore the original behavior by assigning the tokenizer pad id (either when
constructing SamplingParams or immediately after) so pad_id is set to
tokenizer.pad_token_id to match the previous SamplingParams(end_id=...,
pad_id=...) initialization.
dlpark = { version = "0.5", features = ["pyo3", "half"], optional = true }
cudarc = { version = "0.16.2", features = ["cuda-12020"], optional = true }
cudarc = { version = "0.17.1", features = ["cuda-12020"], optional = true }
🧩 Analysis chain
Good: stop inheriting cudarc; now pinned. Verify feature and toolchain.
- Ensure cuda-12020 is a valid feature for cudarc 0.17.1.
🌐 Web query:
Does cudarc 0.17.1 expose a `cuda-12020` feature? If not, what is the correct feature flag for CUDA 12.2?
💡 Result:
No. cudarc 0.17.1 did not expose a cuda-12020 feature. The correct feature name for CUDA 12.2 is cuda-12020 (added in later cudarc versions) — either upgrade cudarc (e.g., 0.17.3 shows cuda-12020) or use the cuda-version-from-build-system feature. [1][2]
Sources:
- cudarc features listing (older 0.17.x snapshot showing no cuda-12020). [1]
- cudarc features (shows cuda-12020 present in later 0.17.x). [2]
Feature cuda-12020 does not exist in cudarc 0.17.1—this will cause a build failure.
The cuda-12020 feature was added in later cudarc versions (e.g., 0.17.3+). Either upgrade cudarc to 0.17.3 or later, or use the cuda-version-from-build-system feature with the current version.
🤖 Prompt for AI Agents
In lib/bindings/python/Cargo.toml around line 76, the dependency specifies a
non-existent feature "cuda-12020" for cudarc 0.17.1 which will cause a build
failure; fix by either bumping the cudarc version to 0.17.3 or later where
"cuda-12020" exists, or keep cudarc = "0.17.1" and replace the feature with
"cuda-version-from-build-system" (or remove the optional feature) so the
Cargo.toml references a valid feature set and the build can succeed.
fn as_slice_typed<T>(&self) -> Result<&[T], StorageError> {
    let bytes = self.as_slice()?;
    let ptr = bytes.as_ptr() as *const T;
    let len = bytes.len() / std::mem::size_of::<T>();

    if !(bytes.as_ptr() as usize).is_multiple_of(std::mem::align_of::<T>()) {
        return Err(StorageError::Unsupported(format!(
            "memory not aligned for type (required alignment: {})",
            std::mem::align_of::<T>()
        )));
    }

    if bytes.len() % std::mem::size_of::<T>() != 0 {
        return Err(StorageError::Unsupported(format!(
            "size {} is not a multiple of type size {}",
            bytes.len(),
            std::mem::size_of::<T>()
        )));
    }

    // SAFETY: Caller guarantees memory is valid, aligned, and properly initialized for T
    Ok(unsafe { std::slice::from_raw_parts(ptr, len) })
}
Handle zero-sized types in typed slice creation.
size_of::<T>() == 0 will divide by zero. Reject ZSTs explicitly.
Apply this diff:
fn as_slice_typed<T>(&self) -> Result<&[T], StorageError> {
- let bytes = self.as_slice()?;
+ if std::mem::size_of::<T>() == 0 {
+ return Err(StorageError::Unsupported("ZST not supported".into()));
+ }
+ let bytes = self.as_slice()?;
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/actions.rs around lines 67 to 89, the
function as_slice_typed<T> can divide by zero for zero-sized types (ZSTs); add
an explicit check rejecting ZSTs before doing any division or alignment logic
(e.g. let elem_size = std::mem::size_of::<T>(); if elem_size == 0 { return
Err(StorageError::Unsupported("zero-sized types not supported".into())); }),
then use elem_size for len and size checks and keep the existing alignment and
size-multiple checks and the final unsafe slice creation.
fn slice_typed<T>(&self, offset: usize, len: usize) -> Result<&[T], StorageError> {
    let type_size = std::mem::size_of::<T>();
    let byte_len = len
        .checked_mul(type_size)
        .ok_or_else(|| StorageError::Unsupported("length overflow".into()))?;

    let bytes = self.slice(offset, byte_len)?;
    let ptr = bytes.as_ptr() as *const T;

    if !(bytes.as_ptr() as usize).is_multiple_of(std::mem::align_of::<T>()) {
        return Err(StorageError::Unsupported(format!(
            "memory not aligned for type (required alignment: {})",
            std::mem::align_of::<T>()
        )));
    }

    // SAFETY: Caller guarantees memory is valid, aligned, and properly initialized for T
    Ok(unsafe { std::slice::from_raw_parts(ptr, len) })
}
🛠️ Refactor suggestion | 🟠 Major
Also guard ZSTs in slice_typed.
Mirror the ZST check here before checked_mul.
fn slice_typed<T>(&self, offset: usize, len: usize) -> Result<&[T], StorageError> {
- let type_size = std::mem::size_of::<T>();
+ let type_size = std::mem::size_of::<T>();
+ if type_size == 0 {
+ return Err(StorageError::Unsupported("ZST not supported".into()));
+ }

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/memory/actions.rs around lines 104 to 122, add a
guard for zero-sized types (ZSTs) before the checked_mul: if
std::mem::size_of::<T>() == 0 return an &[T] of length len without calling
self.slice or performing alignment checks (use a dangling pointer for the
from_raw_parts construction), otherwise proceed with the existing checked_mul,
self.slice call and alignment checks as before.
Some(notif) => {
    outstanding.insert(notif.uuid, OutstandingPollingTransfer {
        checker: notif.checker,
        done: notif.done,
        arrived_at: Instant::now(),
        last_warned_at: None,
    });
}
Guard against duplicate UUID registration.
A second registration overwrites the first without signaling its oneshot, leaking waiters. Warn and decide whether to reject or replace.
Apply this diff:
- Some(notif) => {
- outstanding.insert(notif.uuid, OutstandingPollingTransfer {
+ Some(notif) => {
+ if outstanding.contains_key(¬if.uuid) {
+ warn!(uuid = %notif.uuid, "Duplicate polling registration; replacing existing checker");
+ }
+ outstanding.insert(notif.uuid, OutstandingPollingTransfer {
checker: notif.checker,
done: notif.done,
arrived_at: Instant::now(),
last_warned_at: None,
});
Some(notif) => {
    outstanding.insert(notif.uuid, OutstandingTransfer {
        xfer_req: notif.xfer_req,
        done: notif.done,
        arrived_at: Instant::now(),
        last_warned_at: None,
    });
}
Warn on duplicate UUID registration.
Overwriting an existing entry drops the previous oneshot sender without signaling. Add a warning or reject duplicates.
Apply this diff:
- Some(notif) => {
- outstanding.insert(notif.uuid, OutstandingTransfer {
+ Some(notif) => {
+ if outstanding.contains_key(¬if.uuid) {
+ warn!(uuid = %notif.uuid, "Duplicate NIXL registration; replacing existing entry");
+ }
+ outstanding.insert(notif.uuid, OutstandingTransfer {
xfer_req: notif.xfer_req,
done: notif.done,
arrived_at: Instant::now(),
last_warned_at: None,
#[builder(default, setter(strip_option, into))]
pub bounce_buffer: Option<Arc<dyn BounceBufferSpec>>,
}
Builder setter likely fails with Arc: drop into for dyn-Arc
setter(strip_option, into) with Option<Arc<dyn BounceBufferSpec>> commonly fails because Arc<Concrete> does not implement Into<Arc<dyn Trait>>. Also, this object is shared across threads—add Send+Sync bounds.
- #[builder(default, setter(strip_option, into))]
- pub bounce_buffer: Option<Arc<dyn BounceBufferSpec>>,
+ #[builder(default, setter(strip_option))]
+ pub bounce_buffer: Option<Arc<dyn BounceBufferSpec + Send + Sync>>,Also consider adding #[derive(Debug)] on TransferOptions for easier logging.
-#[derive(Clone, Default, Builder)]
+#[derive(Clone, Default, Debug, Builder)]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #[builder(default, setter(strip_option, into))] | |
| pub bounce_buffer: Option<Arc<dyn BounceBufferSpec>>, | |
| } | |
| #[builder(default, setter(strip_option))] | |
| pub bounce_buffer: Option<Arc<dyn BounceBufferSpec + Send + Sync>>, | |
| } |
// SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// SPDX-License-Identifier: Apache-2.0

//! Unit tests for CUDA kernel compatibility and execution.
use super::{NixlAgent, create_test_agent};
use crate::block_manager::v2::physical::layout::{BlockDimension, LayoutConfig, PhysicalLayout};
use crate::block_manager::v2::physical::transfer::executor::cuda::try_execute_operational_kernel;
use crate::block_manager::v2::physical::transfer::{StorageKind, TransferContext};
use dynamo_kvbm_kernels::tensor_kernels::OperationalCopyBackend;

/// Create a fully contiguous layout for testing
Gate CUDA kernel tests behind a CUDA feature.
Without CUDA, these tests will fail on CI. Gate the module or add a device-availability check.
-//! Unit tests for CUDA kernel compatibility and execution.
+//! Unit tests for CUDA kernel compatibility and execution.
+#![cfg(feature = "cuda")]

If you prefer runtime gating, probe device availability and return early when absent.
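A minimal sketch of the runtime alternative; CudaDevice::new is the cudarc 0.16-style constructor and is an assumption here, so adjust the probe for the pinned cudarc version:

fn cuda_available() -> bool {
    // Any error (no driver, no device) is treated as "skip".
    cudarc::driver::CudaDevice::new(0).is_ok()
}

#[test]
fn operational_kernel_roundtrip() {
    if !cuda_available() {
        eprintln!("skipping: no CUDA device");
        return;
    }
    // ... existing kernel test body ...
}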
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/tests/cuda_kernel_tests.rs
around lines 1 to 12, the CUDA unit tests are unguarded and will fail on CI when
CUDA is not available; gate the tests by either adding a compile-time feature
flag (e.g., #[cfg(feature = "cuda")] or #[cfg(target_os = "cuda")]) around the
module or test functions, or add a runtime check at test startup that probes
CUDA device availability and returns early (skipping the test) when no device is
present; ensure the gating is applied to the module or each test so tests are
skipped when CUDA support/device is absent.
/// Note: The agent should already have backends configured. Use `create_test_agent`
/// or `build_agent_with_backends` to create properly configured agents.
pub fn create_transfer_context(
Doc nit: reference correct function name.
Mention create_test_agent_with_backends, not build_agent_with_backends.
-/// or `build_agent_with_backends` to create properly configured agents.
+/// or `create_test_agent_with_backends` to create properly configured agents.

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/v2/physical/transfer/tests/mod.rs around lines 165
to 167, the doc comment references the wrong helper name; update the text to
mention create_test_agent_with_backends instead of build_agent_with_backends so
the documentation points to the correct function for creating agents with
backends configured.
Signed-off-by: jthomson04 <[email protected]>
32c24bd to 9746be2
Signed-off-by: jthomson04 <[email protected]>
583aecb to bda39dd
Summary by CodeRabbit
New Features
Dependencies