feat(server): add DFlash disk prefix cache for target layer split #325
weicj wants to merge 7 commits into
8 issues found across 44 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/dflash_draft_ipc.cpp">
<violation number="1" location="server/src/common/dflash_draft_ipc.cpp:39">
P3: The new IPC env/size helper block duplicates helper logic that already exists in `qwen35_target_shard_ipc.cpp`; extract a shared utility to avoid behavior drift across IPC clients.</violation>
</file>
<file context>
@@ -11,12 +11,71 @@
+ return transport;
+}
+
+bool checked_mul_size(size_t a, size_t b, size_t & out) {
+ if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
+ return false;
</file context>
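The file context above is cut off after the overflow check. A minimal sketch of how the complete helper might look if it were extracted into one shared utility, as the review suggests, is shown below. The body after `return false;` is an assumption, since the source does not show it.

```cpp
#include <cstddef>
#include <limits>

// Sketch of an overflow-checked multiply for sizing IPC payloads.
// The guard matches the excerpt above; the success path is an assumption.
bool checked_mul_size(std::size_t a, std::size_t b, std::size_t & out) {
    // a * b overflows exactly when b exceeds max / a (for a != 0).
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return false;  // leave `out` untouched on failure
    }
    out = a * b;
    return true;
}
```

Keeping a single definition like this in a shared header would let both IPC clients size their payload buffers with identical overflow behavior.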
Record the Luce-Org#291/Luce-Org#290 draft-residency integration, newly non-draft Luce-Org#321/Luce-Org#325 classification, validation, and retained worktree/transcript paths for the May 31 13:30 UTC run.
2 issues found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/laguna/laguna_layer_split_adapter.cpp">
<violation number="1" location="server/src/laguna/laguna_layer_split_adapter.cpp:507">
P1: `snapshot_adopt` shares `ctx/buf` across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.</violation>
</file>
<file name="server/src/common/dflash_draft_ipc_daemon.cpp">
<violation number="1">
P1: Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</violation>
</file>
<file context>
@@ -332,6 +367,203 @@ bool LagunaLayerSplitAdapter::snapshot_restore(int slot) {
+ for (auto & shard_snap : snap.shards) {
+ shard_snap.attn_k.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+ shard_snap.attn_v.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+ shard_snap.ctx = ctx;
+ shard_snap.buf = buf;
+ shard_snap.cur_pos = cur_pos;
</file context>
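One way to avoid the double-free described in this comment is to validate the incoming snapshot against temporaries and hand ownership to exactly one owner only after validation succeeds. The sketch below illustrates that pattern; `SnapshotStorage`, `ShardSnapshot`, `SplitSnapshot`, and `adopt_snapshot` are hypothetical stand-ins, not the types used in `laguna_layer_split_adapter.cpp`.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the ctx/buf pair that backs an adopted snapshot.
// The counter only exists so the number of frees can be observed.
struct SnapshotStorage {
    int * free_counter;
    explicit SnapshotStorage(int * counter) : free_counter(counter) {}
    ~SnapshotStorage() { ++*free_counter; }
};

struct ShardSnapshot {
    const SnapshotStorage * storage = nullptr;  // non-owning view
    int cur_pos = 0;
};

struct SplitSnapshot {
    std::unique_ptr<SnapshotStorage> owner;  // the single owner of the storage
    std::vector<ShardSnapshot> shards;       // shards only borrow it
};

// Stage everything in temporaries; commit to `snap` only after validation.
bool adopt_snapshot(SplitSnapshot & snap,
                    std::unique_ptr<SnapshotStorage> incoming,
                    std::size_t n_shards, int cur_pos, bool layout_ok) {
    if (!incoming || !layout_ok || n_shards == 0) {
        return false;  // `incoming` is released once, by its own destructor
    }
    std::vector<ShardSnapshot> staged(n_shards);
    for (auto & shard : staged) {
        shard.storage = incoming.get();
        shard.cur_pos = cur_pos;
    }
    snap.shards = std::move(staged);
    snap.owner  = std::move(incoming);
    return true;
}
```

Because the shard entries hold only non-owning pointers, a failed validation leaves `snap` untouched and the storage is freed exactly once.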
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/dflash_draft_ipc_daemon.cpp, line 454:
<comment>Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</comment>
<file context>
@@ -451,7 +451,7 @@ int run_dflash_draft_ipc_daemon(const char * draft_path,
line.c_str());
stream_status(stream_fd, -1);
- break;
+ continue;
}
if (!read_exact_fd(payload_fd, noise_embed.data(), bytes)) {
</file context>
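The desynchronization risk can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the daemon's code: the `Request` struct, the `OnMalformed` policy, and the `serve` function are hypothetical, and the model assumes the client has already written the payload for a request whose command line turns out to be malformed.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of one request: a command line plus payload bytes
// that the client writes to a separate payload pipe.
struct Request {
    bool header_valid;    // false models a malformed propose_pipe line
    std::string payload;  // bytes the client wrote for this request
};

enum class OnMalformed { Continue, Break };

// Returns the payload the server believes it read for each accepted request.
std::vector<std::string> serve(const std::vector<Request> & requests, OnMalformed policy) {
    std::deque<char> pipe;
    // Assumption: payloads are already queued in the pipe (pipelined client).
    for (const auto & req : requests) {
        pipe.insert(pipe.end(), req.payload.begin(), req.payload.end());
    }
    std::vector<std::string> seen;
    for (const auto & req : requests) {
        if (!req.header_valid) {
            if (policy == OnMalformed::Break) {
                break;  // stop serving: the pipe contents can no longer be trusted
            }
            continue;   // unread payload bytes stay in the pipe
        }
        std::string got;
        for (std::size_t i = 0; i < req.payload.size() && !pipe.empty(); ++i) {
            got.push_back(pipe.front());
            pipe.pop_front();
        }
        seen.push_back(got);
    }
    return seen;
}
```

With the `Continue` policy, the next valid request reads the bytes left behind by the malformed one. Since a malformed line may not carry a trustworthy size, draining the exact payload is not always possible, which is why ending the session is the conservative choice in this model.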
Record the exact Luce-Org#290/Luce-Org#291 merges, current Luce-Org#321/Luce-Org#325 classification, retained worktrees, and validation for the 2026-05-31 13:57 integration run.
Selectively carries the same-backend Qwen3.5 layer-split disk prefix-cache snapshot export/adopt slice from PR Luce-Org#325 while leaving the mixed-backend runtime and Laguna cache work blocked on the larger PR Luce-Org#321 architecture reconciliation. Also refreshes the auto-integration manifest/run log with the current PR classification and retained worktree notes.
Port the remaining narrow PR Luce-Org#325 disk-prefix-cache cleanup for current layout: allow lookups against layouts learned from disk, validate the adopted snapshot against the live backend layout, and reindex on mismatch. Refresh auto-integration metadata after reprobing current non-ancestor PRs.
Port a narrow PR Luce-Org#325 IPC robustness slice: honor explicit DFlash draft IPC auto transport, size shared payload capacity from live draft dimensions, and let backend IPC auto transport fall back to stream if shared setup is unavailable.

Validation: git diff --check; conflict-marker search; Codex review reported no blocking findings. Local syntax-only compile remains blocked by missing dflash27b.h in this checkout.
Selectively ports the same-backend Laguna layer-split prefix-cache disk snapshot surface from PR Luce-Org#325. Exports CPU-backed snapshot refs, adopts deserialized shard/logit tensors with temporary validation before taking ownership, and refreshes the integration manifest/probe log.
Carry the next PR Luce-Org#325 selective-port slice by narrowing the server placement validation guard: same-backend target layer split may now use --kv-cache-dir after the disk snapshot/adopt paths already ported, while mixed-backend target layer split remains blocked until remote shard disk snapshot IPC export/import exists.

Update the auto-integration manifest with current PR classification, probe results, and retained worktrees.

Validation: git diff --check; conflict-marker search in promoted files; stack.yaml syntax check via file write linter; tmux Codex review reported no findings.
Record the 2026-05-31 23:00 metadata/probe refresh: current PR-head containment, Luce-Org#321/Luce-Org#325 conflict probes, and tmux Claude/Codex read-only delegation outcomes. No source changes were promoted.
Record the 2026-06-01 00:28 unattended probe pass. No source changes were promoted; Luce-Org#321 still needs live Qwen35 mixed-target adapter wiring, while Luce-Org#325's non-Luce-Org#321 same-backend disk-prefix-cache behavior is represented pending Luce-Org#321 mixed-target wiring.
Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.
Record the 2026-06-01 03:25 unattended refresh, including exact open PR head containment, direct-merge conflict counts, and read-only Luce-Org#325 delegation results. No source changes were promoted.
Record current heads for PR Luce-Org#321 and Luce-Org#325 as represented by the auto-integration stack after direct merge and tmux-delegated conflict-resolution attempts confirmed the remaining diffs are already carried by current-layout port commits.
Summary
This PR restores disk prefix-cache support for same-backend target layer split, including the DFlash draft path. Target layer split already had in-process prefix-cache support, but the restart-persistent `--kv-cache-dir` path was still blocked; this PR brings the split+DFlash path back to parity with the single-backend disk-backed prefix restore behavior.

Previously, `dflash_server` rejected `--kv-cache-dir` when `--target-devices` was enabled because the existing disk snapshot format only matched a single backend snapshot. Under target layer split, the live prefix state is sharded across multiple target shards, so a disk hit could not be safely exported, loaded after restart, and rebound back into the split backend.

This change adds a flattened layer-split disk snapshot for the same-backend path. Each shard snapshot is exported into one disk-owned CPU snapshot with shard-prefixed tensor names, then adopted back into the shard-local snapshot slots on cache lookup. For DFlash, the snapshot also persists the draft feature mirror metadata and feature rows, so a restored target prefix can continue speculative decode instead of falling back to a target-only cache state.
Changes
- Add `snapshot_ref()` and `snapshot_adopt()` hooks to the `LayerSplitAdapter` / `LayerSplitBackend` boundary, so the server disk-cache layer can save and load snapshots through the generic backend interface.
- Export shard tensors under shard-prefixed names of the form `ls<shard>_<tensor-name>` so they can be rebound to the correct target shard on load;
- persist the `snap_prefill_logits` tensor used by prefix restore;
- persist the DFlash draft feature mirror as `dflash_feature_meta` and `dflash_feature_data`;
- guard adopted snapshot `ctx`/`buf` ownership from being double-freed.
- Allow `--kv-cache-dir` with same-backend `--target-devices`.

Notes
`hip:0,hip:1`. The first server process saved the disk prefix cache; after restart, the second process logged `[target-split] adopted disk snapshot`, `disk_hit=true`, and `restore=true`, and ran DFlash speculative decode with accepted draft tokens.
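
The `ls<shard>_<tensor-name>` naming described under Changes can be sketched as a pair of helpers that build a flattened name on export and recover the shard index on load. The function names and the example tensor name `attn_k_3` are hypothetical; the PR text confirms only the `ls<shard>_<tensor-name>` pattern.

```cpp
#include <cstddef>
#include <string>

// Build a flattened, shard-prefixed tensor name such as "ls1_attn_k_3".
std::string make_shard_tensor_name(std::size_t shard, const std::string & tensor) {
    return "ls" + std::to_string(shard) + "_" + tensor;
}

// Split a flattened name back into shard index and per-shard tensor name.
// Returns false for names that do not follow the ls<shard>_<tensor-name> pattern.
// Overflow of very long digit runs is not checked in this sketch.
bool parse_shard_tensor_name(const std::string & full,
                             std::size_t & shard, std::string & tensor) {
    if (full.compare(0, 2, "ls") != 0) {
        return false;
    }
    std::size_t i = 2;
    std::size_t value = 0;
    while (i < full.size() && full[i] >= '0' && full[i] <= '9') {
        value = value * 10 + static_cast<std::size_t>(full[i] - '0');
        ++i;
    }
    // Require at least one digit, an underscore, and a non-empty tensor name.
    if (i == 2 || i >= full.size() || full[i] != '_' || i + 1 >= full.size()) {
        return false;
    }
    shard = value;
    tensor = full.substr(i + 1);
    return true;
}
```

A loader built this way can route each tensor to its shard by index, while tensors that are not shard-prefixed, such as `snap_prefill_logits`, fail to parse and can be handled separately.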