feat(server): add DFlash disk prefix cache for target layer split #325

Open
weicj wants to merge 7 commits into Luce-Org:main from weicj:feat-layer-split-disk-prefix-cache

@weicj (Collaborator) commented May 31, 2026

Summary

This PR restores disk prefix-cache support for same-backend target layer split, including the DFlash draft path. Target layer split already had in-process prefix-cache support, but the restart-persistent --kv-cache-dir path was still blocked; this PR brings the split+DFlash path back to parity with the single-backend disk-backed prefix restore behavior.

Previously, dflash_server rejected --kv-cache-dir when --target-devices was enabled because the existing disk snapshot format only matched a single-backend snapshot. Under target layer split, the live prefix state is sharded across multiple target shards, so that state could not be safely exported to disk, reloaded after restart, and rebound into the split backend.

This change adds a flattened layer-split disk snapshot for the same-backend path. Each shard snapshot is exported into one disk-owned CPU snapshot with shard-prefixed tensor names, then adopted back into the shard-local snapshot slots on cache lookup. For DFlash, the snapshot also persists the draft feature mirror metadata and feature rows, so a restored target prefix can continue speculative decode instead of falling back to a target-only cache state.
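
The shard-prefixed naming can be sketched as a pair of helpers. This is a minimal illustration and not the PR's code: `make_shard_tensor_name` and `parse_shard_tensor_name` are hypothetical names, and the tensor names used with them are placeholders. Only the `ls<shard>_<tensor-name>` pattern and the global `snap_prefill_logits` tensor come from this description.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: build the flattened on-disk name "ls<shard>_<tensor-name>".
std::string make_shard_tensor_name(std::size_t shard, const std::string & tensor_name) {
    return "ls" + std::to_string(shard) + "_" + tensor_name;
}

// Hypothetical helper: split a flattened name back into (shard index, tensor name).
// Returns std::nullopt for names without a shard prefix, for example a global
// tensor such as snap_prefill_logits. Overflow of very long shard numbers is
// not handled in this sketch.
std::optional<std::pair<std::size_t, std::string>>
parse_shard_tensor_name(const std::string & flat) {
    if (flat.compare(0, 2, "ls") != 0) {
        return std::nullopt;
    }
    std::size_t i = 2;
    std::size_t shard = 0;
    bool have_digit = false;
    while (i < flat.size() && std::isdigit(static_cast<unsigned char>(flat[i]))) {
        shard = shard * 10 + static_cast<std::size_t>(flat[i] - '0');
        have_digit = true;
        ++i;
    }
    // Require at least one digit, an underscore, and a non-empty tensor name.
    if (!have_digit || i >= flat.size() || flat[i] != '_' || i + 1 >= flat.size()) {
        return std::nullopt;
    }
    return std::make_pair(shard, flat.substr(i + 1));
}
```

On load, a parsed shard index would select the shard-local slot to rebind into, while names that do not parse would be treated as global tensors.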

Changes

  • Adds snapshot_ref() and snapshot_adopt() hooks to the LayerSplitAdapter / LayerSplitBackend boundary, so the server disk-cache layer can save and load snapshots through the generic backend interface.
  • Adds Qwen35 target-layer-split disk snapshot export/import:
    • flattens shard-local snapshot tensors into one CPU snapshot for disk persistence;
    • prefixes shard tensors as ls<shard>_<tensor-name> so they can be rebound to the correct target shard on load;
    • stores and restores the global snap_prefill_logits tensor used by prefix restore;
    • stores and restores DFlash feature-mirror state through dflash_feature_meta and dflash_feature_data;
    • validates only the tensors required by each shard's actual cache state;
    • prevents the shared adopted ctx / buf from being freed more than once.
  • Updates disk-cache cold-start lookup so a layout learned from disk can be verified after successful backend adoption, instead of blocking the first restart-time lookup before the live layout is available.
  • Updates server placement validation to allow --kv-cache-dir with same-backend --target-devices.
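
The cold-start lookup change can be pictured roughly as follows. This is a sketch under stated assumptions: `Layout`, `Lookup`, `same_layout`, and `cold_start_lookup` are hypothetical names, the two layout fields are placeholders, and the reindex-on-mismatch outcome follows the wording of a later port commit in this thread rather than the PR's actual code.

```cpp
#include <optional>

// Hypothetical layout descriptor; the real cache layout has other fields.
struct Layout {
    int n_layer = 0;
    int n_shards = 0;
};

inline bool same_layout(const Layout & a, const Layout & b) {
    return a.n_layer == b.n_layer && a.n_shards == b.n_shards;
}

enum class Lookup { Hit, Miss, Reindex };

// `live` is empty on the first lookup after a restart, before the backend has
// reported a layout. `adopt` tries to load the disk snapshot into the backend
// and returns the live layout afterwards, or std::nullopt if adoption failed.
template <typename AdoptFn>
Lookup cold_start_lookup(const std::optional<Layout> & live,
                         const Layout & from_disk,
                         AdoptFn adopt) {
    if (live && !same_layout(*live, from_disk)) {
        return Lookup::Reindex;  // known mismatch: do not attempt adoption
    }
    const std::optional<Layout> after = adopt();
    if (!after) {
        return Lookup::Miss;     // backend rejected the snapshot
    }
    // Verify only now, when a live layout is available.
    return same_layout(*after, from_disk) ? Lookup::Hit : Lookup::Reindex;
}
```

The point of the ordering is that an unknown live layout no longer blocks the first lookup; the check is deferred until adoption has produced a layout to compare against.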

Notes

  • Runtime validation passed on dual Pro VII / ROCm 6.3.3 with Qwen3.6-27B Q4 target and the DFlash Qwen3.6 draft, split across hip:0,hip:1. The first server process saved disk prefix cache; after restart, the second process logged [target-split] adopted disk snapshot, disk_hit=true, restore=true, and DFlash speculative decode with accepted draft tokens.
  • Mixed-backend target split disk cache remains a follow-up because remote shard snapshots need IPC export/import support.

@weicj weicj marked this pull request as ready for review May 31, 2026 17:24

@cubic-dev-ai (bot, Contributor) left a comment

8 issues found across 44 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/dflash_draft_ipc.cpp">

<violation number="1" location="server/src/common/dflash_draft_ipc.cpp:39">
P3: The new IPC env/size helper block duplicates existing helper logic from `qwen35_target_shard_ipc.cpp`; a shared utility should be extracted and used by both to avoid behavior drift across IPC clients.</violation>
</file>

Comment thread server/src/common/dflash_feature_ring.cpp Outdated
Comment thread server/src/laguna/laguna_layer_split_adapter.h
Comment thread server/src/common/layer_split_runtime.h Outdated
Comment thread server/src/common/dflash_draft_ipc.cpp
Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp Outdated
Comment thread server/src/common/dflash_draft_ipc_daemon.cpp
Comment thread server/src/laguna/laguna_target_loader.cpp
return transport;
}

bool checked_mul_size(size_t a, size_t b, size_t & out) {

P3: The new IPC env/size helper block duplicates existing helper logic from qwen35_target_shard_ipc.cpp; a shared utility should be extracted and used by both to avoid behavior drift across IPC clients.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/dflash_draft_ipc.cpp, line 39:

<comment>The new IPC env/size helper block duplicates existing helper logic from `qwen35_target_shard_ipc.cpp`; extract/shared utility should be used to avoid behavior drift across IPC clients.</comment>

<file context>
@@ -11,12 +11,71 @@
+    return transport;
+}
+
+bool checked_mul_size(size_t a, size_t b, size_t & out) {
+    if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
+        return false;
</file context>

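One way to act on this comment is to move the overflow-checked multiply into a header that both IPC clients include. The sketch below is illustrative only: the first three lines of the function body match the file context above, while the remaining lines and the idea of an `inline` header helper are assumptions, since the rest of the original function is not shown.

```cpp
#include <cstddef>
#include <limits>

// Sketch of a shared helper that both IPC clients could include.
// Writes a * b to `out` and returns true, or returns false and leaves `out`
// untouched if the product would overflow size_t.
inline bool checked_mul_size(std::size_t a, std::size_t b, std::size_t & out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return false;  // a * b would not fit in size_t
    }
    out = a * b;
    return true;
}
```

Keeping a single definition means a later fix to the overflow check applies to every caller at once, which is the behavior drift the comment is concerned about.
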
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the Luce-Org#291/Luce-Org#290 draft-residency integration, newly non-draft Luce-Org#321/Luce-Org#325 classification, validation, and retained worktree/transcript paths for the May 31 13:30 UTC run.

@cubic-dev-ai (bot, Contributor) left a comment

2 issues found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/laguna/laguna_layer_split_adapter.cpp">

<violation number="1" location="server/src/laguna/laguna_layer_split_adapter.cpp:507">
P1: `snapshot_adopt` shares `ctx/buf` across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.</violation>
</file>

<file name="server/src/common/dflash_draft_ipc_daemon.cpp">

<violation number="1">
P1: Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</violation>
</file>

for (auto & shard_snap : snap.shards) {
shard_snap.attn_k.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
shard_snap.attn_v.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
shard_snap.ctx = ctx;

P1: snapshot_adopt shares ctx/buf across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/laguna/laguna_layer_split_adapter.cpp, line 507:

<comment>`snapshot_adopt` shares `ctx/buf` across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.</comment>

<file context>
@@ -332,6 +367,203 @@ bool LagunaLayerSplitAdapter::snapshot_restore(int slot) {
+    for (auto & shard_snap : snap.shards) {
+        shard_snap.attn_k.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.attn_v.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.ctx = ctx;
+        shard_snap.buf = buf;
+        shard_snap.cur_pos = cur_pos;
</file context>

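A common way to rule out this class of double free is to validate first and give the shared allocation exactly one owner, with each shard holding only a non-owning view. The sketch below is a generic illustration of that pattern, not the adapter's code: `Backing`, `ShardSnapshot`, `SplitSnapshot`, and `adopt_snapshot` are hypothetical stand-ins, and the validation shown is a placeholder for the real per-shard tensor checks.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for the shared backing allocation (ctx / buf in the real code).
// It bumps a counter on destruction so frees can be observed.
struct Backing {
    explicit Backing(int * frees) : frees_(frees) {}
    ~Backing() { ++*frees_; }
    Backing(const Backing &) = delete;
    Backing & operator=(const Backing &) = delete;
    int * frees_;
};

struct ShardSnapshot {
    const Backing * backing = nullptr;  // non-owning view into the shared memory
    int cur_pos = 0;
};

struct SplitSnapshot {
    std::unique_ptr<Backing> owner;     // exactly one owner for all shards
    std::vector<ShardSnapshot> shards;
};

// On failure, `dst` is untouched and `incoming` still owns the memory, so the
// caller frees it once. On success, ownership moves into `dst.owner`.
bool adopt_snapshot(SplitSnapshot & dst,
                    std::unique_ptr<Backing> & incoming,
                    std::size_t n_shards,
                    int cur_pos) {
    // Placeholder validation; the real checks inspect tensors per shard.
    if (!incoming || n_shards == 0 || cur_pos < 0) {
        return false;
    }
    std::vector<ShardSnapshot> staged(n_shards);
    for (ShardSnapshot & s : staged) {
        s.backing = incoming.get();
        s.cur_pos = cur_pos;
    }
    dst.owner  = std::move(incoming);
    dst.shards = std::move(staged);
    return true;
}
```

Because the shard views are staged in a temporary and published only after validation succeeds, no failure path leaves two objects believing they own the same memory.
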
@@ -21,11 +21,17 @@
#include <cstddef>

P1: Continuing after a malformed propose_pipe leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/dflash_draft_ipc_daemon.cpp, line 454:

<comment>Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</comment>

<file context>
@@ -451,7 +451,7 @@ int run_dflash_draft_ipc_daemon(const char * draft_path,
                              line.c_str());
                 stream_status(stream_fd, -1);
-                break;
+                continue;
             }
             if (!read_exact_fd(payload_fd, noise_embed.data(), bytes)) {
</file context>
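Suggested change: `break;` (displayed against the anchor line `#include <cstddef>`; the file context above indicates it is meant to replace `continue;`)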
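
The desynchronization risk can be reproduced with a toy framing. The sketch below is not the daemon's protocol: the `[kind][len][payload]` layout, the `OnMalformed` policies, and `serve` are assumptions made for illustration. It shows that continuing after a malformed header without consuming its payload causes the leftover bytes to be parsed as the next header.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

enum class OnMalformed { Stop, ContinueWithoutDraining, DrainThenContinue };

// Toy framing (an assumption for illustration): [kind][len][payload: len bytes],
// where kind == 1 is the only valid request. Returns the payloads accepted as valid.
std::vector<Bytes> serve(const Bytes & pipe, OnMalformed policy) {
    std::vector<Bytes> accepted;
    std::size_t pos = 0;
    while (pos + 2 <= pipe.size()) {
        const std::uint8_t kind = pipe[pos];
        const std::size_t  len  = pipe[pos + 1];
        pos += 2;
        if (kind != 1) {
            if (policy == OnMalformed::Stop) {
                break;  // give up on this stream
            }
            if (policy == OnMalformed::DrainThenContinue) {
                if (pos + len > pipe.size()) {
                    break;  // cannot drain a truncated payload
                }
                pos += len;  // consume the payload of the rejected request
            }
            continue;  // without draining, payload bytes stay unread
        }
        if (pos + len > pipe.size()) {
            break;  // truncated payload
        }
        const auto first = pipe.begin() + static_cast<std::ptrdiff_t>(pos);
        accepted.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
        pos += len;
    }
    return accepted;
}
```

For the stream `{9, 2, 1, 1, 1, 1, 0xAA}` (a malformed request with a two-byte payload followed by one valid request), the no-drain policy accepts a phantom request built from the rejected payload and then misreads the real one. Draining only works when the length field of the rejected request can be trusted, which is why stopping is the conservative choice.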

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the exact Luce-Org#290/Luce-Org#291 merges, current Luce-Org#321/Luce-Org#325 classification, retained worktrees, and validation for the 2026-05-31 13:57 integration run.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Selectively carries the same-backend Qwen3.5 layer-split disk prefix-cache snapshot export/adopt slice from PR Luce-Org#325 while leaving the mixed-backend runtime and Laguna cache work blocked on the larger PR Luce-Org#321 architecture reconciliation.

Also refreshes the auto-integration manifest/run log with the current PR classification and retained worktree notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port the remaining narrow PR Luce-Org#325 disk-prefix-cache cleanup for current layout: allow lookups against layouts learned from disk, validate the adopted snapshot against the live backend layout, and reindex on mismatch. Refresh auto-integration metadata after reprobing current non-ancestor PRs.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow PR Luce-Org#325 IPC robustness slice: honor explicit DFlash draft IPC auto transport, size shared payload capacity from live draft dimensions, and let backend IPC auto transport fall back to stream if shared setup is unavailable.

Validation: git diff --check; conflict-marker search; Codex review reported no blocking findings. Local syntax-only compile remains blocked by missing dflash27b.h in this checkout.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Selectively ports the same-backend Laguna layer-split prefix-cache disk snapshot surface from PR Luce-Org#325. Exports CPU-backed snapshot refs, adopts deserialized shard/logit tensors with temporary validation before taking ownership, and refreshes the integration manifest/probe log.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Carry the next PR Luce-Org#325 selective-port slice by narrowing the server placement validation guard: same-backend target layer split may now use --kv-cache-dir after the disk snapshot/adopt paths already ported, while mixed-backend target layer split remains blocked until remote shard disk snapshot IPC export/import exists.

Update the auto-integration manifest with current PR classification, probe results, and retained worktrees.

Validation: git diff --check; conflict-marker search in promoted files; stack.yaml syntax check via file write linter; tmux Codex review reported no findings.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-05-31 23:00 metadata/probe refresh: current PR-head containment, Luce-Org#321/Luce-Org#325 conflict probes, and tmux Claude/Codex read-only delegation outcomes. No source changes were promoted.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 00:28 unattended probe pass. No source changes were promoted; Luce-Org#321 still needs live Qwen35 mixed-target adapter wiring, while Luce-Org#325's non-Luce-Org#321 same-backend disk-prefix-cache behavior is represented pending Luce-Org#321 mixed-target wiring.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 03:25 unattended refresh, including exact open PR head containment, direct-merge conflict counts, and read-only Luce-Org#325 delegation results. No source changes were promoted.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record current heads for PR Luce-Org#321 and Luce-Org#325 as represented by the auto-integration stack after direct merge and tmux-delegated conflict-resolution attempts confirmed the remaining diffs are already carried by current-layout port commits.