feat(server): add DFlash disk prefix cache for target layer split by weicj · Pull Request #325 · Luce-Org/lucebox-hub

weicj · 2026-05-31T16:54:55Z

Summary

This PR restores disk prefix-cache support for same-backend target layer split, including the DFlash draft path. Target layer split already had in-process prefix-cache support, but the restart-persistent --kv-cache-dir path was still blocked; this PR brings the split+DFlash path back to parity with the single-backend disk-backed prefix restore behavior.

Previously, dflash_server rejected --kv-cache-dir when --target-devices was enabled because the existing disk snapshot format only matched a single backend snapshot. Under target layer split, the live prefix state is sharded across multiple target shards, so a disk hit could not be safely exported, loaded after restart, and rebound back into the split backend.

This change adds a flattened layer-split disk snapshot for the same-backend path. Each shard snapshot is exported into one disk-owned CPU snapshot with shard-prefixed tensor names, then adopted back into the shard-local snapshot slots on cache lookup. For DFlash, the snapshot also persists the draft feature mirror metadata and feature rows, so a restored target prefix can continue speculative decode instead of falling back to a target-only cache state.

Changes

Adds snapshot_ref() and snapshot_adopt() hooks to the LayerSplitAdapter / LayerSplitBackend boundary, so the server disk-cache layer can save and load snapshots through the generic backend interface.
Adds Qwen35 target-layer-split disk snapshot export/import:
- flattens shard-local snapshot tensors into one CPU snapshot for disk persistence;
- prefixes shard tensors as ls<shard>_<tensor-name> so they can be rebound to the correct target shard on load;
- stores and restores the global snap_prefill_logits tensor used by prefix restore;
- stores and restores DFlash feature-mirror state through dflash_feature_meta and dflash_feature_data;
- validates only the tensors required by each shard's actual cache state;
- keeps shared adopted ctx / buf ownership from being double-freed.
Updates disk-cache cold-start lookup so a layout learned from disk can be verified after successful backend adoption, instead of blocking the first restart-time lookup before the live layout is available.
Updates server placement validation to allow --kv-cache-dir with same-backend --target-devices.
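
The `ls<shard>_<tensor-name>` naming described above can be sketched as a pair of helpers. This is a minimal illustration rather than the PR's code: the function names `make_shard_tensor_name` and `parse_shard_tensor_name` are hypothetical, and only the prefix shape is taken from the description. It assumes shard indices are small enough that the decimal parse cannot overflow.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper: builds "ls<shard>_<tensor-name>" for the flattened snapshot.
std::string make_shard_tensor_name(std::size_t shard, std::string_view tensor_name) {
    std::string out = "ls" + std::to_string(shard) + "_";
    out.append(tensor_name);
    return out;
}

// Hypothetical helper: returns {shard index, original tensor name}, or nullopt
// for names that are not shard-prefixed (for example a global tensor such as
// snap_prefill_logits, which is stored without a shard prefix in this sketch).
std::optional<std::pair<std::size_t, std::string>>
parse_shard_tensor_name(std::string_view name) {
    if (name.size() < 4 || name.substr(0, 2) != "ls") return std::nullopt;
    std::size_t i = 2;
    std::size_t shard = 0;
    bool have_digit = false;
    while (i < name.size() && name[i] >= '0' && name[i] <= '9') {
        shard = shard * 10 + static_cast<std::size_t>(name[i] - '0');
        have_digit = true;
        ++i;
    }
    // Require at least one digit, then the '_' separator, then a non-empty name.
    if (!have_digit || i >= name.size() || name[i] != '_') return std::nullopt;
    std::string rest(name.substr(i + 1));
    if (rest.empty()) return std::nullopt;
    return std::make_pair(shard, std::move(rest));
}
```

On load, a parsed shard index would select the shard-local snapshot slot that the tensor is rebound to, while unprefixed names would be treated as global tensors.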

Notes

Runtime validation passed on dual Pro VII / ROCm 6.3.3 with Qwen3.6-27B Q4 target and the DFlash Qwen3.6 draft, split across hip:0,hip:1. The first server process saved disk prefix cache; after restart, the second process logged [target-split] adopted disk snapshot, disk_hit=true, restore=true, and DFlash speculative decode with accepted draft tokens.
Mixed-backend target split disk cache remains a follow-up because remote shard snapshots need IPC export/import support.

cubic-dev-ai

8 issues found across 44 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/dflash_draft_ipc.cpp">

<violation number="1" location="server/src/common/dflash_draft_ipc.cpp:39">
P3: The new IPC env/size helper block duplicates existing helper logic from `qwen35_target_shard_ipc.cpp`; extract/shared utility should be used to avoid behavior drift across IPC clients.</violation>
</file>


cubic-dev-ai · 2026-05-31T17:34:38Z

+    return transport;
+}
+
+bool checked_mul_size(size_t a, size_t b, size_t & out) {


P3: The new IPC env/size helper block duplicates existing helper logic from qwen35_target_shard_ipc.cpp; extract/shared utility should be used to avoid behavior drift across IPC clients.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/common/dflash_draft_ipc.cpp, line 39:
<comment>The new IPC env/size helper block duplicates existing helper logic from `qwen35_target_shard_ipc.cpp`; extract/shared utility should be used to avoid behavior drift across IPC clients.</comment>
<file context>
@@ -11,12 +11,71 @@
+    return transport;
+}
+
+bool checked_mul_size(size_t a, size_t b, size_t & out) {
+    if (a != 0 && b > std::numeric_limits<size_t>::max() / a) {
+        return false;
</file context>
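
The quoted file context is cut off after `return false;`. One way to address the duplication concern would be a single shared helper that both IPC clients include. The sketch below keeps the overflow check exactly as quoted; the tail of the function (`out = a * b; return true;`) and the `ipc_util` namespace are assumptions, not the PR's code.

```cpp
#include <cstddef>
#include <limits>

namespace ipc_util {  // hypothetical shared namespace, not taken from the PR

// Multiplies a and b into out; returns false instead of wrapping on overflow.
inline bool checked_mul_size(std::size_t a, std::size_t b, std::size_t & out) {
    if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
        return false;  // a * b would not fit in size_t; out is left untouched
    }
    out = a * b;  // assumed tail: the quoted context ends before this point
    return true;
}

}  // namespace ipc_util
```

Keeping one definition means a later change to the overflow rule applies to every IPC client at once, which is the behavior drift the review comment warns about.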

Record the Luce-Org#291/Luce-Org#290 draft-residency integration, newly non-draft Luce-Org#321/Luce-Org#325 classification, validation, and retained worktree/transcript paths for the May 31 13:30 UTC run.

cubic-dev-ai

2 issues found across 8 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/laguna/laguna_layer_split_adapter.cpp">

<violation number="1" location="server/src/laguna/laguna_layer_split_adapter.cpp:507">
P1: `snapshot_adopt` shares `ctx/buf` across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.</violation>
</file>

<file name="server/src/common/dflash_draft_ipc_daemon.cpp">

<violation number="1">
P1: Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</violation>
</file>


cubic-dev-ai · 2026-05-31T18:00:19Z

+    for (auto & shard_snap : snap.shards) {
+        shard_snap.attn_k.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.attn_v.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.ctx = ctx;


P1: snapshot_adopt shares ctx/buf across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/laguna/laguna_layer_split_adapter.cpp, line 507:
<comment>`snapshot_adopt` shares `ctx/buf` across shard snapshots before validation, so failure paths can double-free the adopted snapshot memory.</comment>
<file context>
@@ -332,6 +367,203 @@ bool LagunaLayerSplitAdapter::snapshot_restore(int slot) {
+    for (auto & shard_snap : snap.shards) {
+        shard_snap.attn_k.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.attn_v.assign(shards_.empty() ? 0 : shards_.front().weights.n_layer, nullptr);
+        shard_snap.ctx = ctx;
+        shard_snap.buf = buf;
+        shard_snap.cur_pos = cur_pos;
</file context>
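
A common way to avoid the double-free described here is to stage the shard snapshots in a temporary, validate them, and only then commit, with the shared storage held through reference counting so it is released exactly once. The sketch below is illustrative only: `AdoptedStorage`, `ShardSnapshot`, and `adopt_snapshot` are hypothetical names standing in for the real ctx/buf pair and adapter types, and the per-shard validation is reduced to a boolean.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for the adopted ctx/buf pair. It counts releases so
// the ownership behavior can be observed from outside.
struct AdoptedStorage {
    explicit AdoptedStorage(int * free_counter) : free_counter_(free_counter) {}
    ~AdoptedStorage() { ++*free_counter_; }
    AdoptedStorage(const AdoptedStorage &) = delete;
    AdoptedStorage & operator=(const AdoptedStorage &) = delete;
private:
    int * free_counter_;
};

struct ShardSnapshot {
    std::shared_ptr<AdoptedStorage> storage;  // shared by all shards, freed once
    int cur_pos = 0;
};

// Stages shard views in a temporary, validates each one, and commits to `live`
// only if every shard passes. On failure `live` is left unchanged.
bool adopt_snapshot(std::shared_ptr<AdoptedStorage> storage,
                    const std::vector<bool> & shard_valid,
                    int cur_pos,
                    std::vector<ShardSnapshot> & live) {
    std::vector<ShardSnapshot> staged;
    staged.reserve(shard_valid.size());
    for (bool ok : shard_valid) {
        if (!ok) {
            return false;  // staged copies and `storage` are released by RAII
        }
        ShardSnapshot snap;
        snap.storage = storage;
        snap.cur_pos = cur_pos;
        staged.push_back(std::move(snap));
    }
    live = std::move(staged);
    return true;
}
```

With raw pointers the same effect needs an explicit single owner plus non-owning shard views; the reference-counted version is shown because it makes the single release easy to verify.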

cubic-dev-ai · 2026-05-31T18:00:19Z

@@ -21,11 +21,17 @@
 #include <cstddef>


P1: Continuing after a malformed propose_pipe leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/common/dflash_draft_ipc_daemon.cpp, line 454:
<comment>Continuing after a malformed `propose_pipe` leaves unread payload bytes in the pipe, which can desynchronize subsequent requests.</comment>
<file context>
@@ -451,7 +451,7 @@ int run_dflash_draft_ipc_daemon(const char * draft_path,
                         line.c_str());
                 stream_status(stream_fd, -1);
-                break;
+                continue;
             }
             if (!read_exact_fd(payload_fd, noise_embed.data(), bytes)) {
</file context>

Suggested change

#include <cstddef>

break;
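
The desynchronization risk can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the daemon's code: `PayloadPipe`, `Request`, and `serve` are hypothetical, and the pipe is modeled as a string whose bytes are consumed in order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the payload pipe.
struct PayloadPipe {
    std::string bytes;
    std::size_t pos = 0;

    // Reads exactly n bytes into out, or fails without consuming anything.
    bool read_exact(std::size_t n, std::string & out) {
        if (bytes.size() - pos < n) return false;
        out.assign(bytes, pos, n);
        pos += n;
        return true;
    }
};

struct Request {
    bool well_formed;           // whether the command line parsed correctly
    std::size_t payload_bytes;  // bytes the sender already wrote to the pipe
};

// Returns the payloads handed to well-formed requests.
// stop_on_malformed == true models `break`; false models `continue`
// without draining the malformed request's payload.
std::vector<std::string> serve(const std::vector<Request> & reqs,
                               PayloadPipe & pipe,
                               bool stop_on_malformed) {
    std::vector<std::string> seen;
    for (const Request & r : reqs) {
        if (!r.well_formed) {
            if (stop_on_malformed) break;
            continue;  // r's payload is still sitting in the pipe
        }
        std::string payload;
        if (!pipe.read_exact(r.payload_bytes, payload)) break;
        seen.push_back(payload);
    }
    return seen;
}
```

With a malformed request followed by a valid one, the `continue` variant hands the valid request the malformed request's bytes. Stopping avoids that; draining the declared payload before continuing would be another option when its size is still trustworthy.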

Record the exact Luce-Org#290/Luce-Org#291 merges, current Luce-Org#321/Luce-Org#325 classification, retained worktrees, and validation for the 2026-05-31 13:57 integration run.

Selectively carries the same-backend Qwen3.5 layer-split disk prefix-cache snapshot export/adopt slice from PR Luce-Org#325 while leaving the mixed-backend runtime and Laguna cache work blocked on the larger PR Luce-Org#321 architecture reconciliation. Also refreshes the auto-integration manifest/run log with the current PR classification and retained worktree notes.

Port the remaining narrow PR Luce-Org#325 disk-prefix-cache cleanup for current layout: allow lookups against layouts learned from disk, validate the adopted snapshot against the live backend layout, and reindex on mismatch. Refresh auto-integration metadata after reprobing current non-ancestor PRs.

Port a narrow PR Luce-Org#325 IPC robustness slice: honor explicit DFlash draft IPC auto transport, size shared payload capacity from live draft dimensions, and let backend IPC auto transport fall back to stream if shared setup is unavailable.

Validation: git diff --check; conflict-marker search; Codex review reported no blocking findings. Local syntax-only compile remains blocked by missing dflash27b.h in this checkout.
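
The auto-transport fallback mentioned in this commit message can be sketched as a small resolution function. The enum and function names are hypothetical, and the treatment of an explicit shared request whose setup fails (reported as an error rather than silently downgraded) is an assumption; the commit message only states the auto case.

```cpp
#include <optional>

// Hypothetical transport selector; names are illustrative, not the PR's API.
enum class Transport { Auto, Shared, Stream };

// Resolves the requested transport to a concrete one.
// Returns nullopt when an explicit Shared request cannot be satisfied
// (assumed behavior: explicit choices are not downgraded).
std::optional<Transport> resolve_transport(Transport requested, bool shared_setup_ok) {
    switch (requested) {
        case Transport::Stream:
            return Transport::Stream;
        case Transport::Shared:
            if (shared_setup_ok) return Transport::Shared;
            return std::nullopt;
        case Transport::Auto:
            // Auto prefers shared payloads and falls back to stream.
            return shared_setup_ok ? Transport::Shared : Transport::Stream;
    }
    return std::nullopt;  // unreachable for valid enum values
}
```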

Selectively ports the same-backend Laguna layer-split prefix-cache disk snapshot surface from PR Luce-Org#325. Exports CPU-backed snapshot refs, adopts deserialized shard/logit tensors with temporary validation before taking ownership, and refreshes the integration manifest/probe log.

Carry the next PR Luce-Org#325 selective-port slice by narrowing the server placement validation guard: same-backend target layer split may now use --kv-cache-dir after the disk snapshot/adopt paths already ported, while mixed-backend target layer split remains blocked until remote shard disk snapshot IPC export/import exists.

Update the auto-integration manifest with current PR classification, probe results, and retained worktrees.

Validation: git diff --check; conflict-marker search in promoted files; stack.yaml syntax check via file write linter; tmux Codex review reported no findings.

Record the 2026-05-31 23:00 metadata/probe refresh: current PR-head containment, Luce-Org#321/Luce-Org#325 conflict probes, and tmux Claude/Codex read-only delegation outcomes. No source changes were promoted.

Record the 2026-06-01 00:28 unattended probe pass. No source changes were promoted; Luce-Org#321 still needs live Qwen35 mixed-target adapter wiring, while Luce-Org#325's non-Luce-Org#321 same-backend disk-prefix-cache behavior is represented pending Luce-Org#321 mixed-target wiring.

Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.

Record the 2026-06-01 03:25 unattended refresh, including exact open PR head containment, direct-merge conflict counts, and read-only Luce-Org#325 delegation results. No source changes were promoted.

Record current heads for PR Luce-Org#321 and Luce-Org#325 as represented by the auto-integration stack after direct merge and tmux-delegated conflict-resolution attempts confirmed the remaining diffs are already carried by current-layout port commits.

weicj added 6 commits May 31, 2026 19:09

fix(server): enable sampling for target layer split

a282430

feat(server): add Laguna target-layer-split adapter

c1e4afe

refactor(server): share target layer-split runtime helpers

5ae5fa4

feat(server): add selectable backend IPC payload transport

09fe1ae

feat(server): support DFlash with mixed-backend target split

71b3e98

feat(server): add DFlash disk prefix cache for target layer split

a043547

weicj marked this pull request as ready for review May 31, 2026 17:24

cubic-dev-ai Bot reviewed May 31, 2026

fix(server): complete layer split disk cache cleanup

b47fb3a

cubic-dev-ai Bot reviewed May 31, 2026

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026

Merge PR Luce-Org#325 as represented in auto-integration

bcf5462

weicj wants to merge 7 commits into Luce-Org:main from weicj:feat-layer-split-disk-prefix-cache
