feat(server): support DFlash with mixed-backend target layer split by weicj · Pull Request #321 · Luce-Org/lucebox-hub

weicj · 2026-05-31T11:04:16Z

Summary

This PR lets target layer split run across different backends and completes the DFlash speculative decode path on top of mixed-backend target split.

Same-backend layer split could already shard the target across multiple GPUs from the same backend, but CUDA/HIP mixed placement was limited to draft/target separation. The target itself could not be split across backend processes. DFlash also needs more than a plain target forward: verify requires hidden-state capture, draft feature ring or remote draft IPC forwarding, target KV snapshot/restore, and final token projection. This PR wires those required DFlash pieces into the mixed target shard IPC path.
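To illustrate why verify needs target KV snapshot/restore, here is a minimal toy sketch of the rollback step. `ToyKvCache` and `verify_draft` are hypothetical names that do not come from this PR, and the "cache" is just a vector of token ids; the real path snapshots backend-side KV state and also handles capture and token projection.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a target KV cache: one entry per committed token.
// Hypothetical type for illustration only.
struct ToyKvCache {
    std::vector<int> tokens;

    // A snapshot here is simply the committed length.
    std::size_t snapshot() const { return tokens.size(); }

    // Restore drops every entry appended after the snapshot was taken.
    void restore(std::size_t snap) { tokens.resize(snap); }
};

// Verify draft tokens against the tokens the target would have chosen.
// Returns the number of accepted draft tokens; on return the cache holds
// the previously committed tokens plus exactly the accepted prefix.
std::size_t verify_draft(ToyKvCache & kv,
                         const std::vector<int> & draft,
                         const std::vector<int> & target) {
    const std::size_t snap = kv.snapshot();

    // The verify pass appends cache entries for every draft position.
    for (int t : draft) {
        kv.tokens.push_back(t);
    }

    // Longest prefix on which draft and target agree.
    std::size_t accepted = 0;
    while (accepted < draft.size() && accepted < target.size() &&
           draft[accepted] == target[accepted]) {
        ++accepted;
    }

    // Roll back the speculative entries, then re-commit the accepted prefix.
    kv.restore(snap);
    for (std::size_t i = 0; i < accepted; ++i) {
        kv.tokens.push_back(draft[i]);
    }
    return accepted;
}
```

With a target split across processes, each shard owns part of the cache, so the snapshot and the restore have to be issued to the remote shard as well as the local one.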

Changes

Add a mixed-backend target shard IPC path: the local shard can run the first target layers and hand the boundary activation to another backend process for the remaining target layers.
Let the remote target shard return DFlash capture slices; mixed forward writes local/remote captures into the local DraftFeatureMirror or forwards them to remote draft IPC.
Add target KV snapshot / restore support to the remote target shard for DFlash speculative verify rollback.
Add hidden-state-to-token projection on the remote target shard so DFlash target split can finish token decisions when the final layers / LM head live remotely.
Support in-memory prefix cache for mixed target split: the local shard keeps its local prefix snapshot, while the remote target shard keeps the matching slot inside its own backend process and restores it through IPC control commands instead of transferring large KV payloads.
Extend server placement validation from requiring all target-split shards to run on the current compiled backend to allowing one local backend group plus one remote backend group when --target-shard-ipc-bin is provided.
Keep same-backend target layer split on the existing in-process local runtime path.
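The relaxed placement rule can be sketched roughly as follows. This is an assumption-laden paraphrase of the description above, not the PR's validation code: `placement_is_valid` and its parameters are hypothetical, and the sketch does not check whether each backend group is contiguous in layer order.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical rule paraphrasing the PR description:
//  - all shards on the compiled (local) backend: always allowed;
//  - one local group plus exactly one other backend: allowed only when a
//    target-shard IPC binary has been configured;
//  - anything else: rejected.
bool placement_is_valid(const std::vector<std::string> & shard_backends,
                        const std::string & compiled_backend,
                        bool has_target_shard_ipc_bin) {
    if (shard_backends.empty()) {
        return false;
    }
    bool has_local = false;
    std::set<std::string> remote_backends;
    for (const std::string & backend : shard_backends) {
        if (backend == compiled_backend) {
            has_local = true;
        } else {
            remote_backends.insert(backend);
        }
    }
    if (remote_backends.empty()) {
        return true;  // same-backend split stays on the in-process path
    }
    return has_local && remote_backends.size() == 1 && has_target_shard_ipc_bin;
}
```

For the validated configuration, `placement_is_valid({"cuda", "hip", "hip"}, "cuda", true)` passes, while the same placement without the IPC binary is rejected.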

Notes

Local runtime validation covered CUDA Tesla P4 + dual HIP Pro VII with Qwen3.6-27B Q4 target split across cuda:0,hip:0,hip:1 and layer split 0.08,0.46,0.46; logs show CUDA running layers [0,5), and the two HIP shards running [5,35) and [35,64).
DFlash validation covered both local CUDA draft and remote HIP draft IPC modes; both returned valid OpenAI-compatible responses and server logs reported accepted draft tokens.
In-memory prefix cache was validated on the 27B DFlash mixed-target-split path: the first request committed an inline snapshot, and the second identical request hit restore=true.
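As a sanity check on the logged layer ranges, the sketch below converts split ratios into half-open layer ranges by rounding the cumulative boundaries. It assumes 64 target layers, as the logged `[35,64)` range implies. `split_layers` is a hypothetical helper, and this rounding rule is only one scheme that reproduces the logged ranges; the description does not state which rule the runtime uses.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Convert split ratios into half-open layer ranges [begin, end) by rounding
// each cumulative boundary to the nearest layer index.
std::vector<std::pair<int, int>> split_layers(int n_layers,
                                              const std::vector<double> & ratios) {
    std::vector<std::pair<int, int>> ranges;

    double total = 0.0;
    for (double r : ratios) {
        total += r;
    }
    if (n_layers <= 0 || total <= 0.0) {
        return ranges;  // nothing sensible to split
    }

    double cumulative = 0.0;
    int begin = 0;
    for (std::size_t i = 0; i < ratios.size(); ++i) {
        cumulative += ratios[i];
        // The last shard always ends at n_layers so no layer is dropped.
        const int end = (i + 1 == ratios.size())
            ? n_layers
            : static_cast<int>(std::lround(cumulative / total * n_layers));
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

With `split_layers(64, {0.08, 0.46, 0.46})` the cumulative boundaries are 5.12 and 34.56, which round to 5 and 35, giving `[0,5)`, `[5,35)`, and `[35,64)`.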

Record draft PR Luce-Org#321 from the final PR re-enumeration and confirm no new non-draft PR appeared after the auto-integration push.

cubic-dev-ai

6 issues found across 43 files

Record the Luce-Org#291/Luce-Org#290 draft-residency integration, newly non-draft Luce-Org#321/Luce-Org#325 classification, validation, and retained worktree/transcript paths for the May 31 13:30 UTC run.

cubic-dev-ai

1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/qwen35_target_shard_ipc.cpp">

<violation number="1" location="server/src/qwen35/qwen35_target_shard_ipc.cpp:60">
P2: The negative-value guard only checks `raw[0]`, so signed inputs with leading whitespace (for example `"   -1"`) still pass through `strtoull` and can produce an unintended huge shared-memory size.</violation>
</file>

cubic-dev-ai · 2026-05-31T17:55:55Z

+    if (raw[0] == '-') {
+        return required_bytes;
+    }
+    char * end = nullptr;
+    const unsigned long long parsed = std::strtoull(raw, &end, 10);


P2: The negative-value guard only checks `raw[0]`, so signed inputs with leading whitespace (for example `"   -1"`) still pass through `strtoull` and can produce an unintended huge shared-memory size.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/qwen35/qwen35_target_shard_ipc.cpp, line 60:

<comment>The negative-value guard only checks `raw[0]`, so signed inputs with leading whitespace (for example `"   -1"`) still pass through `strtoull` and can produce an unintended huge shared-memory size.</comment>

<file context>
@@ -57,9 +57,13 @@ size_t target_shard_shared_bytes_from_env(size_t required_bytes) {
     if (!raw || !*raw) {
         return required_bytes;
     }
+    if (raw[0] == '-') {
+        return required_bytes;
+    }
</file context>

Suggested change

-    if (raw[0] == '-') {
-        return required_bytes;
-    }
-    char * end = nullptr;
-    const unsigned long long parsed = std::strtoull(raw, &end, 10);
+    const char * p = raw;
+    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r' || *p == '\f' || *p == '\v') {
+        ++p;
+    }
+    if (*p == '-') {
+        return required_bytes;
+    }
+    char * end = nullptr;
+    const unsigned long long parsed = std::strtoull(p, &end, 10);
+    if (end == p || *end != '\0' ||
+        parsed > (unsigned long long)std::numeric_limits<size_t>::max()) {
+        return required_bytes;
+    }
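The underlying behavior is standard: `strtoull` skips leading whitespace, accepts an optional sign, and negates the converted value in the unsigned return type, so `"   -1"` converts to `ULLONG_MAX` without reporting an error. Below is a self-contained sketch of a stricter parser. `parse_bytes_or` is a hypothetical helper, not the function in `qwen35_target_shard_ipc.cpp`, and it is stricter than the suggestion above in that it also rejects a leading `+`.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <limits>

// Parse a non-negative decimal byte count. Any malformed, signed, or
// out-of-range input yields `fallback` instead of a wrapped value.
std::size_t parse_bytes_or(const char * raw, std::size_t fallback) {
    if (!raw) {
        return fallback;
    }

    // Skip the leading whitespace that strtoull would skip anyway, so the
    // next check sees the first significant character.
    const char * p = raw;
    while (std::isspace(static_cast<unsigned char>(*p))) {
        ++p;
    }

    // Requiring a digit here rejects "-1", "+1", and empty input in one test.
    if (!std::isdigit(static_cast<unsigned char>(*p))) {
        return fallback;
    }

    errno = 0;
    char * end = nullptr;
    const unsigned long long parsed = std::strtoull(p, &end, 10);
    if (end == p || *end != '\0' || errno == ERANGE ||
        parsed > std::numeric_limits<std::size_t>::max()) {
        return fallback;
    }
    return static_cast<std::size_t>(parsed);
}
```

Checking `errno == ERANGE` additionally catches inputs too large for `unsigned long long`, which `strtoull` clamps to `ULLONG_MAX`.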

Record the exact Luce-Org#290/Luce-Org#291 merges, current Luce-Org#321/Luce-Org#325 classification, retained worktrees, and validation for the 2026-05-31 13:57 integration run.

Selectively carries the same-backend Qwen3.5 layer-split disk prefix-cache snapshot export/adopt slice from PR Luce-Org#325 while leaving the mixed-backend runtime and Laguna cache work blocked on the larger PR Luce-Org#321 architecture reconciliation. Also refreshes the auto-integration manifest/run log with the current PR classification and retained worktree notes.

Record the 2026-05-31 15:02 unattended run: exact open-PR containment, fresh conflict probe counts for the eight remaining non-ancestor candidates, and the tmux-driven Luce-Org#321 Claude/Codex read-only attempts. No source changes were promoted.

Carry the conflict-free PR Luce-Org#321 placement foundation over the current auto-integration stack. DevicePlacement now records per-shard backends, parses mixed backend layer-split device lists, validates duplicate devices by backend plus GPU, and extends placement unit coverage.

The target-shard IPC/runtime pieces remain documented as pending selective-port work.
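A rough sketch of what parsing a mixed-backend device list with duplicate detection by backend plus GPU could look like is shown here. `ShardDevice` and `parse_device_list` are hypothetical names; the actual DevicePlacement parser is not shown on this page and may accept other forms.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-shard placement entry.
struct ShardDevice {
    std::string backend;
    int gpu;
};

// Parse a list such as "cuda:0,hip:0,hip:1" into per-shard entries.
// Returns false on a malformed item or on a repeated (backend, gpu) pair.
// "cuda:0,hip:0" is valid: the GPU index repeats but the backend differs.
bool parse_device_list(const std::string & spec, std::vector<ShardDevice> & out) {
    out.clear();
    std::set<std::pair<std::string, int>> seen;
    std::size_t pos = 0;
    while (pos <= spec.size()) {
        std::size_t comma = spec.find(',', pos);
        if (comma == std::string::npos) {
            comma = spec.size();
        }
        const std::string item = spec.substr(pos, comma - pos);
        const std::size_t colon = item.find(':');
        if (colon == std::string::npos || colon == 0 || colon + 1 >= item.size()) {
            return false;  // need both a backend name and an index
        }
        const std::string backend = item.substr(0, colon);
        const std::string index = item.substr(colon + 1);
        // Digits only, and short enough that std::stoi cannot overflow.
        if (index.size() > 4 ||
            index.find_first_not_of("0123456789") != std::string::npos) {
            return false;
        }
        const int gpu = std::stoi(index);
        if (!seen.insert(std::make_pair(backend, gpu)).second) {
            return false;  // same backend and same GPU listed twice
        }
        out.push_back(ShardDevice{backend, gpu});
        pos = comma + 1;
    }
    return true;
}
```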

Port a narrow PR Luce-Org#321 control-plane slice by adding RemoteTargetShardConfig, threading it through BackendArgs, and parsing/printing the target-shard IPC CLI options without enabling mixed-backend execution yet. Refresh the auto-integration manifest with current probe/delegation results.

Port a narrow PR Luce-Org#321 runtime slice into the auto-integration stack: resolve a null-safe log prefix once, use it consistently for layer-split runtime diagnostics/snapshot setup, and stamp shard metadata with each configured per-shard placement backend.

Also refresh the auto-integration manifest with current PR classification, probe counts, retained worktrees, and validation notes.

Carry the next narrow PR Luce-Org#321 slice by passing the staged RemoteTargetShardConfig from BackendArgs into Qwen35LayerSplitAdapterConfig. Also add the LayerSplitShardMeta placement_backend field required by the previously ported runtime metadata slice.

Validation: git diff --check; conflict-marker scan on promoted source files; stub g++ syntax smoke for LayerSplitShardMeta::placement_backend. Full CMake remains locally blocked by missing server deps/CUDA compiler-id environment issues.

Selective-port a no-op-safe slice from PR Luce-Org#321 by adding the backend IPC mode parse/name surface and declaration-only Qwen35 target-shard IPC client/daemon contract. Runtime implementation, CMake wiring, daemon dispatch, and mixed-backend activation remain intentionally deferred until the broader layer-split conflicts are reconciled. Update the auto-integration manifest with current PR classifications, retained worktrees, validation, and Codex delegation evidence.

Port the inert PR Luce-Org#321 target-shard IPC client implementation and register it with dflash_common. The client remains unactivated until daemon dispatch and runtime adapter wiring are reconciled. Validation: git diff --check; conflict-marker search; YAML parse. Local syntax probing remains blocked by the missing vendored ggml-backend.h dependency in this checkout.

Port a narrow Luce-Org#321 current-layout slice by making inactive Qwen35 target-shard IPC state/snapshot calls no-op successes. This lets future runtime adapter hooks call snapshot/reset/restore helpers safely before the mixed-backend target-shard client is active.

Update auto-integration manifest with current PR containment, probe results, Codex delegation outcome, validation, and retained worktrees.

Record the 2026-05-31 19:49 cron preflight, current PR containment, direct-merge probe counts, and the unpromoted PR Luce-Org#321 daemon-dispatch attempt blocked by the missing current-layout forward-from-activation helper.

Port a narrow PR Luce-Org#321 safety guard into the current stack: invalid capture layer indices, invalid positions, non-positive ring capacity, and invalid hidden size now fail instead of silently no-oping during DFlash feature-ring capture copies. Refresh auto-integration metadata with current PR containment and probe results.
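The fail-instead-of-no-op behavior described in this commit message can be sketched as below. `FeatureRing` and `ring_capture` are hypothetical, host-memory stand-ins; the real capture copies operate on backend tensors, and the slot mapping (`pos % capacity`) is an assumption about how a ring would typically be indexed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical feature ring: `capacity` slots per layer, `hidden` floats per slot.
struct FeatureRing {
    int n_layers = 0;
    int capacity = 0;
    int hidden   = 0;
    std::vector<float> data;  // expected size: n_layers * capacity * hidden
};

// Copy one hidden-state row into the ring slot for (layer, pos).
// Every invalid argument returns false rather than silently doing nothing.
bool ring_capture(FeatureRing & ring, int layer, long long pos, const float * src) {
    if (src == nullptr) {
        return false;
    }
    if (ring.capacity <= 0 || ring.hidden <= 0) {
        return false;  // non-positive ring capacity or invalid hidden size
    }
    if (layer < 0 || layer >= ring.n_layers) {
        return false;  // invalid capture layer index
    }
    if (pos < 0) {
        return false;  // invalid position
    }
    const std::size_t expected = static_cast<std::size_t>(ring.n_layers) *
                                 static_cast<std::size_t>(ring.capacity) *
                                 static_cast<std::size_t>(ring.hidden);
    if (ring.data.size() != expected) {
        return false;  // storage does not match the declared shape
    }

    const std::size_t slot = static_cast<std::size_t>(pos % ring.capacity);
    const std::size_t offset =
        (static_cast<std::size_t>(layer) * ring.capacity + slot) * ring.hidden;
    std::copy(src, src + ring.hidden, ring.data.data() + offset);
    return true;
}
```

Returning a status lets the caller abort the speculative step when a capture is rejected, instead of continuing with stale draft features.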

Selectively ports the next inert PR Luce-Org#321 target-shard IPC prerequisite onto auto-integration. Adds a Qwen35 layer-split forward-from-activation entry point with boundary activation validation, explicit ActivationPair ownership semantics, and F32 capture guards while leaving daemon dispatch and adapter wiring deferred. Refreshes the auto-integration manifest with the 22:04 probe results.

Record the 2026-05-31 23:00 metadata/probe refresh: current PR-head containment, Luce-Org#321/Luce-Org#325 conflict probes, and tmux Claude/Codex read-only delegation outcomes. No source changes were promoted.

Record PR Luce-Org#326 integration, current PR-head coverage, retained conflict probes, and Luce-Org#321 target-shard IPC feasibility findings.

Record the 2026-06-01 00:28 unattended probe pass. No source changes were promoted; Luce-Org#321 still needs live Qwen35 mixed-target adapter wiring, while Luce-Org#325's non-Luce-Org#321 same-backend disk-prefix-cache behavior is represented pending Luce-Org#321 mixed-target wiring.

Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.

Record the 2026-06-01 02:48 unattended probe pass, current PR containment, direct-merge conflict counts, and retained Luce-Org#321 Codex transcript. No source changes were promoted.

Record the 2026-06-01 03:07 unattended probe run, including current PR-head containment, direct-merge conflict counts, the Codex Luce-Org#321 read-only delegation outcome, validation, and retained worktree paths.

Record current heads for PR Luce-Org#321 and Luce-Org#325 as represented by the auto-integration stack after direct merge and tmux-delegated conflict-resolution attempts confirmed the remaining diffs are already carried by current-layout port commits.

weicj added 3 commits May 31, 2026 19:09

fix(server): enable sampling for target layer split

a282430

feat(server): add Laguna target-layer-split adapter

c1e4afe

refactor(server): share target layer-split runtime helpers

5ae5fa4

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026

docs: note post-push draft PR

b5aea57

weicj added 2 commits May 31, 2026 19:14

feat(server): add selectable backend IPC payload transport

09fe1ae

feat(server): support DFlash with mixed-backend target split

71b3e98

weicj force-pushed the feat-mixed-backend-layer-split-runtime branch from 72107e2 to 71b3e98 May 31, 2026 11:21

weicj marked this pull request as ready for review May 31, 2026 17:24

cubic-dev-ai Bot reviewed May 31, 2026

fix(server): harden mixed target split IPC state

87fe765

cubic-dev-ai Bot reviewed May 31, 2026

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026

docs: refresh auto-integration checkpoint

545048f

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026

Merge PR Luce-Org#321 as represented in auto-integration

aa105c9

weicj wants to merge 6 commits into Luce-Org:main from weicj:feat-mixed-backend-layer-split-runtime