
feat(server): support DFlash with mixed-backend target layer split #321

Open
weicj wants to merge 6 commits into Luce-Org:main from weicj:feat-mixed-backend-layer-split-runtime

@weicj (Collaborator) commented May 31, 2026

Summary

This PR lets the target layer split span different backends and completes the DFlash speculative decoding path on top of the mixed-backend target split.

Previously, same-backend layer split could shard the target across multiple GPUs of one backend, but mixed CUDA/HIP placement was limited to separating the draft from the target: the target itself could not be split across backend processes. DFlash also needs more than a plain target forward pass. Its verify step requires hidden-state capture, forwarding to either the draft feature ring or remote draft IPC, target KV snapshot/restore, and a final token projection. This PR wires each of these pieces into the mixed target-shard IPC path.
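To illustrate why verify needs a target KV snapshot/restore, here is a minimal sketch assuming a greedy acceptance rule and a toy KV cache that stores one token per position. `ToyKvCache` and `verify_draft` are hypothetical names; they are not this PR's API, and DFlash's real verify path (hidden-state capture, feature-ring writes, remote shards) is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a target KV cache: one entry per committed position.
struct ToyKvCache {
    std::vector<int> tokens;  // token whose K/V is stored at each position

    std::size_t snapshot() const { return tokens.size(); }
    // Restore to an earlier snapshot by dropping later (speculative) entries.
    void restore(std::size_t len) { tokens.resize(len); }
};

// Greedy verify. target_next[i] is the target's argmax after seeing draft[0..i),
// so it has one more entry than draft. Accept the longest agreeing prefix, then
// commit one target token (the correction, or a bonus token if all matched).
std::vector<int> verify_draft(ToyKvCache & kv,
                              const std::vector<int> & draft,
                              const std::vector<int> & target_next) {
    assert(target_next.size() == draft.size() + 1);
    const std::size_t snap = kv.snapshot();

    // The verify forward pass writes K/V for every draft token up front.
    for (int t : draft) {
        kv.tokens.push_back(t);
    }

    std::size_t accepted = 0;
    while (accepted < draft.size() && draft[accepted] == target_next[accepted]) {
        ++accepted;
    }

    // Roll back the K/V written for rejected draft tokens.
    kv.restore(snap + accepted);

    std::vector<int> committed(draft.begin(),
                               draft.begin() + static_cast<std::ptrdiff_t>(accepted));
    // This token's K/V is written by the next forward pass, not here.
    committed.push_back(target_next[accepted]);
    return committed;
}
```

Truncating a vector stands in for restore in this toy. In the mixed split, the equivalent rollback must also happen on the remote target shard, which is why the PR adds snapshot/restore support there.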

Changes

  • Add a mixed-backend target shard IPC path: the local shard can run the first target layers and hand the boundary activation to another backend process for the remaining target layers.
  • Let the remote target shard return DFlash capture slices; the mixed forward pass writes local and remote captures into the local DraftFeatureMirror or forwards them over remote draft IPC.
  • Add target KV snapshot / restore support to the remote target shard for DFlash speculative verify rollback.
  • Add hidden-state-to-token projection on the remote target shard so DFlash target split can finish token decisions when the final layers / LM head live remotely.
  • Support in-memory prefix cache for mixed target split: the local shard keeps its local prefix snapshot, while the remote target shard keeps the matching slot inside its own backend process and restores it through IPC control commands instead of transferring large KV payloads.
  • Relax server placement validation: instead of requiring every target-split shard to run on the currently compiled backend, allow one local backend group plus one remote backend group when --target-shard-ipc-bin is provided.
  • Keep same-backend target layer split on the existing in-process local runtime path.
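The relaxed placement rule can be sketched as below. This is a hypothetical reading of the description, not the PR's DevicePlacement code. It assumes device lists look like cuda:0,hip:0,hip:1, that the local (compiled) backend group comes first because the local shard runs the first target layers, that the remote group is a single contiguous backend, and that duplicates are detected by backend plus GPU index.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shard device entry parsed from "backend:gpu".
struct ShardDevice {
    std::string backend;  // e.g. "cuda" or "hip"
    int gpu = -1;
};

// Parse a comma-separated list such as "cuda:0,hip:0,hip:1".
bool parse_devices(const std::string & list, std::vector<ShardDevice> & out) {
    out.clear();
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const std::size_t colon = item.find(':');
        if (colon == std::string::npos || colon == 0 || colon + 1 == item.size()) {
            return false;
        }
        const std::string num = item.substr(colon + 1);
        ShardDevice d;
        d.backend = item.substr(0, colon);
        try {
            std::size_t used = 0;
            d.gpu = std::stoi(num, &used);
            if (used != num.size() || d.gpu < 0) {
                return false;
            }
        } catch (...) {  // std::invalid_argument or std::out_of_range
            return false;
        }
        out.push_back(d);
    }
    return !out.empty();
}

// Accept an all-local split (in-process path), or one local group followed by
// one remote group when a target-shard IPC binary is configured.
bool validate_placement(const std::vector<ShardDevice> & shards,
                        const std::string & local_backend,
                        bool have_ipc_bin) {
    std::set<std::pair<std::string, int>> seen;
    for (const ShardDevice & s : shards) {
        // Duplicates are keyed by backend plus GPU, so cuda:0 and hip:0 can coexist.
        if (!seen.emplace(s.backend, s.gpu).second) {
            return false;
        }
    }
    if (shards.empty() || shards.front().backend != local_backend) {
        return false;
    }
    std::size_t i = 0;
    while (i < shards.size() && shards[i].backend == local_backend) {
        ++i;
    }
    if (i == shards.size()) {
        return true;  // same-backend split: no IPC needed
    }
    if (!have_ipc_bin) {
        return false;
    }
    const std::string remote = shards[i].backend;
    for (; i < shards.size(); ++i) {
        if (shards[i].backend != remote) {
            return false;  // a third group, or local shards after remote ones
        }
    }
    return true;
}
```

With the device list from the validation notes, `validate_placement(devices, "cuda", true)` accepts the split, while the same list without an IPC binary, or a list that interleaves backends, is rejected.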

Notes

  • Local runtime validation covered a CUDA Tesla P4 plus two HIP Pro VII GPUs, running the Qwen3.6-27B Q4 target split across cuda:0,hip:0,hip:1 with layer split 0.08,0.46,0.46. Logs show CUDA running layers [0,5) and the two HIP shards running [5,35) and [35,64).
  • DFlash validation covered both local CUDA draft and remote HIP draft IPC modes; both returned valid OpenAI-compatible responses and server logs reported accepted draft tokens.
  • In-memory prefix cache was validated on the 27B DFlash mixed-target-split path: the first request committed an inline snapshot, and the second identical request hit restore=true.
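For intuition about how 0.08,0.46,0.46 maps to the logged ranges on the 64-layer target, here is one rounding scheme that reproduces [0,5), [5,35), and [35,64): round the cumulative boundaries (0.08 × 64 = 5.12 rounds to 5; 0.54 × 64 = 34.56 rounds to 35) and pin the last shard to the layer count. `split_layers` is a hypothetical helper; the PR does not say which rounding it uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical: map per-shard split ratios to half-open [begin, end) layer ranges
// by rounding cumulative boundaries. The last shard always ends at n_layers.
std::vector<std::pair<int, int>> split_layers(int n_layers,
                                              const std::vector<double> & ratios) {
    if (n_layers <= 0 || ratios.empty()) {
        throw std::invalid_argument("need layers and at least one ratio");
    }
    double total = 0.0;
    for (double r : ratios) {
        if (!(r > 0.0)) {
            throw std::invalid_argument("ratios must be positive");
        }
        total += r;
    }
    std::vector<std::pair<int, int>> ranges;
    double prefix = 0.0;
    int begin = 0;
    for (std::size_t i = 0; i < ratios.size(); ++i) {
        prefix += ratios[i];
        const int end = (i + 1 == ratios.size())
            ? n_layers
            : static_cast<int>(std::lround(prefix / total * n_layers));
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

Rounding each shard's share independently would give 5 + 29 + 29 = 63 layers and drop one; rounding cumulative boundaries keeps the ranges contiguous and covering all 64 layers.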

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record draft PR Luce-Org#321 from the final PR re-enumeration and confirm no new non-draft PR appeared after the auto-integration push.
@weicj force-pushed the feat-mixed-backend-layer-split-runtime branch from 72107e2 to 71b3e98 May 31, 2026 11:21
@weicj marked this pull request as ready for review May 31, 2026 17:24
@cubic-dev-ai (Contributor, bot) left a comment

6 issues found across 43 files

Reply with feedback, questions, or to request a fix.


Comment threads:
  • server/src/qwen35/qwen35_layer_split_adapter.cpp
  • server/src/common/dflash_feature_ring.cpp (outdated)
  • server/src/qwen35/qwen35_layer_split_dflash_target.cpp (outdated)
  • server/src/common/dflash_draft_ipc.cpp
  • server/src/common/dflash_draft_ipc_daemon.cpp
  • server/src/laguna/laguna_target_loader.cpp
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the Luce-Org#291/Luce-Org#290 draft-residency integration, newly non-draft Luce-Org#321/Luce-Org#325 classification, validation, and retained worktree/transcript paths for the May 31 13:30 UTC run.
@cubic-dev-ai (Contributor, bot) left a comment

1 issue found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/qwen35_target_shard_ipc.cpp">

<violation number="1" location="server/src/qwen35/qwen35_target_shard_ipc.cpp:60">
P2: The negative-value guard only checks `raw[0]`, so signed inputs with leading whitespace (for example `"   -1"`) still pass through `strtoull` and can produce an unintended huge shared-memory size.</violation>
</file>


Comment on lines +60 to +64

```cpp
    if (raw[0] == '-') {
        return required_bytes;
    }
    char * end = nullptr;
    const unsigned long long parsed = std::strtoull(raw, &end, 10);
```

P2: The negative-value guard only checks raw[0], so signed inputs with leading whitespace (for example " -1") still pass through strtoull and can produce an unintended huge shared-memory size.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_target_shard_ipc.cpp, line 60:

<comment>The negative-value guard only checks `raw[0]`, so signed inputs with leading whitespace (for example `"   -1"`) still pass through `strtoull` and can produce an unintended huge shared-memory size.</comment>

<file context>
```diff
@@ -57,9 +57,13 @@ size_t target_shard_shared_bytes_from_env(size_t required_bytes) {
     if (!raw || !*raw) {
         return required_bytes;
     }
+    if (raw[0] == '-') {
+        return required_bytes;
+    }
```
</file context>
Suggested change

```diff
-    if (raw[0] == '-') {
-        return required_bytes;
-    }
-    char * end = nullptr;
-    const unsigned long long parsed = std::strtoull(raw, &end, 10);
+    const char * p = raw;
+    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r' || *p == '\f' || *p == '\v') {
+        ++p;
+    }
+    if (*p == '-') {
+        return required_bytes;
+    }
+    char * end = nullptr;
+    const unsigned long long parsed = std::strtoull(p, &end, 10);
+    if (end == p || *end != '\0' ||
+        parsed > (unsigned long long)std::numeric_limits<size_t>::max()) {
+        return required_bytes;
+    }
```
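A self-contained variant of the suggested guard, for reference. It is a sketch, not the committed fix: `parse_shared_bytes_override` is a hypothetical name, and it takes the raw string directly instead of reading the environment variable. Beyond the leading-whitespace sign check, it also tests `errno == ERANGE`: where size_t is 64 bits, an overflowing input makes `strtoull` return `ULLONG_MAX`, which is not greater than `SIZE_MAX`, so the range comparison alone would accept it as a huge size.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <limits>

// Hypothetical helper: parse an optional shared-memory size override, falling
// back to `fallback` on any malformed, negative, or out-of-range input.
std::size_t parse_shared_bytes_override(const char * raw, std::size_t fallback) {
    if (raw == nullptr || *raw == '\0') {
        return fallback;
    }
    // Skip leading whitespace the same way strtoull does, so the sign check
    // below sees the first significant character.
    const char * p = raw;
    while (std::isspace(static_cast<unsigned char>(*p))) {
        ++p;
    }
    // strtoull accepts "-1" and negates it into a huge unsigned value.
    if (*p == '-') {
        return fallback;
    }
    errno = 0;
    char * end = nullptr;
    const unsigned long long parsed = std::strtoull(p, &end, 10);
    // Reject no digits, trailing garbage, overflow, and values that do not
    // fit in size_t on this platform.
    if (end == p || *end != '\0' || errno == ERANGE ||
        parsed > static_cast<unsigned long long>(std::numeric_limits<std::size_t>::max())) {
        return fallback;
    }
    return static_cast<std::size_t>(parsed);
}
```

Whether a value smaller than required_bytes should also fall back is a policy question the thread does not settle, so the sketch leaves it out.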

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the exact Luce-Org#290/Luce-Org#291 merges, current Luce-Org#321/Luce-Org#325 classification, retained worktrees, and validation for the 2026-05-31 13:57 integration run.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Selectively carries the same-backend Qwen3.5 layer-split disk prefix-cache snapshot export/adopt slice from PR Luce-Org#325 while leaving the mixed-backend runtime and Laguna cache work blocked on the larger PR Luce-Org#321 architecture reconciliation.

Also refreshes the auto-integration manifest/run log with the current PR classification and retained worktree notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the 2026-05-31 15:02 unattended run: exact open-PR containment, fresh conflict probe counts for the eight remaining non-ancestor candidates, and the tmux-driven Luce-Org#321 Claude/Codex read-only attempts. No source changes were promoted.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Carry the conflict-free PR Luce-Org#321 placement foundation over the current auto-integration stack. DevicePlacement now records per-shard backends, parses mixed backend layer-split device lists, validates duplicate devices by backend plus GPU, and extends placement unit coverage.

The target-shard IPC/runtime pieces remain documented as pending selective-port work.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow PR Luce-Org#321 control-plane slice by adding RemoteTargetShardConfig, threading it through BackendArgs, and parsing/printing the target-shard IPC CLI options without enabling mixed-backend execution yet. Refresh the auto-integration manifest with current probe/delegation results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow PR Luce-Org#321 runtime slice into the auto-integration stack: resolve a null-safe log prefix once, use it consistently for layer-split runtime diagnostics/snapshot setup, and stamp shard metadata with each configured per-shard placement backend.

Also refresh the auto-integration manifest with current PR classification, probe counts, retained worktrees, and validation notes.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Carry the next narrow PR Luce-Org#321 slice by passing the staged RemoteTargetShardConfig from BackendArgs into Qwen35LayerSplitAdapterConfig. Also add the LayerSplitShardMeta placement_backend field required by the previously ported runtime metadata slice.

Validation: git diff --check; conflict-marker scan on promoted source files; stub g++ syntax smoke for LayerSplitShardMeta::placement_backend. Full CMake remains locally blocked by missing server deps/CUDA compiler-id environment issues.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Selective-port a no-op-safe slice from PR Luce-Org#321 by adding the backend IPC mode parse/name surface and declaration-only Qwen35 target-shard IPC client/daemon contract. Runtime implementation, CMake wiring, daemon dispatch, and mixed-backend activation remain intentionally deferred until the broader layer-split conflicts are reconciled.

Update the auto-integration manifest with current PR classifications, retained worktrees, validation, and Codex delegation evidence.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port the inert PR Luce-Org#321 target-shard IPC client implementation and register it with dflash_common. The client remains unactivated until daemon dispatch and runtime adapter wiring are reconciled.

Validation: git diff --check; conflict-marker search; YAML parse. Local syntax probing remains blocked by the missing vendored ggml-backend.h dependency in this checkout.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Port a narrow Luce-Org#321 current-layout slice by making inactive Qwen35 target-shard IPC state/snapshot calls no-op successes. This lets future runtime adapter hooks call snapshot/reset/restore helpers safely before the mixed-backend target-shard client is active.

Update auto-integration manifest with current PR containment, probe results, Codex delegation outcome, validation, and retained worktrees.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-05-31 19:49 cron preflight, current PR containment, direct-merge probe counts, and the unpromoted PR Luce-Org#321 daemon-dispatch attempt blocked by the missing current-layout forward-from-activation helper.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Port a narrow PR Luce-Org#321 safety guard into the current stack: invalid capture layer indices, invalid positions, non-positive ring capacity, and invalid hidden size now fail instead of silently no-oping during DFlash feature-ring capture copies. Refresh auto-integration metadata with current PR containment and probe results.
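A minimal sketch of a feature-ring capture copy that fails on the invalid inputs listed in that commit instead of silently skipping the copy. The ring layout (layer-major, `position % capacity` slot indexing) and all names (`ToyFeatureRing`, `ring_write`) are assumptions for illustration, not the actual dflash_feature_ring.cpp code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical feature ring: `capacity` slots per capture layer, each slot
// holding `hidden` floats, stored as [layer][slot][hidden].
struct ToyFeatureRing {
    int n_layers = 0;
    int capacity = 0;
    int hidden = 0;
    std::vector<float> data;
};

// Copy one captured hidden state into the ring. Every invalid argument
// returns false rather than turning the copy into a silent no-op.
bool ring_write(ToyFeatureRing & ring, int layer, long long position,
                const float * src, int src_hidden) {
    if (ring.capacity <= 0) return false;                             // non-positive capacity
    if (ring.hidden <= 0 || src_hidden != ring.hidden) return false;  // bad hidden size
    if (layer < 0 || layer >= ring.n_layers) return false;            // bad capture layer index
    if (position < 0) return false;                                   // bad position
    if (src == nullptr) return false;
    const std::size_t expected =
        static_cast<std::size_t>(ring.n_layers) * ring.capacity * ring.hidden;
    if (ring.data.size() != expected) return false;                   // storage not allocated

    // Assumed ring indexing: positions wrap around the per-layer slots.
    const std::size_t slot = static_cast<std::size_t>(position % ring.capacity);
    float * dst = ring.data.data() +
        (static_cast<std::size_t>(layer) * ring.capacity + slot) * ring.hidden;
    std::memcpy(dst, src, sizeof(float) * static_cast<std::size_t>(ring.hidden));
    return true;
}
```

Returning an error lets the caller abort the speculative step instead of letting the draft read stale features from a slot that was never written.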
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Selectively ports the next inert PR Luce-Org#321 target-shard IPC prerequisite onto auto-integration. Adds a Qwen35 layer-split forward-from-activation entry point with boundary activation validation, explicit ActivationPair ownership semantics, and F32 capture guards while leaving daemon dispatch and adapter wiring deferred. Refreshes the auto-integration manifest with the 22:04 probe results.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-05-31 23:00 metadata/probe refresh: current PR-head containment, Luce-Org#321/Luce-Org#325 conflict probes, and tmux Claude/Codex read-only delegation outcomes. No source changes were promoted.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record PR Luce-Org#326 integration, current PR-head coverage, retained conflict probes, and Luce-Org#321 target-shard IPC feasibility findings.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 00:28 unattended probe pass. No source changes were promoted; Luce-Org#321 still needs live Qwen35 mixed-target adapter wiring, while Luce-Org#325's non-Luce-Org#321 same-backend disk-prefix-cache behavior is represented pending Luce-Org#321 mixed-target wiring.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 unattended PR integration pass, updated PR Luce-Org#285 head containment, current selective-port conflict counts, and delegated review conclusions for the remaining Luce-Org#321/Luce-Org#325/Luce-Org#135 slices.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 02:48 unattended probe pass, current PR containment, direct-merge conflict counts, and retained Luce-Org#321 Codex transcript. No source changes were promoted.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record the 2026-06-01 03:07 unattended probe run, including current PR-head containment, direct-merge conflict counts, the Codex Luce-Org#321 read-only delegation outcome, validation, and retained worktree paths.
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 1, 2026
Record current heads for PR Luce-Org#321 and Luce-Org#325 as represented by the auto-integration stack after direct merge and tmux-delegated conflict-resolution attempts confirmed the remaining diffs are already carried by current-layout port commits.