feat(server): add target-layer-split backend adapter path#265

Merged: davide221 merged 7 commits into Luce-Org:main from weicj:feat-cpp-server-target-layer-split-prep on May 28, 2026

@weicj weicj commented May 23, 2026

Collaborator

Summary

This PR exposes target layer split through the native C++ server and makes the layer-split adapter path reusable instead of qwen35-owned.

Before this change, the repository already had qwen35 target-layer split machinery in the bench / daemon path, but dflash_server still rejected --target-devices and --target-layer-split. The split implementation also carried some generic concepts inside qwen35-specific code, which would make Gemma4 and future model adapters repeat the same load-plan, shard metadata, peer-access, and snapshot setup logic.

This PR moves the shared target-layer-split flow into a generic server-facing backend and common layer-split scaffold. qwen35 becomes the first concrete adapter on top of that scaffold.

Changes

  1. Add the generic server-facing layer-split backend
  • Adds LayerSplitBackend for the shared target-sharding request flow used by dflash_server.
  • Adds LayerSplitAdapter, so each model family supplies only its model-specific partial load, cache, forward, compression, and optional spec-decode behavior.
  • Keeps target-layer split same-backend for this PR: CUDA shards in a CUDA build, HIP shards in a HIP build. Cross-backend target layer split remains an experimental path for later work.
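
The split of responsibilities described in this item can be sketched roughly as below. This is a minimal illustration, not the PR's code: `SplitAdapter`, `SplitBackend`, `EchoAdapter`, and every method name are hypothetical stand-ins for `LayerSplitAdapter`, `LayerSplitBackend`, and a model adapter, whose real signatures are not shown in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for LayerSplitAdapter: only model-specific hooks live here.
struct SplitAdapter {
    virtual ~SplitAdapter() = default;
    // Load the layers [layer_begin, layer_end) that belong to one shard.
    virtual bool load_shard(int shard_index, int layer_begin, int layer_end) = 0;
    // Run the model over the prompt and return generated token ids.
    virtual std::vector<int> forward(const std::vector<int>& prompt, int n_predict) = 0;
    // Optional capability; adapters without spec-decode keep the default.
    virtual bool supports_spec_decode() const { return false; }
};

// Hypothetical stand-in for LayerSplitBackend: owns the shared request flow.
class SplitBackend {
public:
    explicit SplitBackend(std::unique_ptr<SplitAdapter> adapter)
        : adapter_(std::move(adapter)) {}

    // Ask the adapter to load each shard; stop at the first failure.
    bool load(const std::vector<std::pair<int, int>>& ranges) {
        loaded_ = false;
        for (std::size_t i = 0; i < ranges.size(); ++i) {
            if (!adapter_->load_shard(static_cast<int>(i), ranges[i].first, ranges[i].second)) {
                return false;
            }
        }
        loaded_ = !ranges.empty();
        return loaded_;
    }

    // Generation is refused until every shard has loaded.
    std::vector<int> generate(const std::vector<int>& prompt, int n_predict) {
        if (!loaded_) return {};
        return adapter_->forward(prompt, n_predict);
    }

private:
    std::unique_ptr<SplitAdapter> adapter_;
    bool loaded_ = false;
};

// Toy adapter used only to exercise the flow; it repeats the last prompt token.
struct EchoAdapter : SplitAdapter {
    bool load_shard(int, int layer_begin, int layer_end) override {
        return layer_begin < layer_end;  // reject empty ranges
    }
    std::vector<int> forward(const std::vector<int>& prompt, int n_predict) override {
        const int last = prompt.empty() ? 0 : prompt.back();
        const std::size_t n = n_predict > 0 ? static_cast<std::size_t>(n_predict) : 0;
        return std::vector<int>(n, last);
    }
};
```

In this sketch a second model family would only need its own `SplitAdapter` subclass; the load and generate flow in `SplitBackend` would stay untouched.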
  2. Move generic split metadata and setup into common code
  • Adds common LayerSplitRange, LayerSplitLoadPlan, and LayerSplitShardMeta.
  • Adds shared helpers for layer range computation, shard metadata initialization, peer-access setup, snapshot-backend init/free, load-plan construction, and shard lookup by layer id.
  • Removes the qwen35-owned generic-looking TargetLoadPlan / TargetLayerSplitShard concepts from the shared flow.
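
As a rough illustration of the layer range computation and shard lookup by layer id mentioned in this item, the sketch below splits a layer count into contiguous ranges and finds the owning shard. `LayerRange`, `compute_layer_ranges`, and `find_shard_for_layer` are hypothetical names, and the even split is an assumption: the PR text does not say how `--target-layer-split` distributes layers, and the real `LayerSplitRange` fields are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for LayerSplitRange.
struct LayerRange {
    int begin;  // first layer owned by the shard (inclusive)
    int end;    // one past the last layer owned by the shard (exclusive)
};

// ASSUMPTION: an even split, with the remainder given to the earliest shards.
std::vector<LayerRange> compute_layer_ranges(int n_layers, int n_shards) {
    assert(n_layers >= 0 && n_shards > 0);
    std::vector<LayerRange> ranges;
    ranges.reserve(static_cast<std::size_t>(n_shards));
    const int base = n_layers / n_shards;
    const int extra = n_layers % n_shards;
    int begin = 0;
    for (int i = 0; i < n_shards; ++i) {
        const int count = base + (i < extra ? 1 : 0);
        ranges.push_back(LayerRange{begin, begin + count});
        begin += count;
    }
    return ranges;
}

// Return the index of the shard that owns layer_id, or -1 if no shard does.
int find_shard_for_layer(const std::vector<LayerRange>& ranges, int layer_id) {
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (layer_id >= ranges[i].begin && layer_id < ranges[i].end) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With 10 layers over 3 shards, this sketch yields the ranges [0, 4), [4, 7), and [7, 10).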
  3. Wire qwen35 as the first concrete adapter
  • Adds Qwen35LayerSplitAdapter and routes qwen35 multi-GPU placement through LayerSplitBackend(Qwen35LayerSplitAdapter).
  • Keeps qwen35 single-GPU serving on the existing Qwen35Backend.
  • Renames qwen35 split payload and functions to qwen35-specific names, for example Qwen35LayerSplitShard and run_qwen35_layer_split_forward.
  • Leaves model-specific data in qwen35: target weights, target cache, step graph, qwen35 forward logic, and qwen35 DFlash target wrapping.
  4. Preserve server features on split targets
  • Enables target-only generation on split targets.
  • Keeps fresh DFlash generation working with local same-backend draft and existing remote draft IPC placement.
  • Lets PFlash compression run before split-target generation by feeding the compressed prompt into the sharded target.
  • Adds per-shard in-memory prefix snapshots for target-only and PFlash paths.
  • Preserves DFlash on split-target prefix-cache restore by snapshotting/restoring the draft feature tail together with the target shard snapshots, including the remote-draft IPC path.
  • Adds remote draft IPC feature-range get/set commands so a restored target snapshot can also restore the draft-side feature state needed for speculative decode without copying the whole ring.
  • Only confirms an inline prefix-cache entry after the backend reports that the snapshot slot is actually usable, avoiding false prefix-cache hits when a split snapshot fails.
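
The last point, confirming a prefix-cache entry only after the backend reports a usable snapshot slot, can be sketched as follows. This is an illustration under assumptions: `InlinePrefixCache`, `try_insert`, and the callback shape are hypothetical and do not reflect the server's actual cache types.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical in-memory prefix cache that maps a prefix key to a snapshot slot.
class InlinePrefixCache {
public:
    // take_snapshot must return true only when the slot is actually usable.
    // The entry is recorded after that confirmation, never before.
    bool try_insert(const std::string& prefix_key, int slot,
                    const std::function<bool(int)>& take_snapshot) {
        if (!take_snapshot(slot)) {
            return false;  // failed snapshot: leave the cache unchanged
        }
        entries_[prefix_key] = slot;
        return true;
    }

    // Return the slot for prefix_key, or -1 on a miss.
    int lookup(const std::string& prefix_key) const {
        const auto it = entries_.find(prefix_key);
        return it == entries_.end() ? -1 : it->second;
    }

private:
    std::unordered_map<std::string, int> entries_;
};
```

Because the map is written only after the callback succeeds, a failed split snapshot cannot produce a later cache hit in this sketch.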
  5. Tighten split-target failure and cleanup behavior
  • Rejects --kv-cache-dir with --target-devices instead of silently clearing the disk-cache setting. Disk prefix cache still needs a sharded snapshot format before it can safely support split targets.
  • Validates PFlash compression keep_ratio before passing the request into model adapters.
  • Carries draft_swa_window into the qwen35 layer-split adapter so split and non-split qwen35 paths stay aligned for that runtime setting.
  • Makes LayerSplitBackend::shutdown() idempotent and clears qwen35 shard state during teardown.
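
The checks in this item could look roughly like the sketch below. Everything here is hypothetical: `SplitArgs`, `check_split_args`, `keep_ratio_is_valid`, and `ShardOwner` are illustrative names, and the accepted `keep_ratio` range of (0, 1] is an assumption, since the PR text does not state the valid range.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical subset of parsed server arguments.
struct SplitArgs {
    std::vector<int> target_devices;  // values from --target-devices
    std::string kv_cache_dir;         // value from --kv-cache-dir, empty if unset
};

// Return an empty string when the combination is allowed, else an error message.
std::string check_split_args(const SplitArgs& args) {
    if (!args.target_devices.empty() && !args.kv_cache_dir.empty()) {
        return "--kv-cache-dir cannot be combined with --target-devices";
    }
    return std::string();
}

// ASSUMPTION: a usable keep_ratio is a finite value in (0, 1].
bool keep_ratio_is_valid(float keep_ratio) {
    return std::isfinite(keep_ratio) && keep_ratio > 0.0f && keep_ratio <= 1.0f;
}

// Hypothetical owner of per-shard state with an idempotent shutdown.
class ShardOwner {
public:
    explicit ShardOwner(std::size_t n_shards) : shards_(n_shards, 0) {}

    // Safe to call more than once; only the first call releases state.
    void shutdown() {
        if (shut_down_) return;
        shards_.clear();
        shut_down_ = true;
        ++teardown_count_;
    }

    std::size_t shard_count() const { return shards_.size(); }
    int teardown_count() const { return teardown_count_; }

private:
    std::vector<int> shards_;
    bool shut_down_ = false;
    int teardown_count_ = 0;
};
```

Rejecting the flag combination up front, instead of clearing the disk-cache setting, makes the unsupported configuration visible to the caller.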

Notes

  • Additional adapters, including Gemma4, can attach to the same LayerSplitBackend by defining only their model-specific shard payload, partial loader, cache, forward, snapshot, and optional DFlash/PFlash hooks.
  • Disk prefix cache for split targets is intentionally left for a follow-up memory-optimization PR, because the current on-disk snapshot format serializes a single target snapshot and needs a sharded format extension before it can safely persist multi-shard snapshots.
  • CUDA and HIP/ROCm 6.3.3 builds pass, and test_server_unit passes on both builds.
  • Runtime smoke passed on dual Pro VII/gfx906 split target with Tesla P4 CUDA remote draft IPC and a qwen35-family 21B IQ1 target: repeated-prefix in-memory cache hit/restore continued through DFlash speculative decode instead of falling back to AR.

weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from 80108ba to 902cf3b on May 23, 2026 13:12
weicj marked this pull request as ready for review on May 23, 2026 15:58

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

2 issues found across 27 files

Comment thread dflash/src/common/layer_split_backend.cpp Outdated
Comment thread dflash/src/common/backend_ipc.cpp Outdated
weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from 902cf3b to 6fd8d9e on May 23, 2026 16:53
weicj marked this pull request as draft on May 24, 2026 06:28
weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from ed8d5e2 to 73c4a85 on May 28, 2026 06:21
weicj marked this pull request as ready for review on May 28, 2026 07:07

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

5 issues found across 32 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/backend_factory.cpp">

<violation number="1" location="server/src/common/backend_factory.cpp:55">
P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</violation>
</file>

<file name="server/test/test_server_unit.cpp">

<violation number="1" location="server/test/test_server_unit.cpp:1366">
P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</violation>
</file>

Comment thread server/src/server/server_main.cpp Outdated
cfg.device = args.device;
cfg.draft_gpu = args.draft_device.gpu;
cfg.remote_draft = args.remote_draft;
cfg.fa_window = args.fa_window;

Contributor

P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/backend_factory.cpp, line 55:

<comment>Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</comment>

<file context>
@@ -42,6 +45,30 @@ std::unique_ptr<ModelBackend> create_backend(const BackendArgs & args) {
+            cfg.device             = args.device;
+            cfg.draft_gpu          = args.draft_device.gpu;
+            cfg.remote_draft       = args.remote_draft;
+            cfg.fa_window          = args.fa_window;
+            cfg.kq_stride_pad      = args.kq_stride_pad;
+            cfg.draft_ctx_max      = args.draft_ctx_max;
</file context>
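
One way to keep the split and non-split paths from drifting apart, shown here only as a hedged sketch and not as the fix applied in this PR, is to copy the shared runtime options through a single helper that both paths call. The struct names and the `int` field types below are assumptions; only the field names `fa_window`, `kq_stride_pad`, `draft_ctx_max`, and `draft_swa_window` come from this page.

```cpp
#include <cassert>

// Hypothetical subset of the parsed backend arguments.
struct BackendArgsLite {
    int fa_window = 0;
    int kq_stride_pad = 0;
    int draft_ctx_max = 0;
    int draft_swa_window = 0;
};

// Runtime options shared by both configuration types (hypothetical grouping).
struct RuntimeOptions {
    int fa_window = 0;
    int kq_stride_pad = 0;
    int draft_ctx_max = 0;
    int draft_swa_window = 0;
};

// Hypothetical configs for the single-GPU and layer-split paths.
struct SingleGpuConfig { RuntimeOptions runtime; };
struct LayerSplitConfig { RuntimeOptions runtime; };

// Single place where arguments become runtime options, so a newly added
// option reaches both paths or neither.
RuntimeOptions runtime_options_from(const BackendArgsLite& args) {
    RuntimeOptions opts;
    opts.fa_window = args.fa_window;
    opts.kq_stride_pad = args.kq_stride_pad;
    opts.draft_ctx_max = args.draft_ctx_max;
    opts.draft_swa_window = args.draft_swa_window;
    return opts;
}

SingleGpuConfig make_single_gpu_config(const BackendArgsLite& args) {
    return SingleGpuConfig{runtime_options_from(args)};
}

LayerSplitConfig make_layer_split_config(const BackendArgsLite& args) {
    return LayerSplitConfig{runtime_options_from(args)};
}
```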

GenerateResult restored = backend.restore_and_generate(2, restore_req, io);

TEST_ASSERT(restored.ok);
TEST_ASSERT(raw->dflash_called);

Contributor

P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_server_unit.cpp, line 1366:

<comment>Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</comment>

<file context>
@@ -1184,6 +1188,211 @@ static void test_normalize_responses_tool_followup_messages() {
+    GenerateResult restored = backend.restore_and_generate(2, restore_req, io);
+
+    TEST_ASSERT(restored.ok);
+    TEST_ASSERT(raw->dflash_called);
+    TEST_ASSERT(raw->restored_slot == 2);
+    TEST_ASSERT(!raw->reset_called);
</file context>
Suggested change:
- TEST_ASSERT(raw->dflash_called);
+ TEST_ASSERT(!raw->dflash_called);

Comment thread server/src/common/layer_split_backend.cpp
Comment thread server/src/common/layer_split_backend.cpp
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Record the clean integration of PRs Luce-Org#265 and Luce-Org#273, the refreshed conflict probes for the remaining selective-port PRs, and the current validation results.
@davide221 davide221 merged commit 43457d8 into Luce-Org:main May 28, 2026
3 checks passed