feat(server): add target-layer-split backend adapter path by weicj · Pull Request #265 · Luce-Org/lucebox-hub

weicj · 2026-05-23T11:45:01Z

Summary

This PR exposes target layer split through the native C++ server and makes the layer-split adapter path reusable instead of qwen35-owned.

Before this change, the repository already had qwen35 target-layer split machinery in the bench / daemon path, but dflash_server still rejected --target-devices and --target-layer-split. The split implementation also carried some generic concepts inside qwen35-specific code, which would make Gemma4 and future model adapters repeat the same load-plan, shard metadata, peer-access, and snapshot setup logic.

This PR moves the shared target-layer-split flow into a generic server-facing backend and common layer-split scaffold. qwen35 becomes the first concrete adapter on top of that scaffold.

Changes

Add the generic server-facing layer-split backend

Adds LayerSplitBackend for the shared target-sharding request flow used by dflash_server.
Adds LayerSplitAdapter, so each model family supplies only its model-specific partial load, cache, forward, compression, and optional spec-decode behavior.
Keeps target-layer split same-backend for this PR: CUDA shards in a CUDA build, HIP shards in a HIP build. Cross-backend target layer split remains an experimental path for later work.
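
To make the backend/adapter division concrete, here is a minimal sketch of how a generic backend might delegate model-specific work to an adapter. The names `SplitAdapterSketch`, `SplitBackendSketch`, and `EchoAdapter`, along with every method signature, are hypothetical illustrations and not the actual `LayerSplitBackend` / `LayerSplitAdapter` API from this PR. The even layer division is also an assumption.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified adapter interface (not the PR's real signatures).
struct SplitAdapterSketch {
    virtual ~SplitAdapterSketch() = default;
    // Load layers [layer_begin, layer_end) onto one device.
    virtual bool load_shard(int device, int layer_begin, int layer_end) = 0;
    // Run one decode step across the shards and return the next token id.
    virtual int forward_token(int token) = 0;
    // Optional hook: adapters without speculative decode keep the default.
    virtual bool supports_spec_decode() const { return false; }
};

// Generic backend: owns the request flow, delegates model work to the adapter.
class SplitBackendSketch {
public:
    explicit SplitBackendSketch(std::unique_ptr<SplitAdapterSketch> adapter)
        : adapter_(std::move(adapter)) {
        assert(adapter_ != nullptr);
    }

    // Assigns contiguous layer ranges to devices; stops at the first failure.
    bool init(const std::vector<int>& devices, int n_layers) {
        if (devices.empty() || n_layers <= 0) return false;
        const int n = static_cast<int>(devices.size());
        for (int i = 0; i < n; ++i) {
            const int begin = i * n_layers / n;
            const int end = (i + 1) * n_layers / n;
            if (!adapter_->load_shard(devices[i], begin, end)) return false;
        }
        return true;
    }

    // Target-only generation loop: each step is fully delegated to the adapter.
    std::vector<int> generate(int first_token, int n_tokens) {
        std::vector<int> out;
        int token = first_token;
        for (int i = 0; i < n_tokens; ++i) {
            token = adapter_->forward_token(token);
            out.push_back(token);
        }
        return out;
    }

private:
    std::unique_ptr<SplitAdapterSketch> adapter_;
};

// Toy adapter used only to exercise the sketch; it stands in for a real model.
struct EchoAdapter : SplitAdapterSketch {
    int shards_loaded = 0;
    bool load_shard(int /*device*/, int layer_begin, int layer_end) override {
        if (layer_end <= layer_begin) return false;  // reject empty shards
        ++shards_loaded;
        return true;
    }
    int forward_token(int token) override { return token + 1; }
};
```

In this shape, a second model family would only need its own adapter subclass; the backend's init and generation loop stay untouched.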

Move generic split metadata and setup into common code

Adds common LayerSplitRange, LayerSplitLoadPlan, and LayerSplitShardMeta.
Adds shared helpers for layer range computation, shard metadata initialization, peer-access setup, snapshot-backend init/free, load-plan construction, and shard lookup by layer id.
Removes the qwen35-owned generic-looking TargetLoadPlan / TargetLayerSplitShard concepts from the shared flow.
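
As a rough illustration of the layer range computation and shard lookup helpers listed above, the sketch below splits layers evenly into half-open ranges and finds the shard that owns a given layer. The type `LayerRangeSketch`, both function names, and the even-split policy are assumptions for illustration. The real `LayerSplitRange` fields and the way `--target-layer-split` weights the shards are not shown in this PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a layer range: half-open interval [begin, end).
struct LayerRangeSketch {
    int begin;
    int end;
};

// Split n_layers into n_shards contiguous ranges, spreading the remainder
// over the first shards. Returns an empty vector for invalid input.
std::vector<LayerRangeSketch> compute_layer_ranges(int n_layers, int n_shards) {
    std::vector<LayerRangeSketch> ranges;
    if (n_layers <= 0 || n_shards <= 0 || n_shards > n_layers) return ranges;
    const int base = n_layers / n_shards;
    const int extra = n_layers % n_shards;
    int begin = 0;
    for (int i = 0; i < n_shards; ++i) {
        const int len = base + (i < extra ? 1 : 0);
        ranges.push_back({begin, begin + len});
        begin += len;
    }
    return ranges;
}

// Return the index of the shard that owns layer_id, or -1 if none does.
int find_shard_for_layer(const std::vector<LayerRangeSketch>& ranges, int layer_id) {
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (layer_id >= ranges[i].begin && layer_id < ranges[i].end) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With 10 layers over 3 shards, this policy yields the ranges [0, 4), [4, 7), and [7, 10), so layer 4 belongs to shard 1.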

Wire qwen35 as the first concrete adapter

Adds Qwen35LayerSplitAdapter and routes qwen35 multi-GPU placement through LayerSplitBackend(Qwen35LayerSplitAdapter).
Keeps qwen35 single-GPU serving on the existing Qwen35Backend.
Renames qwen35 split payload and functions to qwen35-specific names, for example Qwen35LayerSplitShard and run_qwen35_layer_split_forward.
Leaves model-specific data in qwen35: target weights, target cache, step graph, qwen35 forward logic, and qwen35 DFlash target wrapping.
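
The routing rule described above (single-GPU qwen35 stays on the existing backend, multi-GPU placement goes through the layer-split backend) could be sketched as below. `BackendKind`, `PlacementArgsSketch`, and `select_backend` are hypothetical names. The real decision lives in the backend factory and may consider more inputs than the device count.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical labels for the two serving paths named in the PR description.
enum class BackendKind { SingleGpu, LayerSplit };

// Hypothetical subset of the placement arguments.
struct PlacementArgsSketch {
    std::string arch;                 // e.g. "qwen35"
    std::vector<int> target_devices;  // devices listed via --target-devices
};

// Assumption: more than one target device selects the layer-split path;
// zero or one device keeps the existing single-GPU backend.
BackendKind select_backend(const PlacementArgsSketch& args) {
    return args.target_devices.size() > 1 ? BackendKind::LayerSplit
                                          : BackendKind::SingleGpu;
}
```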

Preserve server features on split targets

Enables target-only generation on split targets.
Keeps fresh DFlash generation working with local same-backend draft and existing remote draft IPC placement.
Lets PFlash compression run before split-target generation by feeding the compressed prompt into the sharded target.
Adds per-shard in-memory prefix snapshots for target-only and PFlash paths.
Preserves DFlash on split-target prefix-cache restore by snapshotting/restoring the draft feature tail together with the target shard snapshots, including the remote-draft IPC path.
Adds remote draft IPC feature-range get/set commands so a restored target snapshot can also restore the draft-side feature state needed for speculative decode without copying the whole ring.
Only confirms an inline prefix-cache entry after the backend reports that the snapshot slot is actually usable, avoiding false prefix-cache hits when a split snapshot fails.
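
The last point, confirming a prefix-cache entry only after the snapshot is known to be usable, can be illustrated with a small sketch. `PrefixCacheSketch` and its methods are hypothetical and are not the server's actual cache types. The callback stands in for whatever the backend uses to report snapshot success.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical inline prefix cache keyed by a prefix identifier.
class PrefixCacheSketch {
public:
    // snapshot_fn must return true only when the slot holds a usable snapshot.
    // The entry is registered after that confirmation, never before it.
    bool try_insert(const std::string& prefix_key, int slot,
                    const std::function<bool(int)>& snapshot_fn) {
        if (!snapshot_fn(slot)) {
            return false;  // failed split snapshot: no entry, so no false hit
        }
        entries_[prefix_key] = slot;
        return true;
    }

    // Returns the snapshot slot for prefix_key, or -1 on a cache miss.
    int lookup(const std::string& prefix_key) const {
        auto it = entries_.find(prefix_key);
        return it == entries_.end() ? -1 : it->second;
    }

private:
    std::unordered_map<std::string, int> entries_;
};
```

Registering the entry before the snapshot completes would let a later request hit a slot that was never filled, which is the false-hit case this change avoids.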

Tighten split-target failure and cleanup behavior

Rejects --kv-cache-dir with --target-devices instead of silently clearing the disk-cache setting. Disk prefix cache still needs a sharded snapshot format before it can safely support split targets.
Validates PFlash compression keep_ratio before passing the request into model adapters.
Carries draft_swa_window into the qwen35 layer-split adapter so split and non-split qwen35 paths stay aligned for that runtime setting.
Makes LayerSplitBackend::shutdown() idempotent and clears qwen35 shard state during teardown.
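
A minimal sketch of the three behaviors above: rejecting the flag combination, validating `keep_ratio`, and making shutdown idempotent. All type and function names here are hypothetical, and the accepted `keep_ratio` interval of (0, 1] is an assumption, since the PR description does not state the real bounds.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical subset of the server arguments.
struct ServerArgsSketch {
    std::string kv_cache_dir;         // value of --kv-cache-dir, empty if unset
    std::vector<int> target_devices;  // values of --target-devices
};

// Returns an empty string when the arguments are compatible, else an error.
// Failing loudly replaces silently clearing the disk-cache setting.
std::string validate_split_args(const ServerArgsSketch& args) {
    if (!args.target_devices.empty() && !args.kv_cache_dir.empty()) {
        return "--kv-cache-dir is not supported with --target-devices";
    }
    return "";
}

// Assumed valid interval (0, 1]; NaN fails both comparisons and is rejected.
bool keep_ratio_valid(float keep_ratio) {
    return keep_ratio > 0.0f && keep_ratio <= 1.0f;
}

// Idempotent teardown: the second and later calls are no-ops.
class ShutdownSketch {
public:
    explicit ShutdownSketch(int n_shards) : shards_(n_shards, 0) {}

    void shutdown() {
        if (shut_down_) return;
        shut_down_ = true;
        shards_.clear();  // stand-in for releasing per-shard state
        ++teardown_count_;
    }

    int teardown_count() const { return teardown_count_; }
    bool has_shards() const { return !shards_.empty(); }

private:
    std::vector<int> shards_;
    bool shut_down_ = false;
    int teardown_count_ = 0;
};
```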

Notes

Additional adapters, including Gemma4, can attach to the same LayerSplitBackend by defining only their model-specific shard payload, partial loader, cache, forward, snapshot, and optional DFlash/PFlash hooks.
Disk prefix cache for split targets is intentionally left for a follow-up memory-optimization PR, because the current on-disk snapshot format serializes a single target snapshot and needs a sharded format extension before it can safely persist multi-shard snapshots.
CUDA and HIP/ROCm 6.3.3 builds pass, and test_server_unit passes on both builds.
Runtime smoke passed on dual Pro VII/gfx906 split target with Tesla P4 CUDA remote draft IPC and a qwen35-family 21B IQ1 target: repeated-prefix in-memory cache hit/restore continued through DFlash speculative decode instead of falling back to AR.

cubic-dev-ai

2 issues found across 27 files


cubic-dev-ai

5 issues found across 32 files


Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/backend_factory.cpp">

<violation number="1" location="server/src/common/backend_factory.cpp:55">
P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</violation>
</file>

<file name="server/test/test_server_unit.cpp">

<violation number="1" location="server/test/test_server_unit.cpp:1366">
P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</violation>
</file>


cubic-dev-ai · 2026-05-28T07:13:31Z

+            cfg.device             = args.device;
+            cfg.draft_gpu          = args.draft_device.gpu;
+            cfg.remote_draft       = args.remote_draft;
+            cfg.fa_window          = args.fa_window;


P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path

Prompt for AI agents

Check if this issue is valid. If so, understand the root cause and fix it. At server/src/common/backend_factory.cpp, line 55:
<comment>Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</comment>
<file context>
@@ -42,6 +45,30 @@ std::unique_ptr<ModelBackend> create_backend(const BackendArgs & args) {
+            cfg.device             = args.device;
+            cfg.draft_gpu          = args.draft_device.gpu;
+            cfg.remote_draft       = args.remote_draft;
+            cfg.fa_window          = args.fa_window;
+            cfg.kq_stride_pad      = args.kq_stride_pad;
+            cfg.draft_ctx_max      = args.draft_ctx_max;
</file context>

cubic-dev-ai · 2026-05-28T07:13:31Z

+    GenerateResult restored = backend.restore_and_generate(2, restore_req, io);
+
+    TEST_ASSERT(restored.ok);
+    TEST_ASSERT(raw->dflash_called);


P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.

Prompt for AI agents

Check if this issue is valid. If so, understand the root cause and fix it. At server/test/test_server_unit.cpp, line 1366:
<comment>Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</comment>
<file context>
@@ -1184,6 +1188,211 @@ static void test_normalize_responses_tool_followup_messages() {
+    GenerateResult restored = backend.restore_and_generate(2, restore_req, io);
+
+    TEST_ASSERT(restored.ok);
+    TEST_ASSERT(raw->dflash_called);
+    TEST_ASSERT(raw->restored_slot == 2);
+    TEST_ASSERT(!raw->reset_called);
</file context>

Suggested change

-    TEST_ASSERT(raw->dflash_called);
+    TEST_ASSERT(!raw->dflash_called);

Record the clean integration of PRs Luce-Org#265 and Luce-Org#273, the refreshed conflict probes for the remaining selective-port PRs, and the current validation results.

weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from 80108ba to 902cf3b Compare May 23, 2026 13:12

weicj marked this pull request as ready for review May 23, 2026 15:58

cubic-dev-ai Bot reviewed May 23, 2026

Comment thread dflash/src/common/layer_split_backend.cpp Outdated

Comment thread dflash/src/common/backend_ipc.cpp Outdated

weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from 902cf3b to 6fd8d9e Compare May 23, 2026 16:53

weicj mentioned this pull request May 24, 2026

feat(server): add Gemma4 target-layer-split adapter #273

Merged

weicj marked this pull request as draft May 24, 2026 06:28

weicj added 6 commits May 28, 2026 13:50

feat(server): add target-layer-split backend adapter path

66ab05c

refactor(server): generalize target layer-split adapter path

599217d

test(server): keep server unit tests CPU-only

8415f1e

fix(server): restore DFlash state for split targets

b4ce59b

fix(server): align layer split load plan with target load plan

061e31c

fix(server): link layer split unit tests with common library

73c4a85

weicj force-pushed the feat-cpp-server-target-layer-split-prep branch from ed8d5e2 to 73c4a85 Compare May 28, 2026 06:21

weicj marked this pull request as ready for review May 28, 2026 07:07

cubic-dev-ai Bot reviewed May 28, 2026

fix(server): tighten layer-split validation and cleanup

054af28

davide221 merged commit 43457d8 into Luce-Org:main May 28, 2026
3 checks passed

This was referenced May 28, 2026

pflash + dflash optimization on top of qwen35moe (PR #262) #280

Open

feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity #274

Open


davide221 merged 7 commits into Luce-Org:main from weicj:feat-cpp-server-target-layer-split-prep

2 participants
