refactor(server): share target layer-split runtime helpers by weicj · Pull Request #306 · Luce-Org/lucebox-hub

weicj · 2026-05-29T07:47:55Z

Summary

This PR refactors the target layer-split server path so new model adapters do not need to keep copying the same runtime shell. The existing Qwen35, Gemma4, and Laguna adapters all repeat shard/range/snapshot-backend setup and the autoregressive decode/sampling loop; this moves those shared pieces into server/src/common/layer_split_runtime.* while keeping model-specific loading, graph execution, cache layout, snapshot payloads, and EOS rules inside each adapter.

Changes

- Add server/src/common/layer_split_runtime.h/.cpp.
  - init_layer_split_runtime() centralizes target path validation, GGUF layer-count inspection, layer range calculation, shard metadata initialization, peer-access setup, and snapshot backend initialization.
  - run_layer_split_ar_decode() centralizes the AR decode loop, CPU sampling handoff, token emission, cancellation handling, and EOS termination.
- Update the Qwen35, Gemma4, and Laguna layer-split adapters to call the shared runtime helpers.
  - Qwen35 still owns DFlash draft loading, feature-ring handling, DFlash speculative decode, and its Qwen-specific forward path.
  - Gemma4 and Laguna still own their partial loaders, per-layer graph builders, cache/snapshot formats, and model-specific decode-position offset.
- Register the new runtime source in server/CMakeLists.txt.
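To make the "layer range calculation" step concrete, here is a minimal sketch of one way to assign contiguous layer ranges to shards. It assumes an even split in which the first shards absorb any remainder; the PR does not state the actual policy used by init_layer_split_runtime(), and the names LayerRange and split_layers_evenly are hypothetical, not taken from the branch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: half-open range [begin, end) of layer indices owned by one shard.
struct LayerRange {
    int begin;
    int end;
};

// Assumed policy (not confirmed by the PR): split n_layers across n_shards as
// evenly as possible; the first (n_layers % n_shards) shards get one extra layer.
std::vector<LayerRange> split_layers_evenly(int n_layers, int n_shards) {
    assert(n_layers >= 0 && n_shards > 0);
    std::vector<LayerRange> ranges;
    ranges.reserve(static_cast<std::size_t>(n_shards));
    const int base = n_layers / n_shards;
    const int extra = n_layers % n_shards;
    int begin = 0;
    for (int i = 0; i < n_shards; ++i) {
        const int count = base + (i < extra ? 1 : 0);
        ranges.push_back({begin, begin + count});  // contiguous, non-overlapping
        begin += count;
    }
    return ranges;
}
```

Under this assumed policy, 10 layers over 3 shards would yield the ranges [0, 4), [4, 7), and [7, 10). A real implementation may instead weight the split by per-device memory or accept user-specified ranges.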

Notes

Draft note: this branch is prepared as a follow-up to #297 (feat(server): add Laguna target-layer-split adapter) and should be rebased once #295 (fix(server): support sampled requests in target layer split) and #297 land.
This is intended as a follow-up to the target layer-split adapter work, not a user-facing feature change.
The refactor keeps the adapter boundary explicit: shared runtime policy stays in common, while architecture-specific execution remains under each model directory.
Further cleanup can move more repeated snapshot or handoff code into shared layer-split runtime helpers once the Qwen35/Gemma4/Laguna adapter shapes settle.
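The adapter boundary described in these notes can be illustrated with a small sketch: a shared autoregressive loop that owns cancellation, sampling handoff, token emission, and EOS termination, while the adapter supplies the model-specific pieces as callbacks. This is only an illustration under assumptions; DecodeHooks and run_ar_decode_sketch are hypothetical names, and the actual signature of run_layer_split_ar_decode() in the branch may look quite different.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical adapter hooks; the real adapter interface in the PR may differ.
struct DecodeHooks {
    std::function<std::vector<float>(int token, int pos)> forward;  // model-specific forward pass, returns logits
    std::function<int(const std::vector<float>&)> sample;           // CPU sampling handoff
    std::function<bool(int)> is_eos;                                // model-specific EOS rule
    std::function<void(int)> emit;                                  // token emission to the client
    std::function<bool()> cancelled;                                // request cancellation check
};

// Shared loop: feeds each sampled token back into the model until EOS,
// cancellation, or the token budget is reached. Returns tokens emitted.
int run_ar_decode_sketch(const DecodeHooks& hooks, int first_token,
                         int start_pos, int max_new_tokens) {
    assert(hooks.forward && hooks.sample && hooks.is_eos && hooks.emit && hooks.cancelled);
    int token = first_token;
    int emitted = 0;
    for (int i = 0; i < max_new_tokens; ++i) {
        if (hooks.cancelled()) break;  // stop before doing more work
        const std::vector<float> logits = hooks.forward(token, start_pos + i);
        token = hooks.sample(logits);
        hooks.emit(token);
        ++emitted;
        if (hooks.is_eos(token)) break;  // the EOS token itself is still emitted here
    }
    return emitted;
}
```

In this shape, a model-specific decode-position offset would be folded into start_pos by the adapter, so the shared loop never needs to know about it. Whether the EOS token should be emitted before terminating is a policy choice; the sketch emits it.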

weicj added 3 commits May 29, 2026 02:07

- a9aedf7 fix(server): enable sampling for target layer split
- 53dd168 feat(server): add Laguna target-layer-split adapter
- 988fc93 refactor(server): share target layer-split runtime helpers

weicj marked this pull request as ready for review May 29, 2026 10:40

cubic-dev-ai Bot reviewed May 29, 2026

weicj mentioned this pull request May 29, 2026

feat(server): reduce layer-split activation memory with backend precision policy #310

Open

weicj wants to merge 3 commits into Luce-Org:main from weicj:refactor-server-layer-split-runtime
