fix(server): support sampled requests in target layer split by weicj · Pull Request #295 · Luce-Org/lucebox-hub

weicj · 2026-05-28T17:55:11Z

Summary

From the user side, target layer split currently works for greedy requests but rejects normal sampled OpenAI-compatible requests with sampling_unsupported. In practice this means requests using temperature > 0 such as temperature=0.7, or requests using repetition / frequency / presence penalties, fail even though the same model can sample correctly on the single-card backend.

This PR fixes that gap for target layer split. Adapters that can return final-token logits may opt into the existing CPU sampler path, so split and non-split generation follow the same sampling rules. Adapters that have not implemented logits output still fail explicitly instead of silently falling back to inconsistent behavior.
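
As a rough illustration of the behaviour described above, the sketch below shows how a capability gate of this kind could decide between accepting a request and returning sampling_unsupported. Only `LayerSplitAdapter`, `supports_cpu_sampling()`, and the `sampling_unsupported` error name come from this PR; `SamplingParams`, `needs_sampling`, `GateResult`, `check_request`, and the two example adapters are hypothetical names introduced for the example, and the real server code may be structured differently.

```cpp
// Hypothetical request parameters; field names are illustrative only.
struct SamplingParams {
    float temperature = 0.0f;
    float repetition_penalty = 1.0f;  // 1.0 is assumed to mean "disabled"
    float frequency_penalty = 0.0f;
    float presence_penalty = 0.0f;
};

// True when plain argmax decoding cannot serve the request.
inline bool needs_sampling(const SamplingParams& p) {
    return p.temperature > 0.0f || p.repetition_penalty != 1.0f ||
           p.frequency_penalty != 0.0f || p.presence_penalty != 0.0f;
}

struct LayerSplitAdapter {
    virtual ~LayerSplitAdapter() = default;
    // Default keeps the old rejection; adapters opt in by overriding.
    virtual bool supports_cpu_sampling() const { return false; }
};

// Example adapter that only produces argmax tokens.
struct GreedyOnlyAdapter : LayerSplitAdapter {};

// Example adapter that can return final-token logits.
struct LogitsAdapter : LayerSplitAdapter {
    bool supports_cpu_sampling() const override { return true; }
};

enum class GateResult { Accepted, SamplingUnsupported };

// Reject sampled requests explicitly unless the adapter has opted in.
inline GateResult check_request(const LayerSplitAdapter& adapter,
                                const SamplingParams& params) {
    if (needs_sampling(params) && !adapter.supports_cpu_sampling()) {
        return GateResult::SamplingUnsupported;
    }
    return GateResult::Accepted;
}
```

With this shape, a greedy request passes on every adapter, while a request with temperature=0.7 passes only on an adapter that overrides the gate.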

Changes

Add a LayerSplitAdapter::supports_cpu_sampling() capability gate.
Keep the existing sampling_unsupported protection for adapters that do not opt in.
Extend Qwen35 layer-split projection so it can return final-token logits as well as argmax tokens.
Extend Gemma4 layer-split projection in the same way, including Gemma4 final logit softcap before CPU sampling.
Make the Qwen35 and Gemma4 layer-split adapters cache the final prefill logits and use the existing shared sample_logits() path during AR decode whenever temperature > 0 or a repetition, frequency, or presence penalty is active.
Persist and restore the cached final prefill logits with layer-split snapshots, so sampled restore paths do not fail or reuse stale logits when no new prefill is needed before decode.
Keep Qwen35 DFlash speculative decode on the greedy path only; sampled / penalty requests now stay on AR decode where CPU sampling is available.
Add unit coverage for the layer-split sampling capability gate.
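
Two of the changes listed above, the softcap applied before CPU sampling and the cached prefill logits that travel with a snapshot, can be sketched as follows. This is a minimal sketch under stated assumptions: the softcap is assumed to be the tanh form `cap * tanh(x / cap)`, and `apply_softcap`, `SplitState`, `Snapshot`, `save_snapshot`, `restore_snapshot`, and `can_sample_first_token` are hypothetical names, not the adapters' actual types or functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed tanh-style softcap: squashes every logit into [-cap, cap].
inline void apply_softcap(std::vector<float>& logits, float cap) {
    if (cap <= 0.0f) return;  // non-positive cap is treated as "disabled"
    for (float& x : logits) x = cap * std::tanh(x / cap);
}

// Hypothetical adapter state: last-token logits kept from prefill.
struct SplitState {
    std::vector<float> final_logits;  // empty means no logits are cached
    std::size_t n_past = 0;
};

// Hypothetical snapshot that carries the cached logits with the position.
struct Snapshot {
    std::vector<float> final_logits;
    std::size_t n_past = 0;
};

inline Snapshot save_snapshot(const SplitState& s) {
    Snapshot snap;
    snap.final_logits = s.final_logits;
    snap.n_past = s.n_past;
    return snap;
}

inline void restore_snapshot(SplitState& s, const Snapshot& snap) {
    // Overwrite rather than keep whatever logits the state held before,
    // so a restore never samples from stale values.
    s.final_logits = snap.final_logits;
    s.n_past = snap.n_past;
}

// A sampled decode can start without a new prefill only if logits exist.
inline bool can_sample_first_token(const SplitState& s) {
    return !s.final_logits.empty();
}
```

Storing the logits inside the snapshot is what lets a restored session draw its first sampled token without running prefill again.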

Notes

This does not add a new sampler. It reuses the existing shared CPU sampler so split and non-split paths follow the same sampling rules.
Current validation: CUDA SM61 dflash_common, test_server_unit, and dflash_server build passed; test_server_unit passed with 1606 assertions and 0 failures. HIP gfx906 dflash_server build passed, and a dual Pro VII Gemma4 E4B Q4 layer-split smoke returned an OpenAI-compatible completion with temperature=0.7.
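
For readers unfamiliar with what a CPU sampler of the kind mentioned in these notes does with final-token logits, here is a small illustrative stand-in. It is not the repository's sample_logits(): the function name `sample_token`, its signature, and the exact penalty arithmetic are assumptions made for the example, and repetition penalty is omitted for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Illustrative stand-in, NOT the repository's sample_logits().
// counts[i] is how many times token i already appeared in the output.
// Returns -1 for an empty logits vector.
inline int sample_token(std::vector<float> logits,
                        const std::vector<int>& counts,
                        float temperature,
                        float frequency_penalty,
                        float presence_penalty,
                        std::mt19937& rng) {
    if (logits.empty()) return -1;

    // Penalize tokens that were already generated.
    for (std::size_t i = 0; i < logits.size() && i < counts.size(); ++i) {
        if (counts[i] > 0) {
            logits[i] -= frequency_penalty * static_cast<float>(counts[i]) +
                         presence_penalty;
        }
    }

    const auto max_it = std::max_element(logits.begin(), logits.end());

    // temperature <= 0 falls back to argmax over the penalized logits.
    if (temperature <= 0.0f) {
        return static_cast<int>(std::distance(logits.begin(), max_it));
    }

    // Softmax with temperature; subtracting the max keeps exp() stable.
    const float max_logit = *max_it;
    std::vector<double> weights(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        weights[i] = std::exp((logits[i] - max_logit) / temperature);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Because the function consumes only a logits vector, the same routine can serve split and non-split backends alike, which is the property the notes rely on.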

cubic-dev-ai

2 issues found across 11 files



cubic-dev-ai Bot reviewed May 28, 2026


Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp

Comment thread server/src/gemma4/gemma4_layer_split_adapter.cpp

weicj changed the title ~~fix(server): enable sampling for target layer split~~ fix(server): support sampled requests in target layer split May 28, 2026

fix(server): enable sampling for target layer split

a9aedf7

weicj force-pushed the fix-layer-split-sampling branch from 742b86c to a9aedf7 on May 28, 2026 18:08

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026

Merge PR Luce-Org#295 into auto-integration

98da34f

# Conflicts:
#	server/src/gemma4/gemma4_layer_split_adapter.cpp
#	server/src/qwen35/qwen35_layer_split_adapter.cpp

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026

fix(ci): remove duplicate CMake test registrations

72355a1

Remove the repeated drafter/anchor test target block so CMake can configure the integration stack after merging upstream and PR Luce-Org#295.

This was referenced May 28, 2026

feat(server): add Laguna target-layer-split adapter #297

Open

refactor(server): share target layer-split runtime helpers #306

Open

weicj wants to merge 1 commit into Luce-Org:main from weicj:fix-layer-split-sampling
