
fix(server): support sampled requests in target layer split #295

Open
weicj wants to merge 1 commit into Luce-Org:main from weicj:fix-layer-split-sampling


@weicj
Collaborator

@weicj weicj commented May 28, 2026

Summary

From the user's side, target layer split currently works for greedy requests but rejects ordinary sampled OpenAI-compatible requests with sampling_unsupported. In practice, any request with temperature > 0 (for example temperature=0.7), or with a repetition, frequency, or presence penalty, fails even though the same model samples correctly on the single-card backend.

This PR fixes that gap for target layer split. Adapters that can return final-token logits may opt into the existing CPU sampler path, so split and non-split generation follow the same sampling rules. Adapters that have not implemented logits output still fail explicitly instead of silently falling back to inconsistent behavior.
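The opt-in behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only `LayerSplitAdapter::supports_cpu_sampling()` and the `sampling_unsupported` error name come from the description. `SamplingParams`, its fields, `needs_cpu_sampling()`, `LogitsCapableAdapter`, `check_sampling_gate()`, the default return value of `false`, and the convention that a repetition penalty of 1.0 means "off" are all assumptions made for the example.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical request parameters; field names are illustrative only.
struct SamplingParams {
    float temperature = 0.0f;
    float repetition_penalty = 1.0f;  // assumed: 1.0 means "no penalty"
    float frequency_penalty = 0.0f;
    float presence_penalty = 0.0f;
};

// True when plain argmax decoding cannot honor the request.
inline bool needs_cpu_sampling(const SamplingParams& p) {
    return p.temperature > 0.0f || p.repetition_penalty != 1.0f ||
           p.frequency_penalty != 0.0f || p.presence_penalty != 0.0f;
}

// Simplified stand-in for the adapter interface named in the PR.
struct LayerSplitAdapter {
    virtual ~LayerSplitAdapter() = default;
    // Assumed default: adapters must opt in explicitly.
    virtual bool supports_cpu_sampling() const { return false; }
};

// Hypothetical adapter that can return final-token logits and opts in.
struct LogitsCapableAdapter : LayerSplitAdapter {
    bool supports_cpu_sampling() const override { return true; }
};

// Returns an error code when the adapter cannot serve the request,
// or std::nullopt when the request may proceed.
inline std::optional<std::string> check_sampling_gate(
    const LayerSplitAdapter& adapter, const SamplingParams& params) {
    if (needs_cpu_sampling(params) && !adapter.supports_cpu_sampling()) {
        return std::string("sampling_unsupported");
    }
    return std::nullopt;
}
```

With a gate shaped like this, greedy requests pass for every adapter, while a request with temperature=0.7 passes only for adapters that opted in and fails explicitly otherwise.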

Changes

  • Add a LayerSplitAdapter::supports_cpu_sampling() capability gate.
  • Keep the existing sampling_unsupported protection for adapters that do not opt in.
  • Extend Qwen35 layer-split projection so it can return final-token logits as well as argmax tokens.
  • Extend Gemma4 layer-split projection in the same way, including Gemma4 final logit softcap before CPU sampling.
  • Make Qwen35 and Gemma4 layer-split adapters cache the final prefill logits and use the existing shared sample_logits() path during AR decode when temperature > 0, repetition penalty, frequency penalty, or presence penalty is active.
  • Persist and restore the cached final prefill logits with layer-split snapshots, so sampled restore paths do not fail or reuse stale logits when no new prefill is needed before decode.
  • Keep Qwen35 DFlash speculative decode on the greedy path only; sampled / penalty requests now stay on AR decode where CPU sampling is available.
  • Add unit coverage for the layer-split sampling capability gate.
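
As a rough illustration of two of the changes above (the softcap applied before CPU sampling, and keeping speculative decode on the greedy path), here is a small self-contained sketch. It assumes the final logit softcap takes the tanh form used in earlier Gemma releases; the function names, the `DecodePath` enum, and the `cap <= 0` "disabled" convention are hypothetical and not taken from this PR.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Soft-caps logits in place: x -> cap * tanh(x / cap).
// Assumed convention: cap <= 0 disables the softcap.
inline void apply_logit_softcap(std::vector<float>& logits, float cap) {
    if (cap <= 0.0f) return;
    for (float& x : logits) {
        x = cap * std::tanh(x / cap);
    }
}

// Hypothetical labels for the decode paths mentioned in the list above.
enum class DecodePath { SpeculativeGreedy, ArGreedy, ArCpuSampling };

// Sampled or penalty requests always take AR decode with CPU sampling;
// speculative decode is considered only for greedy requests.
inline DecodePath choose_decode_path(bool wants_sampling,
                                     bool speculative_available) {
    if (wants_sampling) return DecodePath::ArCpuSampling;
    return speculative_available ? DecodePath::SpeculativeGreedy
                                 : DecodePath::ArGreedy;
}
```

Applying the softcap before sampling matters because the cap changes the relative spacing of large logits, so sampling from uncapped logits would draw from a different distribution than the model intends. The argmax is unaffected, since tanh is monotonic, which is presumably why the greedy path did not need it.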

Notes

  • This does not add a new sampler. It reuses the existing shared CPU sampler so split and non-split paths follow the same sampling rules.
  • Current validation: CUDA SM61 dflash_common, test_server_unit, and dflash_server build passed; test_server_unit passed with 1606 assertions and 0 failures. HIP gfx906 dflash_server build passed, and a dual Pro VII Gemma4 E4B Q4 layer-split smoke returned an OpenAI-compatible completion with temperature=0.7.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 11 files


Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp
Comment thread server/src/gemma4/gemma4_layer_split_adapter.cpp
@weicj changed the title from "fix(server): enable sampling for target layer split" to "fix(server): support sampled requests in target layer split" on May 28, 2026
@weicj force-pushed the fix-layer-split-sampling branch from 742b86c to a9aedf7 on May 28, 2026 18:08
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
# Conflicts:
#	server/src/gemma4/gemma4_layer_split_adapter.cpp
#	server/src/qwen35/qwen35_layer_split_adapter.cpp
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 28, 2026
Remove the repeated drafter/anchor test target block so CMake can configure the integration stack after merging upstream and PR Luce-Org#295.
