feat(server): add draft residency policy by weicj · Pull Request #290 · Luce-Org/lucebox-hub

weicj · 2026-05-28T09:47:12Z

Summary

This PR turns the existing draft park/unpark and --lazy-draft behavior into an explicit C++ server residency policy shared by PFlash and DFlash.

On smaller or tightly split GPU setups, draft weights can remain resident after their active work is finished and compete for memory with target restore, target shards, or the next draft load. This PR adds one shared policy surface so PFlash can release the drafter after compression and DFlash can keep its existing lazy-draft lifecycle through the same resolver.

Changes

Shared residency policy

Add server/src/placement/draft_residency.h as the common C++ policy layer on top of the existing draft park/unpark capability.
Add three policy modes:
- auto: preserve the existing resident behavior by default, while still honoring the low-VRAM / lazy-draft hint.
- persistent: keep draft weights loaded across requests.
- request-scoped: release or park draft weights after the request-side draft work is complete.
Add --draft-residency auto|persistent|request-scoped to the server CLI/config surface.
Keep --lazy-draft as a compatibility alias for --draft-residency=request-scoped, so existing callers do not break while the lifecycle logic moves into the shared resolver.
Expose the resolved mode as runtime.draft_residency in the /props response.
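
As a rough illustration of the policy surface described above, the three modes and the resolver could look like the sketch below. Every identifier here (DraftResidency, parse_draft_residency, resolve_draft_residency, to_string) is hypothetical and not taken from draft_residency.h. The sketch also assumes that an explicit mode takes precedence over the lazy-draft hint and that auto resolves to a concrete mode; the PR text does not state either point.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names; the real declarations in
// server/src/placement/draft_residency.h may differ.
enum class DraftResidency { Auto, Persistent, RequestScoped };

// Parse the value given to --draft-residency.
// Returns std::nullopt for anything other than the three documented modes.
inline std::optional<DraftResidency> parse_draft_residency(std::string_view s) {
    if (s == "auto") return DraftResidency::Auto;
    if (s == "persistent") return DraftResidency::Persistent;
    if (s == "request-scoped") return DraftResidency::RequestScoped;
    return std::nullopt;
}

// Resolve the effective mode.
// Assumption: "auto" keeps the draft resident unless the low-VRAM /
// lazy-draft hint is set, and an explicit mode always wins over the hint.
inline DraftResidency resolve_draft_residency(DraftResidency requested,
                                              bool lazy_draft_hint) {
    if (requested == DraftResidency::Auto) {
        return lazy_draft_hint ? DraftResidency::RequestScoped
                               : DraftResidency::Persistent;
    }
    return requested;
}

// String form of a mode, e.g. for reporting it under /props.
inline const char* to_string(DraftResidency mode) {
    switch (mode) {
        case DraftResidency::Auto: return "auto";
        case DraftResidency::Persistent: return "persistent";
        case DraftResidency::RequestScoped: return "request-scoped";
    }
    return "unknown";
}
```

Under these assumptions, passing --lazy-draft alone behaves like --draft-residency=request-scoped, because the default auto mode picks up the hint.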

PFlash lifecycle

Add a residency action to the PFlash compression request path.
For local Qwen3/Qwen35 PFlash, request-scoped mode releases the PFlash drafter after compression, before later target/draft restore work needs memory.
For remote PFlash, the same policy is carried through the IPC path so the remote drafter can be closed after compression instead of remaining unexpectedly resident.
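
The release-after-compress ordering can be sketched as follows. FakeDrafter, ResidencyAction, and compress_with_residency are hypothetical stand-ins, not the server's types, and the "compression" here is a placeholder that simply keeps the first half of the prompt.

```cpp
#include <string>

// Hypothetical action attached to a compression request.
enum class ResidencyAction { Keep, ReleaseAfterCompress };

// Stand-in for a PFlash drafter; only tracks whether weights are loaded.
struct FakeDrafter {
    bool loaded = false;
    void load() { loaded = true; }
    void release() { loaded = false; }
    // Placeholder compression: keep the first half of the prompt.
    std::string compress(const std::string& prompt) const {
        return prompt.substr(0, prompt.size() / 2);
    }
};

// Load the drafter on demand, compress, then apply the residency action
// before returning, so later restore work sees the freed memory.
inline std::string compress_with_residency(FakeDrafter& drafter,
                                           const std::string& prompt,
                                           ResidencyAction action) {
    if (!drafter.loaded) {
        drafter.load();
    }
    std::string compressed = drafter.compress(prompt);
    if (action == ResidencyAction::ReleaseAfterCompress) {
        drafter.release();
    }
    return compressed;
}
```

In this sketch the release happens inside the compression call itself, which matches the stated goal of freeing the drafter before target or draft restore begins. A remote path would presumably carry the same action value in its IPC message.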

DFlash lifecycle

Route DFlash decode draft through the same policy before and after generation.
In request-scoped mode, the decode draft is unparked before generation and parked again after generation.
Keep the existing lazy_draft behavior available, but express it through the same residency resolver used by PFlash.
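
One way to picture the unpark-before / park-after pairing is a scope guard around generation. The sketch below is illustrative only: FakeDecodeDraft, ScopedDraftResidency, and generate_with_policy are hypothetical names, and the assumption that a parked draft is always unparked before generation, regardless of mode, is mine rather than the PR's.

```cpp
// Stand-in for the decode draft; counts park/unpark calls for inspection.
struct FakeDecodeDraft {
    bool parked = true;
    int unpark_calls = 0;
    int park_calls = 0;
    void unpark() { parked = false; ++unpark_calls; }
    void park() { parked = true; ++park_calls; }
};

// Scope guard: makes the draft available on entry and, in request-scoped
// mode only, parks it again when the scope ends (including on exceptions).
class ScopedDraftResidency {
public:
    ScopedDraftResidency(FakeDecodeDraft& draft, bool request_scoped)
        : draft_(draft), request_scoped_(request_scoped) {
        if (draft_.parked) {
            draft_.unpark();
        }
    }
    ~ScopedDraftResidency() {
        if (request_scoped_) {
            draft_.park();
        }
    }
    ScopedDraftResidency(const ScopedDraftResidency&) = delete;
    ScopedDraftResidency& operator=(const ScopedDraftResidency&) = delete;

private:
    FakeDecodeDraft& draft_;
    bool request_scoped_;
};

// Placeholder generation loop: counts steps taken while the draft is usable.
inline int generate_with_policy(FakeDecodeDraft& draft, bool request_scoped,
                                int n_tokens) {
    ScopedDraftResidency guard(draft, request_scoped);
    int produced = 0;
    for (int i = 0; i < n_tokens; ++i) {
        if (!draft.parked) {
            ++produced;
        }
    }
    return produced;
}
```

Tying the park step to a destructor means an early return or exception during generation still leaves the draft parked in request-scoped mode.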

Backend coverage

Qwen3 / Qwen35: support PFlash release-after-compress through the local PFlash path.
Qwen35-family DFlash: support request-scoped decode draft park/unpark through the shared policy.
The policy layer itself is backend-neutral and lives under server/src/placement/, so future draft-producing paths can reuse it without adding another lifecycle flag.

Notes

Added unit coverage for policy parsing, PFlash/DFlash residency resolution, default config behavior, and /props reporting.
Local CUDA test_server_unit passed with 1568 assertions, 0 failures.
Gemma4 draft-only park/unpark support is intentionally left for a follow-up PR because it requires extracting the existing Gemma4 draft-load path out of init().

weicj mentioned this pull request May 28, 2026

feat(server): add Gemma4 draft residency support #291

Draft

feat(server): add draft residency policy

8cef6bb

weicj force-pushed the feat-server-draft-residency-policy branch from b1cc5a8 to 8cef6bb Compare May 31, 2026 11:41

weicj wants to merge 1 commit into Luce-Org:main from weicj:feat-server-draft-residency-policy
