feat(server): add draft residency policy #290

Draft
weicj wants to merge 1 commit into Luce-Org:main from weicj:feat-server-draft-residency-policy

@weicj weicj commented May 28, 2026

Summary

This PR turns the existing draft park/unpark and --lazy-draft behavior into an explicit C++ server residency policy shared by PFlash and DFlash.

On smaller or tightly split GPU setups, draft weights can remain resident after their active work is finished and compete for VRAM with target restore, target shards, or the next draft load. This PR adds one shared policy surface so that PFlash can release the drafter after compression, and DFlash can keep its existing lazy-draft lifecycle through the same resolver.

Changes

Shared residency policy

  • Add server/src/placement/draft_residency.h as the common C++ policy layer on top of the existing draft park/unpark capability.
  • Add three policy modes:
    • auto: preserve the existing resident behavior by default, while still honoring the low-VRAM / lazy-draft hint.
    • persistent: keep draft weights loaded across requests.
    • request-scoped: release or park draft weights after the request-side draft work is complete.
  • Add --draft-residency auto|persistent|request-scoped to the server CLI/config surface.
  • Keep --lazy-draft as a compatibility alias for --draft-residency=request-scoped, so existing callers do not break while the lifecycle logic moves into the shared resolver.
  • Expose the resolved mode in the /props response as runtime.draft_residency.
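
As a rough illustration of the policy surface described above, the three modes and the `auto` resolution could look like the sketch below. All names here (`DraftResidency`, `parse_draft_residency`, `resolve_draft_residency`) are hypothetical and are not taken from `draft_residency.h`. The sketch also assumes that an explicit `persistent` or `request-scoped` value wins over the lazy-draft hint, which this description does not specify.

```cpp
#include <optional>
#include <string_view>

// Hypothetical names for illustration only; the real declarations in
// server/src/placement/draft_residency.h may differ.
enum class DraftResidency { Auto, Persistent, RequestScoped };

// Parse a --draft-residency value; unknown strings yield std::nullopt.
inline std::optional<DraftResidency> parse_draft_residency(std::string_view s) {
    if (s == "auto") return DraftResidency::Auto;
    if (s == "persistent") return DraftResidency::Persistent;
    if (s == "request-scoped") return DraftResidency::RequestScoped;
    return std::nullopt;
}

// Resolve the effective mode.
// Assumption: explicit modes are returned unchanged; only Auto consults the
// low-VRAM / lazy-draft hint and otherwise keeps the draft resident.
inline DraftResidency resolve_draft_residency(DraftResidency requested,
                                              bool lazy_draft_hint) {
    if (requested != DraftResidency::Auto) return requested;
    return lazy_draft_hint ? DraftResidency::RequestScoped
                           : DraftResidency::Persistent;
}

// String form, e.g. for reporting the resolved mode.
inline const char* to_string(DraftResidency m) {
    switch (m) {
        case DraftResidency::Auto:          return "auto";
        case DraftResidency::Persistent:    return "persistent";
        case DraftResidency::RequestScoped: return "request-scoped";
    }
    return "unknown";
}
```

Under these assumptions, a caller that only passes `--lazy-draft` would resolve to request-scoped, while a caller that passes nothing would keep the existing resident behavior.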

PFlash lifecycle

  • Add a residency action to the PFlash compression request path.
  • For local Qwen3/Qwen35 PFlash, request-scoped mode releases the PFlash drafter after compression, before later target/draft restore work needs memory.
  • For remote PFlash, the same policy is carried through the IPC path so the remote drafter can be closed after compression instead of remaining unexpectedly resident.
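
The release-after-compress step can be pictured as a small decision made once compression finishes. The sketch below is not the PR's code: `ResidencyAction`, `pflash_action_after_compress`, `finish_compression`, and `FakeDrafter` are invented for illustration, and the drafter is reduced to a flag.

```cpp
// Hypothetical types for illustration; not the PR's actual API.
enum class DraftResidency { Auto, Persistent, RequestScoped };
enum class ResidencyAction { Keep, Release };

// Decide what happens to the PFlash drafter once compression is done.
// Only a resolved request-scoped mode asks for a release in this sketch.
inline ResidencyAction pflash_action_after_compress(DraftResidency resolved) {
    return resolved == DraftResidency::RequestScoped ? ResidencyAction::Release
                                                     : ResidencyAction::Keep;
}

// Stand-in for a drafter that holds GPU memory while loaded.
struct FakeDrafter {
    bool loaded = true;
    void release() { loaded = false; }
};

// Apply the action right after compression, before any later
// target/draft restore work would need the memory.
template <class Drafter>
void finish_compression(Drafter& drafter, DraftResidency resolved) {
    if (pflash_action_after_compress(resolved) == ResidencyAction::Release) {
        drafter.release();
    }
}
```

For the remote case, the same action value could presumably be serialized into the IPC request so the remote side performs the release. How the PR encodes it is not shown here.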

DFlash lifecycle

  • Route DFlash decode draft through the same policy before and after generation.
  • In request-scoped mode, the decode draft is unparked before generation and parked again after generation.
  • Keep the existing lazy_draft behavior available, but express it through the same residency resolver used by PFlash.
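
One way to express "unpark before generation, park again after" is an RAII guard, so the park step also runs on early returns. This is a sketch under assumptions, not the PR's implementation: `ScopedDraftResidency`, `generate_with_policy`, and `FakeDraft` are hypothetical, and the `park()`/`unpark()` member names are assumed rather than taken from the server code.

```cpp
// Hypothetical names for illustration; not the PR's actual API.
enum class DraftResidency { Auto, Persistent, RequestScoped };

// Stand-in for a decode draft with park/unpark capability.
struct FakeDraft {
    bool resident = false;
    int unparks = 0;
    int parks = 0;
    void unpark() { resident = true; ++unparks; }
    void park() { resident = false; ++parks; }
};

// Unparks on construction and parks on destruction, but only when the
// resolved mode is request-scoped; other modes leave the draft untouched.
template <class Draft>
class ScopedDraftResidency {
public:
    ScopedDraftResidency(Draft& draft, DraftResidency resolved)
        : draft_(draft), scoped_(resolved == DraftResidency::RequestScoped) {
        if (scoped_) draft_.unpark();
    }
    ~ScopedDraftResidency() {
        if (scoped_) draft_.park();
    }
    ScopedDraftResidency(const ScopedDraftResidency&) = delete;
    ScopedDraftResidency& operator=(const ScopedDraftResidency&) = delete;

private:
    Draft& draft_;
    bool scoped_;
};

// Placeholder for generation: reports whether the draft was resident
// while the guard was alive.
template <class Draft>
bool generate_with_policy(Draft& draft, DraftResidency resolved) {
    ScopedDraftResidency<Draft> guard(draft, resolved);
    return draft.resident;
}
```

The guard keeps the before/after pairing in one place, which matches the goal of routing both steps through a single policy rather than scattering lifecycle calls across the decode path.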

Backend coverage

  • Qwen3 / Qwen35: support PFlash release-after-compress through the local PFlash path.
  • Qwen35-family DFlash: support request-scoped decode draft park/unpark through the shared policy.
  • The policy layer itself is backend-neutral and lives under server/src/placement/, so future draft-producing paths can reuse it without adding another lifecycle flag.

Notes

  • Added unit coverage for policy parsing, PFlash/DFlash residency resolution, default config behavior, and /props reporting.
  • Local CUDA test_server_unit passed with 1568 assertions, 0 failures.
  • Gemma4 draft-only park/unpark support is intentionally left for a follow-up PR because it requires extracting the existing Gemma4 draft-load path out of init().

@weicj weicj force-pushed the feat-server-draft-residency-policy branch from b1cc5a8 to 8cef6bb Compare May 31, 2026 11:41