diff --git a/docs/index.md b/docs/index.md index 8cce9e77..fb443ff4 100644 --- a/docs/index.md +++ b/docs/index.md @@ -40,13 +40,14 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l | Path | TL;DR | | --- | --- | -| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, prefix-cache design, serving concurrency. | +| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, TP Phase 1 implementation, prefix-cache design, serving concurrency. | | `models/qwen35/kv-admission.md` | Issue #254 complete: Qwen3.5 now uses full-lifetime KV admission, deferred pressure handling, impossible-request rejection, explicit error semantics, direct rejection-event coverage, RTX 5090 e2e, and real HTTP pressure/post-pressure validation. | | `models/qwen35/optimization.md` | Hybrid 24 linear + 8 full attn optimization ledger. Decode-tuning refresh fuses MLP gate/up and tunes decode cublasLt buckets, improving direct TPOT by 2-3%; vLLM still leads 1024/256 HTTP decode. | | `models/qwen35/accuracy.md` | Qwen3.5-4B HF bf16 logits goldens through `past_key_values`: short replay covers sequential graph, bucket-straddling batched graph, and slot-compaction; long replay covers 4097/8192-token prompts; full GSM8K 8-shot now matches the HF baseline within 0.15 percentage points. | | `models/qwen35/model-crate.md` | `openinfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; feature-gated behind `qwen35-4b` (Triton AOT is the only Python build dependency); root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. | | `models/qwen35/kernel-plan.md` | Qwen3.5-4B has a `openinfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module — enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. | | `models/qwen35/batched-step-tail.md` | Qwen3.5 issue #353 implementation record: final prefill tail is batched, decode/unified sample from batched logits, host full-vocab copies are logprobs-only, HF + scheduler e2e pass, and final serving A/B supports only the first-token/short-output TTFT claim. | +| `models/qwen35/tp-design.md` | Qwen3.5 TP design: Phase 1 is eager dense TP on Qwen3's controller/worker runtime; validate TP2 first, fail closed for indivisible degrees and TP+CUDA Graph, shard dense full-attention/MLP, and leave sharded linear/GDR state to follow-up. | ## models / deepseek-v4 diff --git a/docs/models/qwen35/roadmap.md b/docs/models/qwen35/roadmap.md index b8950bd3..f60bde9d 100644 --- a/docs/models/qwen35/roadmap.md +++ b/docs/models/qwen35/roadmap.md @@ -1,6 +1,6 @@ # Qwen3.5-4B Roadmap -> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, prefix-cache design, and the serving-level concurrency gap. +> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, TP Phase 1 implementation, prefix-cache design, and the serving-level concurrency gap. > > **Last touched:** 2026-06 @@ -20,7 +20,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc: | Admission | ✓ existing full-lifetime KV admission and explicit `Rejected` events cover impossible KV requests; #253 adds the context-window rejection reason before prefill/decode | `scheduler.rs`, `src/scheduler/plan.rs`, `docs/models/qwen35/kv-admission.md` | | Scheduler tests | Partial: current plan selection, full-lifetime admission, context-window rejection, slot assignment, and slot-compaction decisions are CPU-tested; GPU execution remains coupled to the production scheduler | `src/scheduler/plan.rs` | | Step tail | Local branch verified: #353 batches the prefill final norm/lm_head tail, samples decode/unified rows from batched logits, and keeps host full-vocab copies only for requested logprobs; HF/e2e gates pass, short-output serving A/B shows TTFT benefit, long-decode TPOT remains a no-claim diagnostic | `docs/models/qwen35/batched-step-tail.md` | -| TP | ✗ absent (single GPU only) | — | +| TP | Design settled for Phase 1 eager dense TP: reuse Qwen3 controller/worker runtime, validate `TP=2` first, shard dense full-attention/MLP, replicate linear-attention/GDR state per rank, and fail closed for `TP > 1` CUDA Graph or indivisible shapes. Implementation still pending. | `docs/models/qwen35/tp-design.md` | | Prefix cache | ✗ absent; recurrent GDR state (~48MB per boundary snapshot) makes "prefix hit" itself a design question | — | ## Roadmap @@ -41,7 +41,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc: ### Later -- **TP** — no sharding design exists for the hybrid stack (GDR state sharding is the open question). Design-first, no driver today. +- **TP Phase 2** — sharded linear attention / GDR recurrent state remains future design work after Phase 1 eager dense TP is correct. - **CUDA-graph prefill** — prefill is eager and serial; revisit after 6 changes the memory layout. ## Cleanup ledger diff --git a/docs/models/qwen35/tp-design.md b/docs/models/qwen35/tp-design.md new file mode 100644 index 00000000..210fcd2c --- /dev/null +++ b/docs/models/qwen35/tp-design.md @@ -0,0 +1,244 @@ +# Qwen3.5 Tensor Parallelism Design + +> **TL;DR:** Qwen3.5 tensor parallelism should reuse Qwen3's controller/worker TP runtime and stay degree-parametric. Phase 1 is correctness-first eager dense TP: validate `TP=2` first, fail closed on indivisible degrees and `TP > 1` CUDA Graph, shard dense full-attention/MLP, and keep linear-attention/GDR state replicated per rank before tackling sharded GDR state. +> +> **Last touched:** 2026-06 + +## Goal + +Add tensor-parallel support for `Qwen3.5-4B` by reusing the Qwen3 TP runtime instead of designing a second parallel execution stack. + +The implementation should be degree-parametric where the model dimensions divide cleanly. `TP=2` is the first validation target, not an architectural limit. Unsupported or indivisible degrees must fail closed before model load. + +## Qwen3 Runtime Reuse + +Reuse the Qwen3 TP shape: + +- controller/worker broadcast execution model +- `RequestId` request identity +- coarse-grained prefill/decode/unified/drop step protocol +- rank-local worker-owned model state +- rank-local CUDA context, cuBLAS, graph, and NCCL resources +- hidden all-reduce after row-parallel projections +- replicated embedding/lm_head as the first-pass simplification + +Qwen3.5-specific design work should stay focused on model geometry and state ownership: hybrid layer layout, gated q projection, linear-attention conv state, and GDR recurrent state. + +## Boundaries + +This design does not cover multi-node TP, data parallelism, pipeline parallelism, vocab-parallel embedding/lm_head, or Qwen3.5 prefix-cache/recurrent-state snapshots. + +Phase 1 does not shard linear attention or change GDR kernel shapes. Phase 2 does not change GDR math, does not all-reduce recurrent state, and does not move recurrent state ownership back into the scheduler. + +## Settled Phase 1 Contract + +These decisions are settled before implementation starts. + +- `TP=1` must preserve the current single-GPU behavior. +- `TP=2` is the first correctness target. The implementation may stay degree-parametric, but unsupported or indivisible degrees must fail before model load. +- `TP > 1` is eager-only in Phase 1. CUDA Graph under TP must fail closed instead of silently falling back or partially capturing. +- Reuse the Qwen3 controller/worker broadcast execution model and avoid a second long-lived Qwen3.5-specific TP runtime shape. +- Shard dense full-attention and MLP operators. +- Replicate embedding and tied `lm_head`. +- Replicate linear-attention/GDR weights in Phase 1. +- Each rank worker owns and mutates its own full linear-attention conv state and GDR recurrent state copy. +- The scheduler owns logical request lifecycle and logical KV/page lifecycle only. +- Full-attention KV is physically rank-local and sharded by local KV heads, but one logical request/page assignment is mirrored across all ranks. +- `DropRequest`, finish cleanup, cancellation cleanup, and slot reuse must release or reset the corresponding rank-local KV/recurrent/conv state on every rank. +- Qwen3.5 gated `q_proj` slicing is an explicit acceptance gate: every rank must receive both q rows and gate rows for its local query heads. +- MLP gate/up row sharding and down column sharding require explicit reconstruction or layout tests. + +## Still Open / Future Discussion + +These topics should not block Phase 1 eager dense TP, but they remain design work before any later implementation. + +- TP CUDA Graph support: graph state ownership per rank, synchronized capture/replay order, NCCL capture behavior, graph padding slots, and recurrent/conv D2D slot compaction under capture. +- Sharded linear-attention/GDR execution: local GDR AOT kernel shapes, local recurrent-state layout, local conv state layout, and Phase 2 weight slicing. +- TP-aware prefix cache or recurrent-state snapshots. +- Vocab-parallel embedding or `lm_head`. +- Multi-node TP, data parallelism, and pipeline parallelism. +- Performance optimization claims. Phase 1 is a correctness/runtime milestone, not a throughput milestone. + +## Why Dense First, GDR Second + +Qwen3.5 has two separable TP problems. + +The dense part is already proven by Qwen3: full-attention head sharding, local KV heads, MLP intermediate sharding, all-reduce after row-parallel projections, and worker-thread CUDA/NCCL execution. + +The linear-attention part is Qwen3.5-specific: conv state and GDR recurrent state are long-lived request state, current GDR AOT kernels are built for the global value-head shape, and slot compaction / graph padding / `DropRequest` must all preserve rank-local recurrent state. If dense TP and GDR TP land together, failures are hard to attribute. Phase 1 narrows correctness debugging to runtime + dense sharding; Phase 2 then isolates the GDR/recurrent contract. + +## Architecture Summary + +Qwen3.5-4B: + +- 32 layers: 24 linear attention + 8 full attention +- full-attention layers: `3, 7, 11, 15, 19, 23, 27, 31` +- `hidden_size = 2560` +- `intermediate_size = 9216` +- tied embedding/lm_head +- `vocab_size = 248320` + +Full attention: + +- `num_attention_heads = 16` +- `num_key_value_heads = 4` +- `head_dim = 256` +- `q_dim = num_attention_heads * head_dim = 4096` +- `kv_dim = num_key_value_heads * head_dim = 1024` +- q projection includes an output gate, so gated q projection output dim is `2 * q_dim = 8192` + +Linear attention: + +- `linear_num_key_heads = 16` +- `linear_key_head_dim = 128` +- `linear_num_value_heads = 32` +- `linear_value_head_dim = 128` +- `linear_q_dim = linear_num_key_heads * linear_key_head_dim = 2048` +- `linear_k_dim = linear_q_dim` +- `linear_v_dim = linear_num_value_heads * linear_value_head_dim = 4096` +- `linear_qkv_dim = linear_q_dim + linear_k_dim + linear_v_dim = 8192` +- `linear_z_dim = linear_v_dim = 4096` +- recurrent state per linear layer: `[linear_num_value_heads, linear_key_head_dim, linear_value_head_dim] f32` +- conv state per linear layer: `linear_qkv_dim * (conv_kernel_dim - 1)` bf16 + +## Partition Contract + +For any candidate `tp`, require: + +- `num_attention_heads % tp == 0` +- `num_key_value_heads % tp == 0` +- `intermediate_size % tp == 0` +- Phase 2 additionally requires `linear_num_key_heads % tp == 0` and `linear_num_value_heads % tp == 0` + +Full attention local dimensions: + +- `local_q_heads = num_attention_heads / tp` +- `local_kv_heads = num_key_value_heads / tp` +- `local_q_dim = local_q_heads * head_dim` +- `local_kv_dim = local_kv_heads * head_dim` +- `local_gated_q_dim = 2 * local_q_dim` + +Qwen3.5 full-attention `q_proj` must be sharded by head-local q/gate pairs. Each rank owns a contiguous query-head range, and for each owned head it must receive both that head's q rows and that head's gate rows. Do not reuse a naive contiguous row shard if the physical layout can split q rows from their gate rows. + +MLP local dimensions: + +- `local_intermediate = intermediate_size / tp` +- local fused `gate_up_proj` rows: `2 * local_intermediate` +- local `down_proj` input cols: `local_intermediate` + +Linear-attention local dimensions for Phase 2: + +- `local_linear_key_heads = linear_num_key_heads / tp` +- `local_linear_value_heads = linear_num_value_heads / tp` +- `local_linear_q_dim = local_linear_key_heads * linear_key_head_dim` +- `local_linear_k_dim = local_linear_q_dim` +- `local_linear_v_dim = local_linear_value_heads * linear_value_head_dim` +- `local_linear_qkv_dim = local_linear_q_dim + local_linear_k_dim + local_linear_v_dim` +- `local_linear_z_dim = local_linear_v_dim` +- local recurrent state: `[local_linear_value_heads, linear_key_head_dim, linear_value_head_dim] f32` +- local conv state: `local_linear_qkv_dim * (conv_kernel_dim - 1)` bf16 + +## Phase 1: Dense TP, Replicated Linear Attention + +Shard: + +- full-attention `q_proj`, `k_proj`, `v_proj`, `o_proj` +- full-attention KV cache over local KV heads +- MLP `gate_proj`, `up_proj`, `down_proj` + +Replicate: + +- embedding and tied lm_head +- all linear-attention weights +- all linear-attention conv state +- all GDR recurrent state +- existing GDR kernels and scratch shapes + +Execution: + +- full-attention: local q/k/v + local attention + local `o_proj`, then all-reduce hidden +- MLP: local gate/up + local activation + local `down_proj`, then all-reduce hidden +- linear attention: every rank runs the full layer and updates a full local recurrent-state copy; do not all-reduce replicated linear-attention output + +State ownership: + +- scheduler owns request admission, request identity, logical page allocation, streaming handles, sampling params, generation counters, and finish bookkeeping +- rank workers own rank-local model shards, rank-local physical KV buffers, rank-local decode buffers, and rank-local recurrent/conv state +- rank 0 is not special for state mutation; it follows the same worker command protocol as other ranks +- non-primary workers may return acknowledgement or step failure only, while the primary worker returns artifacts for scheduler-side result resolution +- all workers must observe the same ordered `RunPrefillStep`, `RunDecodeStep`, `RunUnifiedStep`, `DropRequest`, and `Shutdown` commands + +CUDA Graph: + +- Phase 1 TP execution is eager-only +- `tp_size > 1` with CUDA Graph enabled must return an explicit startup/configuration error before serving requests +- TP graph capture is a follow-up because Qwen3.5 graph state includes recurrent slots, slot compaction, padding slots, and NCCL ordering questions + +Validation scope: + +- first validated degree: `TP=2` +- Qwen3.5 HF logits gate +- Qwen3.5 scheduler e2e +- long prompt / chunked prefill path +- slot-compaction replay +- finish/drop followed by slot reuse without stale recurrent or conv state +- gated `q_proj` head-local q/gate slicing test +- MLP gate/up shard and down shard reconstruction/layout test +- basic TP2 serving smoke +- startup fails closed for unsupported or indivisible degrees +- startup fails closed for `tp_size > 1` with CUDA Graph enabled + +## Phase 2: Sharded Linear Attention / GDR + +Phase 2 converts linear attention from replicated execution to true TP execution. + +Shard: + +- `in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a` +- `dt_bias`, `A_log` +- conv state +- GDR recurrent state +- linear-attention `out_proj` + +Execution: + +- each rank computes local q/k/v/z/b/a +- each rank updates only local conv state and local GDR recurrent state +- each rank runs local gated RMSNorm/output-gate work +- each rank runs local `out_proj` +- all-reduce happens after `out_proj` + +Never all-reduce GDR recurrent state or conv state. Their ownership is rank-local and request-local. + +### vLLM Reference + +Use vLLM's `Qwen3NextForCausalLM` / `QwenGatedDeltaNetAttention` as the reference contract, not as code to copy mechanically: + +- GDN state shape depends on `tp_size` +- q/k/v/z projections are tensor-parallel column projections +- `out_proj` is row-parallel and reduces back to full hidden +- `dt_bias` and `A_log` are sharded over local value heads +- b/a projections are local-value-head aware; some quantized paths may replicate small projections and slice locally +- GDR prefill/decode kernels consume local head/state shapes + +OpenInfer-specific work remains: worker-owned rank-local recurrent state, `RequestId` lifecycle, local-state slot compaction, `DropRequest` cleanup, and fail-closed kernel-shape validation. + +Validation scope: + +- Phase 1 gates still pass +- long HF logits replay under the validated degree +- slot compaction replay +- recurrent-state cleanup on finish/drop +- no stale local recurrent state after slot reuse + +## References + +- `docs/models/qwen3/tp-design.md` +- `openinfer-qwen3-4b/src/config.rs` +- `openinfer-qwen3-4b/src/executor.rs` +- `openinfer-qwen35-4b/src/config.rs` +- `openinfer-qwen35-4b/src/weights.rs` +- `openinfer-qwen35-4b/src/recurrent_state.rs` +- `openinfer-qwen35-4b/src/batch_decode.rs` +- vLLM `Qwen3NextForCausalLM` +- vLLM `QwenGatedDeltaNetAttention`