Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,13 +40,14 @@ Organized by domain (model line / subsystem / playbook / lesson) instead of by l

| Path | TL;DR |
| --- | --- |
| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, prefix-cache design, serving concurrency. |
| `models/qwen35/roadmap.md` | Qwen3.5-4B roadmap (2026-06 review): decode-tuning refresh improves direct TPOT by 2-3%, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Open items: HND prefill staging, TP Phase 1 implementation, prefix-cache design, serving concurrency. |
| `models/qwen35/kv-admission.md` | Issue #254 complete: Qwen3.5 now uses full-lifetime KV admission, deferred pressure handling, impossible-request rejection, explicit error semantics, direct rejection-event coverage, RTX 5090 e2e, and real HTTP pressure/post-pressure validation. |
| `models/qwen35/optimization.md` | Hybrid 24 linear + 8 full attn optimization ledger. Decode-tuning refresh fuses MLP gate/up and tunes decode cublasLt buckets, improving direct TPOT by 2-3%; vLLM still leads 1024/256 HTTP decode. |
| `models/qwen35/accuracy.md` | Qwen3.5-4B HF bf16 logits goldens through `past_key_values`: short replay covers sequential graph, bucket-straddling batched graph, and slot-compaction; long replay covers 4097/8192-token prompts; full GSM8K 8-shot now matches the HF baseline within 0.15 percentage points. |
| `models/qwen35/model-crate.md` | `openinfer-qwen35-4b` owns Qwen3.5 model/scheduler/recurrent ops/tests/benches; feature-gated behind `qwen35-4b` (Triton AOT is the only Python build dependency); root loads it through `EngineHandle`. Build/check/clippy, root bench sanity check, historical Qwen3.5 e2e, and scheduler e2e records live here. |
| `models/qwen35/kernel-plan.md` | Qwen3.5-4B has a `openinfer_qwen35_4b::kernel_plan()` static descriptor mirroring the qwen3 module — enumerates every prefill/decode/unified op with its Rust call site, backend, and notes, so you can dump the active kernel mix without reading call sites. Pure refactor (issue #256), no kernel behavior change. |
| `models/qwen35/batched-step-tail.md` | Qwen3.5 issue #353 implementation record: final prefill tail is batched, decode/unified sample from batched logits, host full-vocab copies are logprobs-only, HF + scheduler e2e pass, and final serving A/B supports only the first-token/short-output TTFT claim. |
| `models/qwen35/tp-design.md` | Qwen3.5 TP design: Phase 1 is eager dense TP on Qwen3's controller/worker runtime; validate TP2 first, fail closed for indivisible degrees and TP+CUDA Graph, shard dense full-attention/MLP, and leave sharded linear/GDR state to follow-up. |

## models / deepseek-v4

Expand Down
6 changes: 3 additions & 3 deletions docs/models/qwen35/roadmap.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Qwen3.5-4B Roadmap

> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, prefix-cache design, and the serving-level concurrency gap.
> **TL;DR:** Qwen3.5-4B is decode-correct and still improving: the decode-tuning refresh improves direct TPOT by `2.1-3.2%`, while vLLM still leads 1024/256 HTTP decode and high-concurrency throughput. Long-prompt HF logits and GSM8K gates cover the old 4096-position RoPE boundary. Remaining structural items are HND prefill staging, TP Phase 1 implementation, prefix-cache design, and the serving-level concurrency gap.
>
> **Last touched:** 2026-06

Expand All @@ -20,7 +20,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:
| Admission | ✓ existing full-lifetime KV admission and explicit `Rejected` events cover impossible KV requests; #253 adds the context-window rejection reason before prefill/decode | `scheduler.rs`, `src/scheduler/plan.rs`, `docs/models/qwen35/kv-admission.md` |
| Scheduler tests | Partial: current plan selection, full-lifetime admission, context-window rejection, slot assignment, and slot-compaction decisions are CPU-tested; GPU execution remains coupled to the production scheduler | `src/scheduler/plan.rs` |
| Step tail | Local branch verified: #353 batches the prefill final norm/lm_head tail, samples decode/unified rows from batched logits, and keeps host full-vocab copies only for requested logprobs; HF/e2e gates pass, short-output serving A/B shows TTFT benefit, long-decode TPOT remains a no-claim diagnostic | `docs/models/qwen35/batched-step-tail.md` |
| TP | ✗ absent (single GPU only) | — |
| TP | Design settled for Phase 1 eager dense TP: reuse Qwen3 controller/worker runtime, validate `TP=2` first, shard dense full-attention/MLP, replicate linear-attention/GDR state per rank, and fail closed for `TP > 1` CUDA Graph or indivisible shapes. Implementation still pending. | `docs/models/qwen35/tp-design.md` |
| Prefix cache | ✗ absent; recurrent GDR state (~48MB per boundary snapshot) makes "prefix hit" itself a design question | — |

## Roadmap
Expand All @@ -41,7 +41,7 @@ Tracking issue: see the `[Model] Qwen3.5-4B roadmap` GitHub issue. Sibling doc:

### Later

- **TP** — no sharding design exists for the hybrid stack (GDR state sharding is the open question). Design-first, no driver today.
- **TP Phase 2** — sharded linear attention / GDR recurrent state remains future design work after Phase 1 eager dense TP is correct.
- **CUDA-graph prefill** — prefill is eager and serial; revisit after 6 changes the memory layout.

## Cleanup ledger
Expand Down
244 changes: 244 additions & 0 deletions docs/models/qwen35/tp-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,244 @@
# Qwen3.5 Tensor Parallelism Design

> **TL;DR:** Qwen3.5 tensor parallelism should reuse Qwen3's controller/worker TP runtime and stay degree-parametric. Phase 1 is correctness-first eager dense TP: validate `TP=2` first, fail closed on indivisible degrees and `TP > 1` CUDA Graph, shard dense full-attention/MLP, and keep linear-attention/GDR state replicated per rank before tackling sharded GDR state.
>
> **Last touched:** 2026-06

## Goal

Add tensor-parallel support for `Qwen3.5-4B` by reusing the Qwen3 TP runtime instead of designing a second parallel execution stack.

The implementation should be degree-parametric where the model dimensions divide cleanly. `TP=2` is the first validation target, not an architectural limit. Unsupported or indivisible degrees must fail closed before model load.

## Qwen3 Runtime Reuse

Reuse the Qwen3 TP shape:

- controller/worker broadcast execution model
- `RequestId` request identity
- coarse-grained prefill/decode/unified/drop step protocol
- rank-local worker-owned model state
- rank-local CUDA context, cuBLAS, graph, and NCCL resources
- hidden all-reduce after row-parallel projections
- replicated embedding/lm_head as the first-pass simplification

Qwen3.5-specific design work should stay focused on model geometry and state ownership: hybrid layer layout, gated q projection, linear-attention conv state, and GDR recurrent state.

## Boundaries

This design does not cover multi-node TP, data parallelism, pipeline parallelism, vocab-parallel embedding/lm_head, or Qwen3.5 prefix-cache/recurrent-state snapshots.

Phase 1 does not shard linear attention or change GDR kernel shapes. Phase 2 does not change GDR math, does not all-reduce recurrent state, and does not move recurrent state ownership back into the scheduler.

## Settled Phase 1 Contract

These decisions are settled before implementation starts.

- `TP=1` must preserve the current single-GPU behavior.
- `TP=2` is the first correctness target. The implementation may stay degree-parametric, but unsupported or indivisible degrees must fail before model load.
- `TP > 1` is eager-only in Phase 1. CUDA Graph under TP must fail closed instead of silently falling back or partially capturing.
- Reuse the Qwen3 controller/worker broadcast execution model and avoid a second long-lived Qwen3.5-specific TP runtime shape.
- Shard dense full-attention and MLP operators.
- Replicate embedding and tied `lm_head`.
- Replicate linear-attention/GDR weights in Phase 1.
- Each rank worker owns and mutates its own full linear-attention conv state and GDR recurrent state copy.
- The scheduler owns logical request lifecycle and logical KV/page lifecycle only.
- Full-attention KV is physically rank-local and sharded by local KV heads, but one logical request/page assignment is mirrored across all ranks.
- `DropRequest`, finish cleanup, cancellation cleanup, and slot reuse must release or reset the corresponding rank-local KV/recurrent/conv state on every rank.
- Qwen3.5 gated `q_proj` slicing is an explicit acceptance gate: every rank must receive both q rows and gate rows for its local query heads.
- MLP gate/up row sharding and down column sharding require explicit reconstruction or layout tests.

## Still Open / Future Discussion

These topics should not block Phase 1 eager dense TP, but they remain design work before any later implementation.

- TP CUDA Graph support: graph state ownership per rank, synchronized capture/replay order, NCCL capture behavior, graph padding slots, and recurrent/conv D2D slot compaction under capture.
- Sharded linear-attention/GDR execution: local GDR AOT kernel shapes, local recurrent-state layout, local conv state layout, and Phase 2 weight slicing.
- TP-aware prefix cache or recurrent-state snapshots.
- Vocab-parallel embedding or `lm_head`.
- Multi-node TP, data parallelism, and pipeline parallelism.
- Performance optimization claims. Phase 1 is a correctness/runtime milestone, not a throughput milestone.

## Why Dense First, GDR Second

Qwen3.5 has two separable TP problems.

The dense part is already proven by Qwen3: full-attention head sharding, local KV heads, MLP intermediate sharding, all-reduce after row-parallel projections, and worker-thread CUDA/NCCL execution.

The linear-attention part is Qwen3.5-specific: conv state and GDR recurrent state are long-lived request state, current GDR AOT kernels are built for the global value-head shape, and slot compaction / graph padding / `DropRequest` must all preserve rank-local recurrent state. If dense TP and GDR TP land together, failures are hard to attribute. Phase 1 narrows correctness debugging to runtime + dense sharding; Phase 2 then isolates the GDR/recurrent contract.

## Architecture Summary

Qwen3.5-4B:

- 32 layers: 24 linear attention + 8 full attention
- full-attention layers: `3, 7, 11, 15, 19, 23, 27, 31`
- `hidden_size = 2560`
- `intermediate_size = 9216`
- tied embedding/lm_head
- `vocab_size = 248320`

Full attention:

- `num_attention_heads = 16`
- `num_key_value_heads = 4`
- `head_dim = 256`
- `q_dim = num_attention_heads * head_dim = 4096`
- `kv_dim = num_key_value_heads * head_dim = 1024`
- q projection includes an output gate, so gated q projection output dim is `2 * q_dim = 8192`

Linear attention:

- `linear_num_key_heads = 16`
- `linear_key_head_dim = 128`
- `linear_num_value_heads = 32`
- `linear_value_head_dim = 128`
- `linear_q_dim = linear_num_key_heads * linear_key_head_dim = 2048`
- `linear_k_dim = linear_q_dim`
- `linear_v_dim = linear_num_value_heads * linear_value_head_dim = 4096`
- `linear_qkv_dim = linear_q_dim + linear_k_dim + linear_v_dim = 8192`
- `linear_z_dim = linear_v_dim = 4096`
- recurrent state per linear layer: `[linear_num_value_heads, linear_key_head_dim, linear_value_head_dim] f32`
- conv state per linear layer: `linear_qkv_dim * (conv_kernel_dim - 1)` bf16

## Partition Contract

For any candidate `tp`, require:

- `num_attention_heads % tp == 0`
- `num_key_value_heads % tp == 0`
- `intermediate_size % tp == 0`
- Phase 2 additionally requires `linear_num_key_heads % tp == 0` and `linear_num_value_heads % tp == 0`

Full attention local dimensions:

- `local_q_heads = num_attention_heads / tp`
- `local_kv_heads = num_key_value_heads / tp`
- `local_q_dim = local_q_heads * head_dim`
- `local_kv_dim = local_kv_heads * head_dim`
- `local_gated_q_dim = 2 * local_q_dim`

Qwen3.5 full-attention `q_proj` must be sharded by head-local q/gate pairs. Each rank owns a contiguous query-head range, and for each owned head it must receive both that head's q rows and that head's gate rows. Do not reuse a naive contiguous row shard if the physical layout can split q rows from their gate rows.

MLP local dimensions:

- `local_intermediate = intermediate_size / tp`
- local fused `gate_up_proj` rows: `2 * local_intermediate`
- local `down_proj` input cols: `local_intermediate`

Linear-attention local dimensions for Phase 2:

- `local_linear_key_heads = linear_num_key_heads / tp`
- `local_linear_value_heads = linear_num_value_heads / tp`
- `local_linear_q_dim = local_linear_key_heads * linear_key_head_dim`
- `local_linear_k_dim = local_linear_q_dim`
- `local_linear_v_dim = local_linear_value_heads * linear_value_head_dim`
- `local_linear_qkv_dim = local_linear_q_dim + local_linear_k_dim + local_linear_v_dim`
- `local_linear_z_dim = local_linear_v_dim`
- local recurrent state: `[local_linear_value_heads, linear_key_head_dim, linear_value_head_dim] f32`
- local conv state: `local_linear_qkv_dim * (conv_kernel_dim - 1)` bf16

## Phase 1: Dense TP, Replicated Linear Attention

Shard:

- full-attention `q_proj`, `k_proj`, `v_proj`, `o_proj`
- full-attention KV cache over local KV heads
- MLP `gate_proj`, `up_proj`, `down_proj`

Replicate:

- embedding and tied lm_head
- all linear-attention weights
- all linear-attention conv state
- all GDR recurrent state
- existing GDR kernels and scratch shapes

Execution:

- full-attention: local q/k/v + local attention + local `o_proj`, then all-reduce hidden
- MLP: local gate/up + local activation + local `down_proj`, then all-reduce hidden
- linear attention: every rank runs the full layer and updates a full local recurrent-state copy; do not all-reduce replicated linear-attention output

State ownership:

- scheduler owns request admission, request identity, logical page allocation, streaming handles, sampling params, generation counters, and finish bookkeeping
- rank workers own rank-local model shards, rank-local physical KV buffers, rank-local decode buffers, and rank-local recurrent/conv state
- rank 0 is not special for state mutation; it follows the same worker command protocol as other ranks
- non-primary workers may return acknowledgement or step failure only, while the primary worker returns artifacts for scheduler-side result resolution
- all workers must observe the same ordered `RunPrefillStep`, `RunDecodeStep`, `RunUnifiedStep`, `DropRequest`, and `Shutdown` commands

CUDA Graph:

- Phase 1 TP execution is eager-only
- `tp_size > 1` with CUDA Graph enabled must return an explicit startup/configuration error before serving requests
- TP graph capture is a follow-up because Qwen3.5 graph state includes recurrent slots, slot compaction, padding slots, and NCCL ordering questions

Validation scope:

- first validated degree: `TP=2`
- Qwen3.5 HF logits gate
- Qwen3.5 scheduler e2e
- long prompt / chunked prefill path
- slot-compaction replay
- finish/drop followed by slot reuse without stale recurrent or conv state
- gated `q_proj` head-local q/gate slicing test
- MLP gate/up shard and down shard reconstruction/layout test
- basic TP2 serving smoke
- startup fails closed for unsupported or indivisible degrees
- startup fails closed for `tp_size > 1` with CUDA Graph enabled

## Phase 2: Sharded Linear Attention / GDR

Phase 2 converts linear attention from replicated execution to true TP execution.

Shard:

- `in_proj_qkv`, `in_proj_z`, `in_proj_b`, `in_proj_a`
- `dt_bias`, `A_log`
- conv state
- GDR recurrent state
- linear-attention `out_proj`

Execution:

- each rank computes local q/k/v/z/b/a
- each rank updates only local conv state and local GDR recurrent state
- each rank runs local gated RMSNorm/output-gate work
- each rank runs local `out_proj`
- all-reduce happens after `out_proj`

Never all-reduce GDR recurrent state or conv state. Their ownership is rank-local and request-local.

### vLLM Reference

Use vLLM's `Qwen3NextForCausalLM` / `QwenGatedDeltaNetAttention` as the reference contract, not as code to copy mechanically:

- GDN state shape depends on `tp_size`
- q/k/v/z projections are tensor-parallel column projections
- `out_proj` is row-parallel and reduces back to full hidden
- `dt_bias` and `A_log` are sharded over local value heads
- b/a projections are local-value-head aware; some quantized paths may replicate small projections and slice locally
- GDR prefill/decode kernels consume local head/state shapes

OpenInfer-specific work remains: worker-owned rank-local recurrent state, `RequestId` lifecycle, local-state slot compaction, `DropRequest` cleanup, and fail-closed kernel-shape validation.

Validation scope:

- Phase 1 gates still pass
- long HF logits replay under the validated degree
- slot compaction replay
- recurrent-state cleanup on finish/drop
- no stale local recurrent state after slot reuse

## References

- `docs/models/qwen3/tp-design.md`
- `openinfer-qwen3-4b/src/config.rs`
- `openinfer-qwen3-4b/src/executor.rs`
- `openinfer-qwen35-4b/src/config.rs`
- `openinfer-qwen35-4b/src/weights.rs`
- `openinfer-qwen35-4b/src/recurrent_state.rs`
- `openinfer-qwen35-4b/src/batch_decode.rs`
- vLLM `Qwen3NextForCausalLM`
- vLLM `QwenGatedDeltaNetAttention`