docs(qwen35): draft TP2 phased design #450
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1433c552b
Direction looks good: reusing the Qwen3 runtime and splitting dense vs GDR/recurrent is the right call. Two pieces of feedback:
1. Tighten it. The doc is longer than it needs to be. The six separate non-goal lists and the repeated per-phase acceptance lists carry a lot of redundancy; a reader should get the phase split and the partition contract in about half the length.
2. Don't hard-lock TP=2. Qwen3's TP config is already degree-parametric (
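As a rough illustration of what "degree-parametric" could mean for the partition contract, here is a minimal Python sketch. `TPConfig` and `shard_range` are hypothetical names invented for this example, not the Qwen3 runtime's actual types, and the head count of 16 is illustrative only. The point is that the shard boundaries are derived from `tp_size` rather than written for TP=2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TPConfig:
    """Hypothetical TP config; not the repo's actual type."""
    tp_size: int  # tensor-parallel degree
    rank: int     # this worker's index in [0, tp_size)

    def __post_init__(self):
        if self.tp_size < 1:
            raise ValueError("tp_size must be >= 1")
        if not 0 <= self.rank < self.tp_size:
            raise ValueError("rank must be in [0, tp_size)")


def shard_range(total: int, cfg: TPConfig) -> range:
    """Contiguous slice of `total` units (e.g. attention heads) owned by cfg.rank."""
    if total % cfg.tp_size != 0:
        raise ValueError(
            f"{total} units do not divide evenly across tp_size={cfg.tp_size}"
        )
    per_rank = total // cfg.tp_size
    start = cfg.rank * per_rank
    return range(start, start + per_rank)


if __name__ == "__main__":
    num_heads = 16  # illustrative value, not taken from the model config
    for tp_size in (1, 2, 4):
        owned = [list(shard_range(num_heads, TPConfig(tp_size, r)))
                 for r in range(tp_size)]
        print(tp_size, owned)
```

With this shape, TP=2 is just one valid value of `tp_size`; an unsupported degree fails loudly at configuration time instead of silently producing a wrong split.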
Thanks for the review. I tightened the design doc and changed the TP framing to match the Qwen3 runtime better. What changed:
I left the implementation plan split intact: Phase 1 validates dense full-attn/MLP TP with replicated linear/GDR, and Phase 2 handles sharded linear attention / GDR state using the vLLM Qwen3Next/GDN contract as the reference. If this is merged, I’ll open the two follow-up implementation issues myself for Phase 1 and Phase 2.
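For readers less familiar with the dense MLP part of Phase 1, the following Python sketch shows the standard column-parallel then row-parallel split and checks that it matches the unsharded result for several TP degrees. It is a simplified illustration under stated assumptions: a plain ReLU MLP stands in for the model's real (gated) MLP, small integer matrices are used so the comparison is exact, a Python sum stands in for the all-reduce, and every function name here is hypothetical rather than taken from the repo.

```python
def matmul(a, b):
    """Plain-Python matrix product for small matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def mlp_reference(x, w_up, w_down):
    """Unsharded MLP: relu(x @ w_up) @ w_down."""
    return matmul(relu(matmul(x, w_up)), w_down)


def mlp_tensor_parallel(x, w_up, w_down, tp_size):
    """Column-parallel w_up, row-parallel w_down, then sum the partial outputs."""
    hidden = len(w_down)  # intermediate dimension
    if hidden % tp_size != 0:
        raise ValueError("intermediate dimension must be divisible by tp_size")
    per_rank = hidden // tp_size
    out = None
    for rank in range(tp_size):
        idx = range(rank * per_rank, (rank + 1) * per_rank)
        w_up_shard = [[row[c] for c in idx] for row in w_up]  # columns of w_up
        w_down_shard = [w_down[r] for r in idx]               # rows of w_down
        partial = matmul(relu(matmul(x, w_up_shard)), w_down_shard)
        # Accumulating the partials stands in for the all-reduce across ranks.
        if out is None:
            out = partial
        else:
            out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, partial)]
    return out


# Illustrative data only: 2 tokens, model dim 3, intermediate dim 4.
X = [[1, -2, 3], [0, 4, -1]]
W_UP = [[1, 0, -1, 2], [2, -1, 0, 1], [0, 3, 1, -2]]
W_DOWN = [[1, 2], [-1, 0], [3, 1], [0, -2]]

if __name__ == "__main__":
    expected = mlp_reference(X, W_UP, W_DOWN)
    for tp_size in (1, 2, 4):
        assert mlp_tensor_parallel(X, W_UP, W_DOWN, tp_size) == expected
    print("sharded MLP matches the reference for tp_size in (1, 2, 4)")
```

The split works because the activation is elementwise, so each rank can apply it to its own slice of the intermediate dimension and only one reduction is needed at the end. Layers that stay replicated in Phase 1 need no such reduction.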
Thanks for the iterations; the direction is right. But the implementation side still has too many open questions to lock this in as a committed design doc: the scope of which operators actually change, CUDA Graph capture under TP, and how the NCCL group gets set up are all unresolved here. Let's open an RFC issue and hash these out there before merging a design note. Could you open one (linked to #446) so we can discuss the implementation details properly?
Thanks for the suggestion. I agree that I moved a bit too quickly toward a design note before the implementation details were fully clarified. I’ll open an RFC issue and link it to #446, so we can properly discuss the operator scope, CUDA Graph capture under TP, and NCCL group setup before locking in the design.
Description
Refs #446
Drafts the Qwen3.5 TP2 design as a Qwen3 TP follow-up rather than a new parallel-runtime proposal.
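The doc scopes Qwen3.5 tensor parallelism around reusing the existing Qwen3 controller/worker TP runtime, then splits the work into two phases:
- Phase 1 validates dense full-attn/MLP TP with replicated linear/GDR.
- Phase 2 handles sharded linear attention / GDR state, using the vLLM Qwen3Next/GDN contract as the reference.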
If this direction looks acceptable and the doc is merged, I plan to open follow-up issues from the phase breakdown in the doc rather than starting with a large implementation PR. I would especially appreciate feedback from the model/runtime owners and maintainers on the phase split, non-goals, acceptance criteria, and the vLLM reference contract before turning the design into implementation tasks.
Type of Change
Checklist
Docs-only; no runtime behavior, kernel code, scheduler code, or tests are changed.