docs(qwen35): draft TP2 phased design #450

Open: Mrtroll486 wants to merge 3 commits into openinfer-project:main from Mrtroll486:docs/qwen35-tp-design

Conversation

@Mrtroll486

Contributor

Description

Refs #446

Drafts the Qwen3.5 TP2 design as a Qwen3 TP follow-up rather than a new parallel-runtime proposal.

The doc scopes Qwen3.5 tensor parallelism around reusing the existing Qwen3 controller/worker TP runtime, then splits the work into two phases:

  • Phase 1: shard full-attention + MLP while keeping linear attention/GDR replicated, so the dense TP path and Qwen3.5 multi-rank runtime can be validated first.
  • Phase 2: shard linear attention, conv state, GDR recurrent state, and GDR kernels, using vLLM's Qwen3Next/GDN TP contract as the reference.
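The phase split above can be read as a placement table: which components are sharded across ranks and which stay replicated in each phase. The following is a minimal sketch of that reading, not code from the design doc or the runtime. The `Component`, `Phase`, and `Placement` types and the `placement` function are hypothetical names introduced for illustration, and treating conv state and GDR recurrent state as replicated in Phase 1 is an assumption drawn from "keeping linear attention/GDR replicated".

```rust
/// Hypothetical component categories taken from the phase description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Component {
    FullAttention,
    Mlp,
    LinearAttention,
    ConvState,
    GdrRecurrentState,
}

/// The two phases proposed in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    One,
    Two,
}

/// Whether a component is split across TP ranks or copied to every rank.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Placement {
    Sharded,
    Replicated,
}

/// Illustrative mapping of (phase, component) to placement.
/// Assumption: everything on the linear-attention/GDR side stays
/// replicated in Phase 1 and becomes sharded in Phase 2.
pub fn placement(phase: Phase, component: Component) -> Placement {
    match (phase, component) {
        // Dense path is sharded in both phases.
        (_, Component::FullAttention | Component::Mlp) => Placement::Sharded,
        // Phase 1 keeps the linear-attention/GDR side replicated.
        (Phase::One, _) => Placement::Replicated,
        // Phase 2 shards the remaining components.
        (Phase::Two, _) => Placement::Sharded,
    }
}
```

For example, `placement(Phase::One, Component::ConvState)` yields `Placement::Replicated`, while the same component in `Phase::Two` yields `Placement::Sharded`.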

If this direction looks acceptable and the doc is merged, I plan to open follow-up issues from the phase breakdown in the doc rather than starting with a large implementation PR. I would especially appreciate feedback from the model/runtime owners and maintainers on the phase split, non-goals, acceptance criteria, and the vLLM reference contract before turning the design into implementation tasks.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Checklist

  • My code follows the style guidelines of this project (see docs/conventions/coding-style.md).
  • I have performed a self-review of my own code.
  • I have formatted my commits according to Commitizen conventions.
  • I have run the local test suite and all tests pass (see CLAUDE.md).

Docs-only; no runtime behavior, kernel code, scheduler code, or tests are changed.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1433c552b


Comment thread docs/models/qwen35/tp-design.md Outdated
@xiaguan

xiaguan commented Jun 25, 2026

Collaborator

Direction looks good — reusing the Qwen3 runtime and splitting dense vs GDR/recurrent is the right call. Two pieces of feedback:

1. Tighten it. The doc is longer than it needs to be. The six separate non-goal lists and the repeated per-phase acceptance lists carry a lot of redundancy — a reader should get the phase split and the partition contract in about half the length.

2. Don't hard-lock TP=2. Qwen3's TP config is already degree-parametric (Config::local_num_attention_heads(tp), local_q_dim(tp), local_intermediate_size(tp), ...), so write the partition contract as formulas in tp rather than baking in 8 / 2048 / 4608. Make "only TP2 is validated first" a test-scope note and fail-closed on indivisible degrees — instead of "do not support TP>2" as an architectural non-goal. Full-attn (16 q / 4 KV heads) is divisibility-clean through TP4. It's the same code, just keeps the door open and matches what Qwen3 already does.
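To make the second point concrete, here is a rough sketch of a degree-parametric partition contract that fails closed on indivisible degrees. It is not the project's actual `Config`: the `ModelDims`, `LocalDims`, and `TpError` types and the `local_dims` function are hypothetical. Only the 16 q heads and 4 KV heads come from the comment above. The `head_dim` of 256 and `intermediate_size` of 9216 used in the usage note are placeholder assumptions, picked so that `tp = 2` yields the 8 / 2048 / 4608 figures; mapping those figures to these fields is itself an assumption.

```rust
/// Hypothetical global model dimensions (field names are illustrative).
#[derive(Debug, Clone, Copy)]
pub struct ModelDims {
    pub num_attention_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub intermediate_size: usize,
}

/// Per-rank sizes derived from the global dimensions and the TP degree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocalDims {
    pub local_num_attention_heads: usize,
    pub local_num_kv_heads: usize,
    pub local_q_dim: usize,
    pub local_intermediate_size: usize,
}

/// Reasons a candidate TP degree is rejected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TpError {
    ZeroDegree,
    Indivisible { what: &'static str, value: usize, tp: usize },
}

/// Divide `value` evenly across `tp` ranks, or refuse (fail closed).
fn split_even(what: &'static str, value: usize, tp: usize) -> Result<usize, TpError> {
    if tp == 0 {
        return Err(TpError::ZeroDegree);
    }
    if value % tp != 0 {
        return Err(TpError::Indivisible { what, value, tp });
    }
    Ok(value / tp)
}

/// Compute per-rank sizes as formulas in `tp` instead of hard-coded constants.
pub fn local_dims(dims: &ModelDims, tp: usize) -> Result<LocalDims, TpError> {
    let heads = split_even("num_attention_heads", dims.num_attention_heads, tp)?;
    let kv_heads = split_even("num_kv_heads", dims.num_kv_heads, tp)?;
    let intermediate = split_even("intermediate_size", dims.intermediate_size, tp)?;
    Ok(LocalDims {
        local_num_attention_heads: heads,
        local_num_kv_heads: kv_heads,
        local_q_dim: heads * dims.head_dim,
        local_intermediate_size: intermediate,
    })
}
```

With the placeholder dimensions above, `local_dims(&dims, 2)` and `local_dims(&dims, 4)` both succeed, while `local_dims(&dims, 8)` returns `TpError::Indivisible` for `num_kv_heads` because 4 KV heads cannot be split eight ways. Under this framing, "only TP2 is validated first" becomes a test-scope decision rather than something encoded in the formulas.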

@Mrtroll486

Contributor Author

Thanks for the review. I tightened the design doc and changed the TP framing to match the Qwen3 runtime better.

What changed:

  • Collapsed the repeated non-goal / acceptance sections into shorter Boundaries and
    validation-scope notes.
  • Rewrote the partition contract in terms of tp formulas instead of hard-coding TP2
    local sizes.
  • Kept TP=2 only as the first validation target, not as an architectural limit.
  • Added fail-closed divisibility requirements for candidate TP degrees.
  • Kept the Qwen3.5-specific q/gate head-pair sharding requirement.
  • Updated the docs index summary to match the new degree-parametric framing.

I left the implementation plan split intact: Phase 1 validates dense full-attn/MLP TP with replicated linear/GDR, and Phase 2 handles sharded linear attention / GDR state using the vLLM Qwen3Next/GDN contract as the reference.

If this is merged, I’ll open the two follow-up implementation issues myself for Phase 1 and Phase 2.

@xiaguan

xiaguan commented Jun 25, 2026

Collaborator

Thanks for the iterations — the direction is right. But the implementation side still has too many open questions to lock this in as a committed design doc: the scope of which operators actually change, CUDA Graph capture under TP, and how the NCCL group gets set up are all unresolved here.

Let's open an RFC issue and hash these out there before merging a design note. Could you open one (linked to #446) so we can discuss the implementation details properly?

@Mrtroll486

Contributor Author

Thanks for the suggestion. I agree that I moved a bit too quickly toward a design note before the implementation details were fully clarified.

I’ll open an RFC issue and link it to #446, so we can properly discuss the operator scope, CUDA Graph capture under TP, and NCCL group setup before locking in the design.
