
[Feature]: AFD Decoupled Deployment + Separate Attn/FFN Weights with Asymmetric Flexible Allocation #195

@jiangkuaixue123


🚀 The feature, motivation and pitch

This feature enables decoupled AFD deployment, where the weights of the Attention (Attn) and Feed-Forward Network (FFN) modules are deployed on separate instances, with an asymmetric and flexible ratio allocation between the two (as sketched below).
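As a rough illustration of what an asymmetric, flexible ratio could mean in practice, the sketch below splits a GPU pool between attention and FFN instances by a configurable Attn:FFN ratio. All names here (`AFDDeployment`, `attn_ratio`, `ffn_ratio`) are hypothetical, not an existing vLLM API.

```python
# Hypothetical sketch: asymmetric Attn:FFN instance allocation.
from dataclasses import dataclass

@dataclass
class AFDDeployment:
    total_gpus: int
    attn_ratio: int   # e.g. 3 in a 3:1 Attn:FFN split
    ffn_ratio: int    # e.g. 1

    def split(self) -> tuple[int, int]:
        """Divide the GPU pool between attention and FFN instances by ratio."""
        unit = self.total_gpus // (self.attn_ratio + self.ffn_ratio)
        attn_gpus = unit * self.attn_ratio
        ffn_gpus = self.total_gpus - attn_gpus
        return attn_gpus, ffn_gpus

# 16 GPUs at a 3:1 ratio -> 12 attention GPUs, 4 FFN GPUs.
print(AFDDeployment(total_gpus=16, attn_ratio=3, ffn_ratio=1).split())
```

Because the ratio is a deployment-time parameter, the Attn:FFN split can be retuned per model and workload without changing either module's code.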

Motivation

In large-scale Mixture-of-Experts (MoE) inference, exploiting sparsity during the decoding phase requires continually expanding expert parallelism (EP). Previous designs (e.g., DeepEP [1]) improve throughput at scale but scale poorly because EP shards are placed across data-parallel (DP) ranks. Recently, Attention–FFN disaggregation (AFD) has been proposed by ByteDance [2], StepFun [3], and Huawei [4].

The underlying rationale is clear: the attention phase is memory-bound, while the FFN/expert phase is compute-bound, so a single homogeneous deployment cannot optimize both at once. Module-wise heterogeneous placement therefore becomes increasingly advantageous as EP continues to scale.

AFD introduces extra and frequent communication between Attention and FFN instances, so communication efficiency becomes a key factor for end-to-end throughput. In particular, for MoE models, two schemes can be used for A2F (Attention-to-FFN) and F2A (FFN-to-Attention) data transfer; they differ in whether the EP-related token dispatch/combine logic is coupled into the A2F/F2A transfer itself.
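To make the distinction concrete, here is a minimal single-process sketch of both schemes, assuming tokens are `(token_id, expert_id)` pairs: in the coupled scheme the attention side performs the EP dispatch before the A2F transfer, while in the decoupled scheme A2F/F2A is a plain batch hand-off and the FFN side runs dispatch/combine internally. Function names and data layout are illustrative, not the actual design.

```python
# Minimal simulation of the two A2F/F2A schemes. Real systems move
# tensors over RDMA/NCCL; this sketch only models the routing logic.

def scheme_a_coupled(tokens, num_experts):
    """Scheme A: dispatch/combine is coupled into A2F/F2A.
    The attention side groups tokens per expert, so each FFN rank
    receives only the tokens routed to its experts."""
    # A2F: attention side performs the EP dispatch before sending.
    per_expert = {e: [] for e in range(num_experts)}
    for tok, expert in tokens:
        per_expert[expert].append(tok)
    # Expert compute on the FFN side; F2A restores the original order.
    outputs = {}
    for expert, toks in per_expert.items():
        for tok in toks:
            outputs[tok] = f"expert{expert}({tok})"
    return [outputs[tok] for tok, _ in tokens]

def scheme_b_decoupled(tokens, num_experts):
    """Scheme B: A2F/F2A is a plain point-to-point transfer; the FFN
    side runs dispatch/combine internally (e.g. via its own EP group)."""
    # A2F: ship the whole batch unrouted.
    batch = list(tokens)
    # FFN side: local dispatch ...
    per_expert = {e: [t for t, x in batch if x == e] for e in range(num_experts)}
    # ... expert compute, then local combine before F2A.
    outputs = {tok: f"expert{e}({tok})" for e, toks in per_expert.items() for tok in toks}
    return [outputs[tok] for tok, _ in batch]

toks = [(0, 1), (1, 0), (2, 1), (3, 2)]
assert scheme_a_coupled(toks, 3) == scheme_b_decoupled(toks, 3)
```

The assert only checks that both schemes compute the same result; in practice they trade off payload size (scheme B ships the full batch) against where the dispatch/combine cost lands.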


Architecture Design:

High-level design:


  • Deploy a global AFDConnector Manager to discover attention and FFN instances. The coordinator is deployed at startup and waits for registration from each attention or FFN instance. It maintains the IPs, GPU ranks, etc. of the instances, which are used to set up communication groups or P2P connections.
  • Each vLLM worker deploys an AFDConnector Worker that orchestrates communication resources and registers the necessary information with the Coordinator.
  • The Manager collects the local expert information from each Worker and broadcasts it back to all Workers, so that after the startup phase every Worker holds the location of all physical experts. A minimal sketch of this handshake follows the list.
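A minimal sketch of the registration and broadcast flow described above, assuming an in-process coordinator; a real implementation would register over RPC (e.g. ZMQ/gRPC) and use the returned placement to build NCCL groups or P2P connections. The class and method names mirror the terms in this issue but are otherwise hypothetical.

```python
# Sketch of the AFDConnector Manager/Worker handshake (bookkeeping only).

class AFDConnectorManager:
    """Global coordinator: discovers instances and broadcasts
    the physical-expert placement to every worker."""
    def __init__(self):
        self.registry = []       # one entry per registered worker
        self.expert_map = {}     # expert_id -> (ip, rank)

    def register(self, role, ip, rank, local_experts=()):
        self.registry.append({"role": role, "ip": ip, "rank": rank})
        for expert_id in local_experts:
            self.expert_map[expert_id] = (ip, rank)

    def broadcast_placement(self):
        # After startup, every worker holds the full expert map.
        return dict(self.expert_map)

class AFDConnectorWorker:
    """Per-vLLM-worker connector: registers with the Manager and
    keeps the global expert placement for routing decisions."""
    def __init__(self, role, ip, rank, local_experts=()):
        self.role, self.ip, self.rank = role, ip, rank
        self.local_experts = tuple(local_experts)
        self.global_placement = {}

    def join(self, manager):
        manager.register(self.role, self.ip, self.rank, self.local_experts)

mgr = AFDConnectorManager()
attn = AFDConnectorWorker("attention", "10.0.0.1", rank=0)
ffn = AFDConnectorWorker("ffn", "10.0.0.2", rank=1, local_experts=(0, 1, 2, 3))
for w in (attn, ffn):
    w.join(mgr)
for w in (attn, ffn):
    w.global_placement = mgr.broadcast_placement()
print(attn.global_placement)   # {0: ('10.0.0.2', 1), ...}
```

Keeping the Manager purely as a rendezvous/bookkeeping service, as in this sketch, means the hot data path (the A2F/F2A transfers) never goes through it.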

Alternatives

No response

Additional context

No response

