
[Feature]: AFD Decoupled Deployment + Separate Attn/FFN Weights with Asymmetric Flexible Allocation #195

@jiangkuaixue123


🚀 The feature, motivation and pitch

This feature enables decoupled AFD deployment, where the weights of the Attention (Attn) and Feed-Forward Network (FFN) modules are deployed on separate instances, with an asymmetric and flexible ratio allocation between the two (as sketched below).
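As a rough illustration of what an asymmetric, flexible ratio could mean in practice, the sketch below splits a GPU pool between attention and FFN instances by a configurable Attn:FFN ratio. All names here (`AFDDeployment`, `attn_ratio`, `ffn_ratio`) are hypothetical, not an existing vLLM API.

```python
# Hypothetical sketch: asymmetric Attn:FFN instance allocation.
from dataclasses import dataclass

@dataclass
class AFDDeployment:
    total_gpus: int
    attn_ratio: int   # e.g. 3 in a 3:1 Attn:FFN split
    ffn_ratio: int    # e.g. 1

    def split(self) -> tuple[int, int]:
        """Divide the GPU pool between attention and FFN instances by ratio."""
        unit = self.total_gpus // (self.attn_ratio + self.ffn_ratio)
        attn_gpus = unit * self.attn_ratio
        ffn_gpus = self.total_gpus - attn_gpus
        return attn_gpus, ffn_gpus

# 16 GPUs at a 3:1 ratio -> 12 attention GPUs, 4 FFN GPUs.
print(AFDDeployment(total_gpus=16, attn_ratio=3, ffn_ratio=1).split())
```

Because the ratio is a deployment-time parameter, the Attn:FFN split can be retuned per model and workload without changing either module's code.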

Motivation

In large-scale Mixture-of-Experts (MoE) inference, exploiting sparsity during the decoding phase requires continually expanding expert parallelism (EP). Previous designs (e.g., DeepEP [1]) improve throughput at scale but scale poorly because EP shards are placed across data-parallel (DP) ranks. Recently, Attention–FFN disaggregation (AFD) has been proposed by ByteDance [2], StepFun [3], and Huawei [4].

The underlying rationale is clear: the attention phase is memory-bound, while the FFN/expert phase is compute-bound, so a single homogeneous deployment cannot optimize both at once. Module-wise heterogeneous placement therefore becomes increasingly advantageous as EP continues to scale.

AFD introduces extra and frequent communication between Attention and FFN instances, so communication efficiency becomes a key factor for end-to-end throughput. In particular, for MoE models, two schemes can be used for A2F (Attention-to-FFN) and F2A (FFN-to-Attention) data transfer; they differ in whether the EP-related token dispatch/combine logic is coupled into the A2F/F2A transfer itself.
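To make the distinction concrete, here is a minimal single-process sketch of both schemes, assuming tokens are `(token_id, expert_id)` pairs: in the coupled scheme the attention side performs the EP dispatch before the A2F transfer, while in the decoupled scheme A2F/F2A is a plain batch hand-off and the FFN side runs dispatch/combine internally. Function names and data layout are illustrative, not the actual design.

```python
# Minimal simulation of the two A2F/F2A schemes. Real systems move
# tensors over RDMA/NCCL; this sketch only models the routing logic.

def scheme_a_coupled(tokens, num_experts):
    """Scheme A: dispatch/combine is coupled into A2F/F2A.
    The attention side groups tokens per expert, so each FFN rank
    receives only the tokens routed to its experts."""
    # A2F: attention side performs the EP dispatch before sending.
    per_expert = {e: [] for e in range(num_experts)}
    for tok, expert in tokens:
        per_expert[expert].append(tok)
    # Expert compute on the FFN side; F2A restores the original order.
    outputs = {}
    for expert, toks in per_expert.items():
        for tok in toks:
            outputs[tok] = f"expert{expert}({tok})"
    return [outputs[tok] for tok, _ in tokens]

def scheme_b_decoupled(tokens, num_experts):
    """Scheme B: A2F/F2A is a plain point-to-point transfer; the FFN
    side runs dispatch/combine internally (e.g. via its own EP group)."""
    # A2F: ship the whole batch unrouted.
    batch = list(tokens)
    # FFN side: local dispatch ...
    per_expert = {e: [t for t, x in batch if x == e] for e in range(num_experts)}
    # ... expert compute, then local combine before F2A.
    outputs = {tok: f"expert{e}({tok})" for e, toks in per_expert.items() for tok in toks}
    return [outputs[tok] for tok, _ in batch]

toks = [(0, 1), (1, 0), (2, 1), (3, 2)]
assert scheme_a_coupled(toks, 3) == scheme_b_decoupled(toks, 3)
```

The assert only checks that both schemes compute the same result; in practice they trade off payload size (scheme B ships the full batch) against where the dispatch/combine cost lands.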


Architecture Design:

High-level design:


  • Deploy a global AFDConnector Manager to discover attention and FFN instances. The coordinator is deployed at startup and waits for registration from each attention or FFN instance. It maintains the IPs, GPU ranks, etc. of the instances, which are used to set up communication groups or P2P connections.
  • Each vLLM worker deploys an AFDConnector Worker that orchestrates communication resources and registers the necessary information with the Coordinator.
  • The Manager collects the local expert information from each Worker and broadcasts it back to all Workers, so that after the startup phase every Worker holds the location of all physical experts. A minimal sketch of this handshake follows the list.
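A minimal sketch of the registration and broadcast flow described above, assuming an in-process coordinator; a real implementation would register over RPC (e.g. ZMQ/gRPC) and use the returned placement to build NCCL groups or P2P connections. The class and method names mirror the terms in this issue but are otherwise hypothetical.

```python
# Sketch of the AFDConnector Manager/Worker handshake (bookkeeping only).

class AFDConnectorManager:
    """Global coordinator: discovers instances and broadcasts
    the physical-expert placement to every worker."""
    def __init__(self):
        self.registry = []       # one entry per registered worker
        self.expert_map = {}     # expert_id -> (ip, rank)

    def register(self, role, ip, rank, local_experts=()):
        self.registry.append({"role": role, "ip": ip, "rank": rank})
        for expert_id in local_experts:
            self.expert_map[expert_id] = (ip, rank)

    def broadcast_placement(self):
        # After startup, every worker holds the full expert map.
        return dict(self.expert_map)

class AFDConnectorWorker:
    """Per-vLLM-worker connector: registers with the Manager and
    keeps the global expert placement for routing decisions."""
    def __init__(self, role, ip, rank, local_experts=()):
        self.role, self.ip, self.rank = role, ip, rank
        self.local_experts = tuple(local_experts)
        self.global_placement = {}

    def join(self, manager):
        manager.register(self.role, self.ip, self.rank, self.local_experts)

mgr = AFDConnectorManager()
attn = AFDConnectorWorker("attention", "10.0.0.1", rank=0)
ffn = AFDConnectorWorker("ffn", "10.0.0.2", rank=1, local_experts=(0, 1, 2, 3))
for w in (attn, ffn):
    w.join(mgr)
for w in (attn, ffn):
    w.global_placement = mgr.broadcast_placement()
print(attn.global_placement)   # {0: ('10.0.0.2', 1), ...}
```

Keeping the Manager purely as a rendezvous/bookkeeping service, as in this sketch, means the hot data path (the A2F/F2A transfers) never goes through it.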

Alternatives

No response

Additional context

No response

