
[Feature] Expert parallelism support #1435

Open
chongli-uw opened this issue Sep 16, 2024 · 1 comment

chongli-uw commented Sep 16, 2024

Motivation

Hi team,
First of all, thank you so much for such a great project. Is there a plan to support Expert Parallelism for MoE models?

Related resources

https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html

Ying1123 added the enhancement (New feature or request) label Sep 16, 2024

merrymercy (Contributor) commented:

# Standard imports for this snippet. The tensor-parallel helpers
# (get_tensor_model_parallel_rank/world_size, tensor_model_parallel_all_reduce),
# ReplicatedLinear, QuantizationConfig, and MixtralMLP are provided by the
# surrounding model and parallel-utility modules.
from typing import Optional

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import MixtralConfig


class MixtralMoE(nn.Module):
    def __init__(
        self,
        config: MixtralConfig,
        quant_config: Optional[QuantizationConfig] = None,
    ):
        super().__init__()
        self.config = config
        self.rank = get_tensor_model_parallel_rank()
        self.tp_size = get_tensor_model_parallel_world_size()
        self.num_total_experts = config.num_local_experts
        self.top_k = config.num_experts_per_tok
        if self.tp_size > self.num_total_experts:
            raise ValueError(
                f"Tensor parallel size {self.tp_size} is greater than "
                f"the number of experts {self.num_total_experts}."
            )
        # Split experts equally between ranks.
        self.expert_indices = np.array_split(
            range(self.num_total_experts), self.tp_size
        )[self.rank].tolist()
        if not self.expert_indices:
            raise ValueError(f"Rank {self.rank} has no experts assigned to it.")
        # Only the experts owned by this rank are instantiated; the rest are None.
        self.experts = nn.ModuleList(
            [
                (
                    MixtralMLP(
                        self.num_total_experts,
                        config.hidden_size,
                        config.intermediate_size,
                        quant_config=quant_config,
                    )
                    if idx in self.expert_indices
                    else None
                )
                for idx in range(self.num_total_experts)
            ]
        )
        # The router (gate) is replicated on every rank.
        self.gate = ReplicatedLinear(
            config.hidden_size, self.num_total_experts, bias=False, quant_config=None
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        router_logits, _ = self.gate(hidden_states)
        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
        routing_weights, selected_experts = torch.topk(
            routing_weights, self.top_k, dim=-1
        )
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # Each rank computes only its local experts; contributions from experts
        # owned by other ranks are gathered via the all-reduce at the end.
        final_hidden_states = None
        for expert_idx in self.expert_indices:
            expert_layer = self.experts[expert_idx]
            expert_mask = selected_experts == expert_idx
            expert_weights = (routing_weights * expert_mask).sum(dim=-1, keepdim=True)
            current_hidden_states = expert_layer(hidden_states).mul_(expert_weights)
            if final_hidden_states is None:
                final_hidden_states = current_hidden_states
            else:
                final_hidden_states.add_(current_hidden_states)
        return tensor_model_parallel_all_reduce(final_hidden_states)

This is an early example: each rank only instantiates the experts assigned to it, and the per-rank partial outputs are combined with a tensor-parallel all-reduce.
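
For intuition, here is a minimal standalone sketch of the expert partitioning used in the snippet above; the expert count and tensor-parallel size are hypothetical values chosen for illustration:

import numpy as np

# Hypothetical setup: 8 experts split across a tensor-parallel group of size 4.
num_total_experts = 8
tp_size = 4

for rank in range(tp_size):
    # Same partitioning rule as in MixtralMoE.__init__ above.
    expert_indices = np.array_split(range(num_total_experts), tp_size)[rank].tolist()
    print(f"rank {rank} owns experts {expert_indices}")

# Output:
# rank 0 owns experts [0, 1]
# rank 1 owns experts [2, 3]
# rank 2 owns experts [4, 5]
# rank 3 owns experts [6, 7]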
