Closed
Description
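DeepSeek V3 was trained with the 20250217 version of Megatron. In that version TP and EP are decoupled, so MoE layers are no longer forced to use the same TP as the Attention layers. The three nested loops in the conversion script's save step, however, still force TP onto the experts. As a result an extra ep_size/tp_size multiple of parameters is saved, and the final ckpt takes up far too much storage.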
for tp_rank in range(args.tensor_model_parallel_size):
    for ep_rank in range(args.expert_model_parallel_size):
        for pp_rank in range(args.pipeline_model_parallel_size):
            ...  # per-rank save logic (not quoted in the issue)
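To make the loop structure concrete, here is a minimal sketch that only counts how many (tp_rank, ep_rank, pp_rank) combinations the nested loops visit, compared with a loop in which experts iterate over their own tensor-parallel size. The helper names, the `etp_size` parameter, and the example sizes are assumptions for illustration and are not taken from the conversion script. The sketch does not model what each iteration writes, so it does not reproduce the storage ratio reported above.

```python
from itertools import product


def coupled_save_ranks(tp_size, ep_size, pp_size):
    """Rank tuples visited by the three nested loops quoted in the issue."""
    return list(product(range(tp_size), range(ep_size), range(pp_size)))


def decoupled_expert_save_ranks(etp_size, ep_size, pp_size):
    """Hypothetical alternative: experts loop over their own TP size (etp_size)."""
    return list(product(range(etp_size), range(ep_size), range(pp_size)))


# Example sizes are made up; etp_size=1 means experts are not tensor-parallel.
tp_size, ep_size, pp_size, etp_size = 4, 8, 2, 1

coupled = coupled_save_ranks(tp_size, ep_size, pp_size)
decoupled = decoupled_expert_save_ranks(etp_size, ep_size, pp_size)

print(len(coupled), len(decoupled))  # prints: 64 16
```

With these made-up sizes the coupled loop touches expert weights in 64 iterations, while a loop driven by the experts' own TP size needs 16. Whether the actual fix should take this shape depends on how the script splits and names the expert shards.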