[BUG]: raise ValueError("No parameters found in moe_dp_group, please consider using HybridParallelPlugin instead") ValueError: No parameters found in moe_dp_group, please consider using HybridParallelPlugin instead #6237
Labels: bug
Is there an existing issue for this bug?
The bug has not been fixed in the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
When I fine-tune R1 with the lora_finetune script, the error above is raised. Why are the MoE parameters not detected?
Training environment: 64 NPUs (910B, 32 GB), torch_npu==2.2.0
Command:
colossalai run --nproc_per_node 8 lora_finetune.py --pretrained /cache/DeepSeek-R1-671B-BF16/ --dataset lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 4 --pp 16 --tp 1 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir DeepSeek-R1-bf16-lora --host ${VC_WORKER_HOSTS} --master_addr ${VC_WORKER_HOSTS%%,} --master_port 13123
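A possible cause, offered only as an assumption from the symptom, is that the LoRA adapters are attached to non-expert linear layers only, so once the base weights are frozen no trainable parameter remains for the MoE data-parallel group to shard. Below is a minimal diagnostic sketch; the `count_trainable_moe_params` helper and the `experts` name filter are hypothetical (they assume a DeepSeek-style naming convention where expert weights live under `mlp.experts`), not part of ColossalAI:

```python
import torch.nn as nn

def count_trainable_moe_params(model: nn.Module, expert_keyword: str = "experts") -> int:
    """Count trainable parameters inside MoE expert modules.

    expert_keyword is an assumption about the checkpoint's module naming
    convention; adjust it to match the actual model.
    """
    total = 0
    for name, param in model.named_parameters():
        # Only parameters that are both inside an expert module and trainable
        # would end up in the MoE data-parallel group.
        if expert_keyword in name and param.requires_grad:
            total += param.numel()
    return total

# Usage sketch: run after applying LoRA but before booster.boost(...).
# if count_trainable_moe_params(model) == 0:
#     print("No trainable MoE parameters: the moe plugin would have nothing to "
#           "place in moe_dp_group. Either include the expert projections among "
#           "the LoRA target modules or use HybridParallelPlugin, as the error suggests.")
```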
Environment
No response