[megatron, model] feat: qwen3.5 example #5381
ISEEKYAN wants to merge 2 commits into verl-project:main
Code Review
This pull request adds support for Qwen3.5 SFT with Megatron. The changes are mostly workarounds and fixes to support the Qwen3.5 architecture, particularly its Gated Delta Net (GDN) and chat template requirements. The changes look reasonable and well-commented, improving compatibility and robustness. I have one major concern about catching a broad Exception which could hide bugs.
return_tensors="pt",
**apply_chat_template_kwargs,
)
except (jinja2.exceptions.TemplateError, Exception) as e:
Catching a generic Exception is risky as it can suppress unexpected errors, making debugging difficult. It's better to catch more specific exceptions. Since jinja2.exceptions.TemplateError is a subclass of Exception, the tuple (jinja2.exceptions.TemplateError, Exception) is redundant and equivalent to except Exception:. Please replace Exception with the specific exception type(s) that are expected to contain the 'No user query' message. If the exact type is unknown, consider catching a narrower set of exceptions like ValueError or TypeError which are common for such issues.
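A minimal sketch of the narrower handler suggested above. It assumes the 'No user query' error is raised by the chat template as jinja2.exceptions.TemplateError, which has not been verified against this PR; render_with_fallback and its two callable arguments are hypothetical names used only for illustration.

```python
try:
    from jinja2.exceptions import TemplateError
except ImportError:
    # Stand-in so the sketch still runs where jinja2 is not installed.
    class TemplateError(Exception):
        """Placeholder for jinja2.exceptions.TemplateError."""


# TemplateError derives from Exception, so a tuple that also lists
# Exception behaves exactly like a bare `except Exception:`.
assert issubclass(TemplateError, Exception)


def render_with_fallback(render, fallback):
    """Call render(); use fallback() only for the expected template error.

    Both arguments are hypothetical zero-argument callables.
    """
    try:
        return render()
    except TemplateError as e:
        # Assumption: the expected failure carries this message.
        if "No user query" not in str(e):
            raise  # unrelated template errors stay visible
        return fallback()
```

With this shape, errors of any other type (for example a KeyError from malformed input) propagate unchanged instead of being silently routed to the fallback path.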
@ISEEKYAN Does this PR also support RL?
I just updated the PR with a script that supports RL. However, it is not easy to set up the right vLLM dependency at the moment 🥲
Many thanks. Regarding "the vllm qwen3.5 during initialization, need to be fixed": what is the issue with vLLM's Qwen3.5 initialization? The vLLM docs indicate that vLLM can indeed serve Qwen3.5 (https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html).
Successfully ran Qwen3.5 SFT (verl Megatron example) with the following setup:
(1) mbridge: install from source for qwen3_5 support.
(2) megatron-core == 0.16.0: required for attention_output_gate and other GDN options.
(3) verl patch in verl/models/mcore/patch.py: applies the gate-slicing fix when [...]
Key library versions used:
What does this PR do?
Thanks to @LiuXTao's great work on ISEEKYAN/mbridge#83, mbridge now supports Qwen3.5.
This PR builds on that mbridge support to run Qwen3.5 SFT on verl.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (this will be checked by the CI).
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward. Multiple modules are listed like [megatron, fsdp, doc].
{type} is one of feat, fix, refactor, chore, test.
For a breaking change, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
See examples/sft/gsm8k/run_qwen3_5_megatron.sh and examples/grpo_trainer/run_qwen3_5-35b-megatron.sh.
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Run the pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If the PR touches the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.