Qwen2-VL: implement M-RoPE in the language model (mirror of #239) by mnmly · Pull Request #345 · ml-explore/mlx-swift-lm

mnmly · 2026-06-13T14:08:47Z

PR #239 fixed exactly this for Qwen2.5-VL but left Qwen2-VL untouched. This is pretty much a straight port of that fixes into Qwen2VL.swift; the LM spatial-position scheme is identical between the two models.

I've added explicit inline-comments referring which parts of Qwen2.5-VL M-RoPE adaptations are ported here, but I will clean it all once this PR is reviewed and ready for merge.

) The Qwen2-VL language model applied plain 1-D RoPE with offset = cache.offset and never built the 3-D multimodal position IDs Qwen2-VL requires; the applyMultimodalRotaryPositionEmbedding helper was dead code. Image tokens therefore received sequential positions and generated tokens landed ~max(grid) positions too high, so spatial/coordinate output drifted — the error grew with generation length (early tokens fine, later tokens skewed). Token identity was unaffected, so this surfaced as correct text in wrong/short bounding boxes for layout-style tasks (e.g. MinerU2.5). PR ml-explore#239 fixed exactly this for Qwen2.5-VL but left Qwen2-VL untouched. This is a direct port of that fix into Qwen2VL.swift; the LM spatial-position scheme is identical between the two models. Each ported site carries a `// ml-explore#239: Qwen25VL <symbol>` marker. Provenance map (Qwen25VL.swift): positionIdsKey / ropeDeltasKey ............... :13-14 Attention.mropeSectionRaw / _invFreq ......... :54 / :58 Attention.mropeCosSin ........................ :108 Attention.callAsFunction positionIds branch .. :144 Qwen25VLDecoderLayer.callAsFunction .......... :208 Qwen25Model.callAsFunction ................... :240 LanguageModel.callAsFunction state+delta ..... :279-316 inputEmbeddings .............................. :941 prepare(_:cache:windowSize:) ................. :982 Model.callAsFunction(_:LMInput.Text,...) ..... :1184 getRopeIndex (verbatim) ...................... :1027-1182 The three coupled fixes match ml-explore#239: MROPE section layout via slice-replacement (mropeCosSin/mropeSectionRaw), rope_deltas reconstruction for decode, and _invFreq with a leading underscore so the weight loader skips it. Verified against the Python mlx-vlm reference: decoded layout boxes now match to the coordinate (previously a paragraph box came out one text-line short).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345

Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345
mnmly wants to merge 1 commit into
ml-explore:mainfrom
mnmly:qwen2vl-mrope

mnmly commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mnmly commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant