Skip to content

Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345

Open
mnmly wants to merge 1 commit into
ml-explore:mainfrom
mnmly:qwen2vl-mrope
Open

Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345
mnmly wants to merge 1 commit into
ml-explore:mainfrom
mnmly:qwen2vl-mrope

Conversation

@mnmly

@mnmly mnmly commented Jun 13, 2026

Copy link
Copy Markdown

PR #239 fixed exactly this for Qwen2.5-VL but left Qwen2-VL untouched. This is pretty much a straight port of that fixes into Qwen2VL.swift; the LM spatial-position scheme is identical between the two models.

I've added explicit inline-comments referring which parts of Qwen2.5-VL M-RoPE adaptations are ported here, but I will clean it all once this PR is reviewed and ready for merge.

)

The Qwen2-VL language model applied plain 1-D RoPE with offset = cache.offset
and never built the 3-D multimodal position IDs Qwen2-VL requires; the
applyMultimodalRotaryPositionEmbedding helper was dead code. Image tokens
therefore received sequential positions and generated tokens landed ~max(grid)
positions too high, so spatial/coordinate output drifted — the error grew with
generation length (early tokens fine, later tokens skewed). Token identity was
unaffected, so this surfaced as correct text in wrong/short bounding boxes for
layout-style tasks (e.g. MinerU2.5).

PR ml-explore#239 fixed exactly this for Qwen2.5-VL but left Qwen2-VL untouched. This is a
direct port of that fix into Qwen2VL.swift; the LM spatial-position scheme is
identical between the two models. Each ported site carries a
`// ml-explore#239: Qwen25VL <symbol>` marker. Provenance map (Qwen25VL.swift):

  positionIdsKey / ropeDeltasKey ............... :13-14
  Attention.mropeSectionRaw / _invFreq ......... :54 / :58
  Attention.mropeCosSin ........................ :108
  Attention.callAsFunction positionIds branch .. :144
  Qwen25VLDecoderLayer.callAsFunction .......... :208
  Qwen25Model.callAsFunction ................... :240
  LanguageModel.callAsFunction state+delta ..... :279-316
  inputEmbeddings .............................. :941
  prepare(_:cache:windowSize:) ................. :982
  Model.callAsFunction(_:LMInput.Text,...) ..... :1184
  getRopeIndex (verbatim) ...................... :1027-1182

The three coupled fixes match ml-explore#239: MROPE section layout via slice-replacement
(mropeCosSin/mropeSectionRaw), rope_deltas reconstruction for decode, and
_invFreq with a leading underscore so the weight loader skips it.

Verified against the Python mlx-vlm reference: decoded layout boxes now match
to the coordinate (previously a paragraph box came out one text-line short).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant