Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345
Open
mnmly wants to merge 1 commit into
Open
Qwen2-VL: implement M-RoPE in the language model (mirror of #239)#345mnmly wants to merge 1 commit into
mnmly wants to merge 1 commit into
Conversation
) The Qwen2-VL language model applied plain 1-D RoPE with offset = cache.offset and never built the 3-D multimodal position IDs Qwen2-VL requires; the applyMultimodalRotaryPositionEmbedding helper was dead code. Image tokens therefore received sequential positions and generated tokens landed ~max(grid) positions too high, so spatial/coordinate output drifted — the error grew with generation length (early tokens fine, later tokens skewed). Token identity was unaffected, so this surfaced as correct text in wrong/short bounding boxes for layout-style tasks (e.g. MinerU2.5). PR ml-explore#239 fixed exactly this for Qwen2.5-VL but left Qwen2-VL untouched. This is a direct port of that fix into Qwen2VL.swift; the LM spatial-position scheme is identical between the two models. Each ported site carries a `// ml-explore#239: Qwen25VL <symbol>` marker. Provenance map (Qwen25VL.swift): positionIdsKey / ropeDeltasKey ............... :13-14 Attention.mropeSectionRaw / _invFreq ......... :54 / :58 Attention.mropeCosSin ........................ :108 Attention.callAsFunction positionIds branch .. :144 Qwen25VLDecoderLayer.callAsFunction .......... :208 Qwen25Model.callAsFunction ................... :240 LanguageModel.callAsFunction state+delta ..... :279-316 inputEmbeddings .............................. :941 prepare(_:cache:windowSize:) ................. :982 Model.callAsFunction(_:LMInput.Text,...) ..... :1184 getRopeIndex (verbatim) ...................... :1027-1182 The three coupled fixes match ml-explore#239: MROPE section layout via slice-replacement (mropeCosSin/mropeSectionRaw), rope_deltas reconstruction for decode, and _invFreq with a leading underscore so the weight loader skips it. Verified against the Python mlx-vlm reference: decoded layout boxes now match to the coordinate (previously a paragraph box came out one text-line short).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR #239 fixed exactly this for
Qwen2.5-VLbut leftQwen2-VLuntouched. This is pretty much a straight port of that fixes intoQwen2VL.swift; the LM spatial-position scheme is identical between the two models.I've added explicit inline-comments referring which parts of Qwen2.5-VL M-RoPE adaptations are ported here, but I will clean it all once this PR is reviewed and ready for merge.