
Conversation


@wtomin wtomin commented Nov 5, 2025

  • Add:

    1. model:
      ColQwen2ForRetrieval
      ColQwen2PreTrainedModel
      ColQwen2Processor

    2. fast UT: three fast UTs pass under PyNative mode for [fp32, bf16, fp16]

  • Comments
    Since this PR also edits Qwen2_VL, its fast UTs were re-verified as well: all six pass under PyNative mode for [fp32, bf16, fp16].

It is better to validate this together with PR #1421.

  • Usage:

The ColQwen2 example below is adapted from the Transformers documentation page:

import requests
import mindspore as ms
from PIL import Image
from time import time
from mindone.transformers import ColQwen2ForRetrieval, ColQwen2Processor

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"
s_time = time()
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    mindspore_dtype=ms.bfloat16, 
    attn_implementation="eager",
)
processor = ColQwen2Processor.from_pretrained(model_name)
print(f"weight time: {time() - s_time}s")
# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images)
inputs_text = processor(text=queries)
# Convert the text inputs to MindSpore tensors (depending on the processor's
# return type, the image inputs may need the same conversion)
inputs_text = {k: ms.Tensor(v) for k, v in inputs_text.items()}
# Forward pass
s_time = time()
with ms._no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print(f"inference time: {time() - s_time}s")
print("Retrieval scores (query x image):")
print(scores)

torch bf16 output:

Retrieval scores (query x image):
tensor([[16.5000,   8.5000],
        [ 9.3750,  16.5000]], dtype=torch.bfloat16)

mindspore bf16 output:

Retrieval scores (query x image):
[[16.500000,   8.812500],
 [ 9.375000,  16.375000]]
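
For reference, score_retrieval computes ColBERT-style late-interaction (MaxSim) scores between the multi-vector query and image embeddings: each query token is matched against its best image token, and the maxima are summed over the query tokens. A minimal NumPy sketch of that scoring rule (illustrative only; maxsim_scores and its argument layout are assumptions, not the mindone API):

import numpy as np

def maxsim_scores(query_embeddings, image_embeddings):
    # query_embeddings: list of (num_query_tokens, dim) arrays
    # image_embeddings: list of (num_image_tokens, dim) arrays
    # Returns a (num_queries, num_images) score matrix.
    scores = np.zeros((len(query_embeddings), len(image_embeddings)))
    for i, q in enumerate(query_embeddings):
        for j, d in enumerate(image_embeddings):
            sim = q @ d.T  # token-level similarities, (num_query_tokens, num_image_tokens)
            # MaxSim: best-matching image token per query token, summed over query tokens
            scores[i, j] = sim.max(axis=1).sum()
    return scores
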
  • Performance
    Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in PyNative mode.

    model                  weight loading time    inference speed
    ColQwen2ForRetrieval   53.65 s                14.18 s/it

@wtomin wtomin requested a review from vigo999 as a code owner November 5, 2025 02:20
@gemini-code-assist

Summary of Changes

Hello @wtomin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the model capabilities by integrating the ColQwen2 model, which is designed for advanced document retrieval. By leveraging Vision-Language Models, ColQwen2 processes both visual and textual information from documents to generate multi-vector embeddings, streamlining the retrieval process and offering a robust alternative to traditional OCR-based methods. The changes include the core model implementation, a dedicated processor, and necessary updates to the framework's utility functions to support this new architecture.

Highlights

  • New Model Integration: Introduced the ColQwen2ForRetrieval model, ColQwen2PreTrainedModel, and ColQwen2Processor to mindone.transformers, enabling efficient document retrieval using Vision-Language Models (VLMs).
  • Performance Benchmarking: Initial performance metrics for ColQwen2ForRetrieval on Ascend Atlas 800T A2 machines show a weight loading time of 53.65s and an inference speed of 14.18s/it.
  • Testing and Validation: Fast unit tests for the new ColQwen2 model have passed under pynative mode for fp32, bf16, and fp16 data types, ensuring basic functionality and precision.
  • Refactored Weight Tying Logic: The modeling_utils.py file has been updated to refactor the tie_weights method into tie_embeddings_and_encoder_decoder and a new recursive tie_weights method, improving modularity and clarity for weight management.
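
For readers unfamiliar with the weight-tying refactor mentioned in the last highlight, the usual shape of such a split is a per-module hook plus a recursive driver. The sketch below is illustrative only (TiedModelSketch and the parameter assignment are assumptions, not the actual mindone.transformers.modeling_utils code in this PR):

import mindspore.nn as nn

class TiedModelSketch(nn.Cell):
    def __init__(self, config, embed_tokens: nn.Embedding, lm_head: nn.Dense):
        super().__init__()
        self.config = config
        self.embed_tokens = embed_tokens
        self.lm_head = lm_head

    def tie_embeddings_and_encoder_decoder(self):
        # Per-module work: share the input embedding table with the output
        # projection when the config requests tied word embeddings.
        if getattr(self.config, "tie_word_embeddings", False):
            self.lm_head.weight = self.embed_tokens.embedding_table

    def tie_weights(self):
        # Recursive driver: run the per-module hook on this cell and every
        # sub-cell that defines it, so composite models (e.g. a retrieval
        # wrapper around a VLM backbone) are handled uniformly.
        for _, cell in self.cells_and_names():
            if hasattr(cell, "tie_embeddings_and_encoder_decoder"):
                cell.tie_embeddings_and_encoder_decoder()
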


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the ColQwen2 model, a vision-language model for retrieval. The implementation of the model and its processor appears to be correct, and the related changes in modeling_utils.py and qwen2_vl/modeling_qwen2_vl.py are positive refactorings that improve the code's structure. However, the accompanying test file for ColQwen2 is incomplete as it does not test the model's multimodal capabilities. Specifically, the vision pathway is not exercised, and the test input setup contains inconsistencies. It is crucial to add comprehensive tests for the vision aspect to ensure the model's correctness.

Comment on lines +125 to +143
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values = config_and_inputs
    input_ids = ids_numpy([self.batch_size, self.seq_length], self.vocab_size)
    attention_mask = np.ones(input_ids.shape, dtype=np.int64)

    input_ids[:, -1] = self.pad_token_id
    input_ids[input_ids == self.video_token_id] = self.pad_token_id
    input_ids[input_ids == self.image_token_id] = self.pad_token_id
    input_ids[input_ids == self.vision_start_token_id] = self.pad_token_id
    input_ids[:, self.num_image_tokens] = self.image_token_id
    input_ids[:, self.num_image_tokens - 1] = self.vision_start_token_id
    inputs_dict = {
        # "pixel_values": pixel_values,
        "image_grid_thw": np.array([[1, 1, 1]] * self.batch_size),
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict

Severity: high

The test setup in prepare_config_and_inputs_for_common is insufficient for testing the multimodal capabilities of ColQwen2ForRetrieval.

  1. The pixel_values are commented out in inputs_dict, which means the vision part of the model is not being tested. This is a critical omission for a vision-language model.
  2. If pixel_values were to be uncommented, its shape (total_patches, patch_dim) as prepared by prepare_config_and_inputs is incorrect. The model's construct method expects a padded batch of shape (batch_size, max_patches_per_image, patch_dim).
  3. The image_grid_thw is hardcoded to [[1, 1, 1]]. This implies 1 patch per image, which is inconsistent with image_size=224 and patch_size=14, a combination that produces 16x16 = 256 patches per image.

Please update the test to correctly prepare the inputs and thoroughly test the vision pathway of the model.
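
A possible shape-consistent setup for the points above, sketched with NumPy (the exact patch_dim and image-token count depend on the vision config, so the numbers here are placeholders rather than the values the test should hard-code):

import numpy as np

batch_size = 2
image_size, patch_size = 224, 14
grid_h = grid_w = image_size // patch_size    # 16
num_patches = grid_h * grid_w                 # 256 patches per image
patch_dim = 3 * patch_size * patch_size       # placeholder; real value depends on channels/temporal patch size

# Padded batch of patches: (batch_size, max_patches_per_image, patch_dim)
pixel_values = np.random.randn(batch_size, num_patches, patch_dim).astype(np.float32)

# image_grid_thw must agree with the patch count: t * h * w == num_patches
image_grid_thw = np.array([[1, grid_h, grid_w]] * batch_size)

inputs_dict = {
    "pixel_values": pixel_values,
    "image_grid_thw": image_grid_thw,
    # input_ids / attention_mask as in the snippet above; the number of image
    # placeholder tokens should also match the patch count divided by the
    # spatial merge factor of the vision tower.
}
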

@wtomin wtomin changed the title feat(transformers): Add ColQwen2 feat(transformers): Add ColQwen2 (v4.54.1) Nov 7, 2025
@wtomin wtomin added the "new model" label (add new model to mindone) Nov 13, 2025