
Conversation


@wtomin wtomin commented Nov 5, 2025

  • Add:

    1. model:
      ColQwen2ForRetrieval
      ColQwen2PreTrainedModel
      ColQwen2Processor

    2. fast UT: three fast UTs pass under PyNative mode for [fp32, bf16, fp16]

  • Comments
    Since this PR also edits Qwen2_VL, its fast UTs were re-verified as well: all six pass under PyNative mode for [fp32, bf16, fp16].

It is better to validate this together with PR #1421.

  • Usage:

The ColQwen2 example below is adapted from the Transformers documentation page:

import requests
import mindspore as ms
from PIL import Image
from time import time
from mindone.transformers import ColQwen2ForRetrieval, ColQwen2Processor

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"
s_time = time()
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    mindspore_dtype=ms.bfloat16, 
    attn_implementation="eager",
)
processor = ColQwen2Processor.from_pretrained(model_name)
print(f"weight time: {time() - s_time}s")
# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images)
inputs_text = processor(text=queries)
# Convert the text inputs to MindSpore tensors (depending on the processor's
# return type, the image inputs may need the same conversion)
inputs_text = {k: ms.Tensor(v) for k, v in inputs_text.items()}
# Forward pass
s_time = time()
with ms._no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print(f"inference time: {time() - s_time}s")
print("Retrieval scores (query x image):")
print(scores)

torch bf16 output:

Retrieval scores (query x image):
tensor([[16.5000,   8.5000],
        [ 9.3750,  16.5000]], dtype=torch.bfloat16)

mindspore bf16 output:

Retrieval scores (query x image):
[[16.500000,   8.812500],
 [ 9.375000,  16.375000]]
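
For reference, score_retrieval computes ColBERT-style late-interaction (MaxSim) scores between the multi-vector query and image embeddings: each query token is matched against its best image token, and the maxima are summed over the query tokens. A minimal NumPy sketch of that scoring rule (illustrative only; maxsim_scores and its argument layout are assumptions, not the mindone API):

import numpy as np

def maxsim_scores(query_embeddings, image_embeddings):
    # query_embeddings: list of (num_query_tokens, dim) arrays
    # image_embeddings: list of (num_image_tokens, dim) arrays
    # Returns a (num_queries, num_images) score matrix.
    scores = np.zeros((len(query_embeddings), len(image_embeddings)))
    for i, q in enumerate(query_embeddings):
        for j, d in enumerate(image_embeddings):
            sim = q @ d.T  # token-level similarities, (num_query_tokens, num_image_tokens)
            # MaxSim: best-matching image token per query token, summed over query tokens
            scores[i, j] = sim.max(axis=1).sum()
    return scores
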
  • Performance
    Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in PyNative mode.

    model                  weight loading time    inference speed
    ColQwen2ForRetrieval   53.65 s                14.18 s/it

@wtomin wtomin requested a review from vigo999 as a code owner November 5, 2025 02:20
@gemini-code-assist

Summary of Changes

Hello @wtomin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the model capabilities by integrating the ColQwen2 model, which is designed for advanced document retrieval. By leveraging Vision-Language Models, ColQwen2 processes both visual and textual information from documents to generate multi-vector embeddings, streamlining the retrieval process and offering a robust alternative to traditional OCR-based methods. The changes include the core model implementation, a dedicated processor, and necessary updates to the framework's utility functions to support this new architecture.

Highlights

  • New Model Integration: Introduced the ColQwen2ForRetrieval model, ColQwen2PreTrainedModel, and ColQwen2Processor to mindone.transformers, enabling efficient document retrieval using Vision-Language Models (VLMs).
  • Performance Benchmarking: Initial performance metrics for ColQwen2ForRetrieval on Ascend Atlas 800T A2 machines show a weight loading time of 53.65s and an inference speed of 14.18s/it.
  • Testing and Validation: Fast unit tests for the new ColQwen2 model have passed under pynative mode for fp32, bf16, and fp16 data types, ensuring basic functionality and precision.
  • Refactored Weight Tying Logic: The modeling_utils.py file has been updated to refactor the tie_weights method into tie_embeddings_and_encoder_decoder and a new recursive tie_weights method, improving modularity and clarity for weight management.
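
For readers unfamiliar with the weight-tying refactor mentioned in the last highlight, the usual shape of such a split is a per-module hook plus a recursive driver. The sketch below is illustrative only (TiedModelSketch and the parameter assignment are assumptions, not the actual mindone.transformers.modeling_utils code in this PR):

import mindspore.nn as nn

class TiedModelSketch(nn.Cell):
    def __init__(self, config, embed_tokens: nn.Embedding, lm_head: nn.Dense):
        super().__init__()
        self.config = config
        self.embed_tokens = embed_tokens
        self.lm_head = lm_head

    def tie_embeddings_and_encoder_decoder(self):
        # Per-module work: share the input embedding table with the output
        # projection when the config requests tied word embeddings.
        if getattr(self.config, "tie_word_embeddings", False):
            self.lm_head.weight = self.embed_tokens.embedding_table

    def tie_weights(self):
        # Recursive driver: run the per-module hook on this cell and every
        # sub-cell that defines it, so composite models (e.g. a retrieval
        # wrapper around a VLM backbone) are handled uniformly.
        for _, cell in self.cells_and_names():
            if hasattr(cell, "tie_embeddings_and_encoder_decoder"):
                cell.tie_embeddings_and_encoder_decoder()
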


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the ColQwen2 model, a vision-language model for retrieval. The implementation of the model and its processor appears to be correct, and the related changes in modeling_utils.py and qwen2_vl/modeling_qwen2_vl.py are positive refactorings that improve the code's structure. However, the accompanying test file for ColQwen2 is incomplete as it does not test the model's multimodal capabilities. Specifically, the vision pathway is not exercised, and the test input setup contains inconsistencies. It is crucial to add comprehensive tests for the vision aspect to ensure the model's correctness.

Comment on lines +125 to +143
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values = config_and_inputs
    input_ids = ids_numpy([self.batch_size, self.seq_length], self.vocab_size)
    attention_mask = np.ones(input_ids.shape, dtype=np.int64)

    input_ids[:, -1] = self.pad_token_id
    input_ids[input_ids == self.video_token_id] = self.pad_token_id
    input_ids[input_ids == self.image_token_id] = self.pad_token_id
    input_ids[input_ids == self.vision_start_token_id] = self.pad_token_id
    input_ids[:, self.num_image_tokens] = self.image_token_id
    input_ids[:, self.num_image_tokens - 1] = self.vision_start_token_id
    inputs_dict = {
        # "pixel_values": pixel_values,
        "image_grid_thw": np.array([[1, 1, 1]] * self.batch_size),
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict

Severity: high

The test setup in prepare_config_and_inputs_for_common is insufficient for testing the multimodal capabilities of ColQwen2ForRetrieval.

  1. The pixel_values are commented out in inputs_dict, which means the vision part of the model is not being tested. This is a critical omission for a vision-language model.
  2. If pixel_values were to be uncommented, its shape (total_patches, patch_dim) as prepared by prepare_config_and_inputs is incorrect. The model's construct method expects a padded batch of shape (batch_size, max_patches_per_image, patch_dim).
  3. The image_grid_thw is hardcoded to [[1, 1, 1]]. This implies 1 patch per image, which is inconsistent with image_size=224 and patch_size=14, a combination that produces 16x16 = 256 patches per image.

Please update the test to correctly prepare the inputs and thoroughly test the vision pathway of the model.
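
A possible shape-consistent setup for the points above, sketched with NumPy (the exact patch_dim and image-token count depend on the vision config, so the numbers here are placeholders rather than the values the test should hard-code):

import numpy as np

batch_size = 2
image_size, patch_size = 224, 14
grid_h = grid_w = image_size // patch_size    # 16
num_patches = grid_h * grid_w                 # 256 patches per image
patch_dim = 3 * patch_size * patch_size       # placeholder; real value depends on channels/temporal patch size

# Padded batch of patches: (batch_size, max_patches_per_image, patch_dim)
pixel_values = np.random.randn(batch_size, num_patches, patch_dim).astype(np.float32)

# image_grid_thw must agree with the patch count: t * h * w == num_patches
image_grid_thw = np.array([[1, grid_h, grid_w]] * batch_size)

inputs_dict = {
    "pixel_values": pixel_values,
    "image_grid_thw": image_grid_thw,
    # input_ids / attention_mask as in the snippet above; the number of image
    # placeholder tokens should also match the patch count divided by the
    # spatial merge factor of the vision tower.
}
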

@wtomin wtomin changed the title feat(transformers): Add ColQwen2 feat(transformers): Add ColQwen2 (v4.54.1) Nov 7, 2025
@wtomin wtomin added the "new model" label (add new model to mindone) Nov 13, 2025