feat(transformers): Add ColQwen2 (v4.54.1) #1414
Conversation
Summary of Changes

Hello @wtomin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the model capabilities by integrating the ColQwen2 model, which is designed for advanced document retrieval. By leveraging Vision-Language Models, ColQwen2 processes both visual and textual information from documents to generate multi-vector embeddings, streamlining the retrieval process and offering a robust alternative to traditional OCR-based methods. The changes include the core model implementation, a dedicated processor, and necessary updates to the framework's utility functions to support this new architecture.
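For readers unfamiliar with the multi-vector retrieval the summary describes, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, the mechanism ColPali-family retrievers such as ColQwen2 build on. The array shapes and function name are illustrative assumptions, not the model's actual dimensions or API:

```python
import numpy as np

# Late-interaction (MaxSim) scoring sketch: each query-token embedding is
# matched against its most similar document-patch embedding, and the best
# matches are summed into a single query-document score.
def maxsim_score(query_embeds: np.ndarray, doc_embeds: np.ndarray) -> float:
    # query_embeds: (num_query_tokens, dim); doc_embeds: (num_doc_patches, dim)
    sim = query_embeds @ doc_embeds.T        # (num_query_tokens, num_doc_patches)
    return float(sim.max(axis=1).sum())      # best doc patch per query token, summed

# Example with assumed sizes: 4 query tokens, 256 image patches, 128-dim embeddings
rng = np.random.default_rng(0)
score = maxsim_score(rng.standard_normal((4, 128)), rng.standard_normal((256, 128)))
```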
Code Review
This pull request introduces the ColQwen2 model, a vision-language model for retrieval. The implementation of the model and its processor appears to be correct, and the related changes in modeling_utils.py and qwen2_vl/modeling_qwen2_vl.py are positive refactorings that improve the code's structure. However, the accompanying test file for ColQwen2 is incomplete as it does not test the model's multimodal capabilities. Specifically, the vision pathway is not exercised, and the test input setup contains inconsistencies. It is crucial to add comprehensive tests for the vision aspect to ensure the model's correctness.
```python
def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values = config_and_inputs
    input_ids = ids_numpy([self.batch_size, self.seq_length], self.vocab_size)
    attention_mask = np.ones(input_ids.shape, dtype=np.int64)

    input_ids[:, -1] = self.pad_token_id
    input_ids[input_ids == self.video_token_id] = self.pad_token_id
    input_ids[input_ids == self.image_token_id] = self.pad_token_id
    input_ids[input_ids == self.vision_start_token_id] = self.pad_token_id
    input_ids[:, self.num_image_tokens] = self.image_token_id
    input_ids[:, self.num_image_tokens - 1] = self.vision_start_token_id
    inputs_dict = {
        # "pixel_values": pixel_values,
        "image_grid_thw": np.array([[1, 1, 1]] * self.batch_size),
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict
```
The test setup in `prepare_config_and_inputs_for_common` is insufficient for testing the multimodal capabilities of `ColQwen2ForRetrieval`:

- The `pixel_values` are commented out in `inputs_dict`, which means the vision part of the model is not being tested. This is a critical omission for a vision-language model.
- If `pixel_values` were to be uncommented, its shape `(total_patches, patch_dim)` as prepared by `prepare_config_and_inputs` is incorrect. The model's `construct` method expects a padded batch of shape `(batch_size, max_patches_per_image, patch_dim)`.
- The `image_grid_thw` is hardcoded to `[[1, 1, 1]]`. This implies 1 patch per image, which is inconsistent with `image_size=224` and `patch_size=14`, which would produce `16x16 = 256` patches.

Please update the test to correctly prepare the inputs and thoroughly test the vision pathway of the model; a sketch of one possible input setup follows this comment.
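To make the review's points concrete, here is a minimal sketch of how the vision inputs could be prepared consistently with the stated shapes. The helper name and the `patch_dim` value are assumptions for illustration, not the repository's actual test code:

```python
import numpy as np

def prepare_vision_inputs(batch_size, image_size=224, patch_size=14, patch_dim=1176):
    # image_size=224 with patch_size=14 gives a 16x16 grid, i.e. 256 patches per image
    grid_hw = image_size // patch_size
    patches_per_image = grid_hw * grid_hw

    # Padded batch of shape (batch_size, max_patches_per_image, patch_dim),
    # matching what the review says construct() expects. patch_dim=1176 is an
    # assumed flattened-patch dimension, not a value taken from the repo config.
    pixel_values = np.random.rand(batch_size, patches_per_image, patch_dim).astype(np.float32)

    # One (t, h, w) grid entry per image, consistent with the patch count above
    # (rather than the hardcoded [[1, 1, 1]]).
    image_grid_thw = np.array([[1, grid_hw, grid_hw]] * batch_size, dtype=np.int64)
    return pixel_values, image_grid_thw
```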
Add:

- model:
  - `ColQwen2ForRetrieval`
  - `ColQwen2PreTrainedModel`
  - `ColQwen2Processor`
- fast UT: passed three UTs under pynative mode for [fp32, bf16, fp16]
Comments
Since this PR also edits `Qwen2_VL`, the fast UTs of `Qwen2_VL` were verified as well: all six fast UTs of `Qwen2_VL` pass under pynative mode for [fp32, bf16, fp16]. It is better to validate this together with PR #1421.
The ColQwen2 example is adapted from the Transformers documentation page.
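For reference, a condensed version of that Transformers example looks roughly like the sketch below. The checkpoint name and the tiny in-memory images are placeholders, and the MindSpore port in this PR would swap in the corresponding MindSpore imports:

```python
import torch
from PIL import Image
from transformers import ColQwen2ForRetrieval, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0-hf"  # placeholder checkpoint name
model = ColQwen2ForRetrieval.from_pretrained(model_name, torch_dtype=torch.bfloat16)
processor = ColQwen2Processor.from_pretrained(model_name)

# Toy inputs: two blank images and two text queries
images = [Image.new("RGB", (128, 128), "white"), Image.new("RGB", (128, 128), "black")]
queries = ["What is shown in the first image?", "Describe the second image."]

batch_images = processor(images=images, return_tensors="pt")
batch_queries = processor(text=queries, return_tensors="pt")

# Embed both modalities, then score with the processor's late-interaction scorer
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

scores = processor.score_retrieval(query_embeddings, image_embeddings)
print("Retrieval scores (query x image):", scores)
```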
torch bf16 output:

```
Retrieval scores (query x image): tensor([[16.5000,  8.5000],
        [ 9.3750, 16.5000]], dtype=torch.bfloat16)
```

mindspore bf16 output:
Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in pynative mode.
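When comparing the torch and MindSpore bf16 scores above, a simple element-wise check with a bf16-appropriate tolerance is one way to quantify agreement. This is a sketch under stated assumptions; the tolerance value is illustrative, not a project standard:

```python
import numpy as np

# Torch bf16 reference scores from the example above
torch_scores = np.array([[16.5000, 8.5000], [9.3750, 16.5000]])

# bf16 carries roughly 3 decimal digits of precision, so a loose relative
# tolerance is appropriate when comparing against the MindSpore output.
def scores_match(ms_scores: np.ndarray, rtol: float = 5e-2) -> bool:
    return np.allclose(ms_scores, torch_scores, rtol=rtol)
```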