Conversation

@wcrzlh (Contributor)

@wcrzlh wcrzlh commented Nov 3, 2025

What does this PR do?

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the
    documentation guidelines
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (Preferably record these for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@wcrzlh wcrzlh requested a review from vigo999 as a code owner November 3, 2025 06:28
@gemini-code-assist (Contributor)

Summary of Changes

Hello @wcrzlh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the transformers library by integrating the Qwen3-Omni-MOE multimodal model, which unifies text, image, audio, and video processing. It introduces a suite of new utilities for handling diverse media inputs, alongside MindSpore-specific optimizations for model loading and data type management. The changes also update the auto-loading mechanisms within the framework to seamlessly support the new model and its specialized processors, ensuring a more comprehensive and efficient multimodal AI development experience.

Highlights

  • Qwen3-Omni-MOE Model Integration: Introduced the Qwen3-Omni-MOE model, a unified multimodal model capable of processing text, image, audio, and video inputs, along with its documentation and a quick start example.
  • Multimodal Processing Utilities: Added new utility functions for robust audio and vision processing, including smart_resize, fetch_image, fetch_video, smart_nframes, and process_mm_info to handle diverse media inputs.
  • MindSpore Model Loading and Dtype Management: Implemented MindSpore-specific enhancements for model loading, including direct safetensors loading via ms.load_checkpoint and dynamic patching of nn.Cell default dtypes to ensure correct precision during model initialization.
  • Expanded transformers Auto-Classes: Updated AutoModel, AutoProcessor, and introduced AutoVideoProcessor to automatically recognize and load Qwen3-Omni-MOE and its associated processors, streamlining model usage.
  • Enhanced Audio Feature Extraction: Integrated a new WhisperFeatureExtractor with support for dithering in spectrogram computation, improving audio processing robustness.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces the Qwen3-Omni-MoE model, a large multi-modal model. The changes are extensive, touching not only the model implementation and examples but also core components of the mindone.transformers and mindone.diffusers libraries. While the feature addition is significant, the PR contains several critical issues that must be addressed before merging. These include TODOs left in the code, commented-out functionality causing regressions, incorrect API usage that will lead to runtime errors, and significant performance bottlenecks due to workarounds for framework limitations. The example README.md also contains incorrect installation instructions. I have provided detailed comments on these critical and high-severity issues.

Comment on lines +455 to +456
# TODO check difference of "ms.tensor.scatter" and "ms.tensor.scatter_"
scores_processed = scores.scatter_(1, input_ids, score)

Severity: high

The code was changed to use the in-place scatter_ method, but a TODO comment was added, indicating uncertainty about the difference. Using an in-place operation can introduce side effects by modifying the input scores tensor, which might be unexpected by the caller. Please confirm if this in-place modification is intended and safe. If not, the original out-of-place scatter operation should be used. The TODO comment must be removed before merging.

Suggested change:
- # TODO check difference of "ms.tensor.scatter" and "ms.tensor.scatter_"
- scores_processed = scores.scatter_(1, input_ids, score)
+ scores_processed = scores.scatter(1, input_ids, score)
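To make the aliasing concern concrete, here is a minimal pure-Python sketch of the two scatter semantics. This is illustrative only (plain lists, not MindSpore tensors; `scatter`/`scatter_` are local stand-ins), but the in-place variant shows the same caller-visible side effect the review warns about.

```python
def scatter(scores, index, src):
    """Out-of-place: copy first, so the caller's data is untouched."""
    out = [row[:] for row in scores]
    for i, (j, v) in enumerate(zip(index, src)):
        out[i][j] = v
    return out

def scatter_(scores, index, src):
    """In-place: mutates `scores` and returns the same object."""
    for i, (j, v) in enumerate(zip(index, src)):
        scores[i][j] = v
    return scores

scores = [[0.0, 0.0], [0.0, 0.0]]
out = scatter(scores, [0, 1], [1.0, 1.0])
assert scores == [[0.0, 0.0], [0.0, 0.0]]  # caller's data unchanged

out2 = scatter_(scores, [0, 1], [1.0, 1.0])
assert out2 is scores                      # same object: side effect visible to caller
assert scores == [[1.0, 0.0], [0.0, 1.0]]
```

If any other logits processor later reads the original `scores`, the in-place version silently hands it already-modified values, which is why the out-of-place call is the safer default.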



Text generation Outputs:
Collaborator:

How about the audio output? Maybe we can attach the audio output as well.

@wcrzlh (Contributor Author) commented Nov 17, 2025


The audio output quality is good: it retells the text output and summarizes the audio.
Let me figure out how to attach the audio output.

causal_mask[i, :, :, j] = mask_function(batch_size[i], head_dim, cache_postion, kv_range[j].item())
else:
causal_mask = mint.zeros((q_len, kv_len), dtype=ms.bool_)
for i in range(kv_len):
Collaborator:

How is the efficiency of _vmap_patch?

Contributor Author:

If only the "k" axis is left non-vectorized, the performance is similar to "mindspore.vmap".
But sometimes padding_mask_func is used as part of mask_func, and if the batch-size axis is also left non-vectorized, the performance is slower than "mindspore.vmap". Right now I am looking for a better substitute for "mindspore.vmap".
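For context, the non-vectorized fallback under discussion boils down to evaluating the mask function once per position, which is exactly what mindspore.vmap would otherwise vectorize over whole axes. A simplified pure-Python sketch (illustrative names, not the PR's _vmap_patch code):

```python
def build_mask_loop(q_len, kv_len, mask_fn):
    """Per-position fallback: one mask_fn call per (q, k) pair.

    vmap evaluates mask_fn over entire axes at once; this loop version
    makes one call per element, so every extra axis left non-vectorized
    (e.g. batch) multiplies the call count, which is why it scales worse.
    """
    return [[mask_fn(q, k) for k in range(kv_len)] for q in range(q_len)]

# A causal mask: query position q may attend to key positions k <= q.
causal = build_mask_loop(4, 4, lambda q, k: k <= q)
assert causal[0] == [True, False, False, False]
assert causal[3] == [True, True, True, True]
```

This makes the cost model explicit: with the batch axis also looped, the call count grows from q_len * kv_len to batch * q_len * kv_len, matching the slowdown reported above.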
