Conversation

@wcrzlh (Contributor)

@wcrzlh wcrzlh commented Nov 3, 2025

What does this PR do?

Fixes # (issue)

Adds # (feature)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the
    documentation guidelines
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (Preferably record these for data loading, model inference, or training tasks.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@wcrzlh wcrzlh requested a review from vigo999 as a code owner November 3, 2025 06:28
@gemini-code-assist (Contributor)

Summary of Changes

Hello @wcrzlh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the transformers library by integrating the Qwen3-Omni-MOE multimodal model, which unifies text, image, audio, and video processing. It introduces a suite of new utilities for handling diverse media inputs, alongside MindSpore-specific optimizations for model loading and data type management. The changes also update the auto-loading mechanisms within the framework to seamlessly support the new model and its specialized processors, ensuring a more comprehensive and efficient multimodal AI development experience.

Highlights

  • Qwen3-Omni-MOE Model Integration: Introduced the Qwen3-Omni-MOE model, a unified multimodal model capable of processing text, image, audio, and video inputs, along with its documentation and a quick start example.
  • Multimodal Processing Utilities: Added new utility functions for robust audio and vision processing, including smart_resize, fetch_image, fetch_video, smart_nframes, and process_mm_info to handle diverse media inputs.
  • MindSpore Model Loading and Dtype Management: Implemented MindSpore-specific enhancements for model loading, including direct safetensors loading via ms.load_checkpoint and dynamic patching of nn.Cell default dtypes to ensure correct precision during model initialization.
  • Expanded transformers Auto-Classes: Updated AutoModel, AutoProcessor, and introduced AutoVideoProcessor to automatically recognize and load Qwen3-Omni-MOE and its associated processors, streamlining model usage.
  • Enhanced Audio Feature Extraction: Integrated a new WhisperFeatureExtractor with support for dithering in spectrogram computation, improving audio processing robustness.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces the Qwen3-Omni-MoE model, a large multi-modal model. The changes are extensive, touching not only the model implementation and examples but also core components of the mindone.transformers and mindone.diffusers libraries. While the feature addition is significant, the PR contains several critical issues that must be addressed before merging. These include TODOs left in the code, commented-out functionality causing regressions, incorrect API usage that will lead to runtime errors, and significant performance bottlenecks due to workarounds for framework limitations. The example README.md also contains incorrect installation instructions. I have provided detailed comments on these critical and high-severity issues.

Comment on lines +455 to +456
# TODO check difference of "ms.tensor.scatter" and "ms.tensor.scatter_"
scores_processed = scores.scatter_(1, input_ids, score)

Severity: high

The code was changed to use the in-place scatter_ method, but a TODO comment was added, indicating uncertainty about the difference. Using an in-place operation can introduce side effects by modifying the input scores tensor, which might be unexpected by the caller. Please confirm if this in-place modification is intended and safe. If not, the original out-of-place scatter operation should be used. The TODO comment must be removed before merging.

Suggested change:
- # TODO check difference of "ms.tensor.scatter" and "ms.tensor.scatter_"
- scores_processed = scores.scatter_(1, input_ids, score)
+ scores_processed = scores.scatter(1, input_ids, score)
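To make the aliasing concern concrete, here is a minimal pure-Python sketch of the two scatter semantics. This is illustrative only (plain lists, not MindSpore tensors; `scatter`/`scatter_` are local stand-ins), but the in-place variant shows the same caller-visible side effect the review warns about.

```python
def scatter(scores, index, src):
    """Out-of-place: copy first, so the caller's data is untouched."""
    out = [row[:] for row in scores]
    for i, (j, v) in enumerate(zip(index, src)):
        out[i][j] = v
    return out

def scatter_(scores, index, src):
    """In-place: mutates `scores` and returns the same object."""
    for i, (j, v) in enumerate(zip(index, src)):
        scores[i][j] = v
    return scores

scores = [[0.0, 0.0], [0.0, 0.0]]
out = scatter(scores, [0, 1], [1.0, 1.0])
assert scores == [[0.0, 0.0], [0.0, 0.0]]  # caller's data unchanged

out2 = scatter_(scores, [0, 1], [1.0, 1.0])
assert out2 is scores                      # same object: side effect visible to caller
assert scores == [[1.0, 0.0], [0.0, 1.0]]
```

If any other logits processor later reads the original `scores`, the in-place version silently hands it already-modified values, which is why the out-of-place call is the safer default.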



Text generation Outputs:
Collaborator:

How about the audio output? Maybe we can attach the audio output as well.

@wcrzlh (Contributor Author) commented Nov 17, 2025


The audio output quality is good: it retells the text output and summarizes the audio.
Let me figure out how to attach the audio output.

causal_mask[i, :, :, j] = mask_function(batch_size[i], head_dim, cache_postion, kv_range[j].item())
else:
causal_mask = mint.zeros((q_len, kv_len), dtype=ms.bool_)
for i in range(kv_len):
Collaborator:

How is the efficiency of _vmap_patch?

Contributor Author:

If only the "k" axis is left non-vectorized, the performance is similar to "mindspore.vmap".
But sometimes padding_mask_func is used as part of mask_func, and if the batch-size axis is also left non-vectorized, the performance is slower than "mindspore.vmap". Right now I am looking for a better substitute for "mindspore.vmap".
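For context, the non-vectorized fallback under discussion boils down to evaluating the mask function once per position, which is exactly what mindspore.vmap would otherwise vectorize over whole axes. A simplified pure-Python sketch (illustrative names, not the PR's _vmap_patch code):

```python
def build_mask_loop(q_len, kv_len, mask_fn):
    """Per-position fallback: one mask_fn call per (q, k) pair.

    vmap evaluates mask_fn over entire axes at once; this loop version
    makes one call per element, so every extra axis left non-vectorized
    (e.g. batch) multiplies the call count, which is why it scales worse.
    """
    return [[mask_fn(q, k) for k in range(kv_len)] for q in range(q_len)]

# A causal mask: query position q may attend to key positions k <= q.
causal = build_mask_loop(4, 4, lambda q, k: k <= q)
assert causal[0] == [True, False, False, False]
assert causal[3] == [True, True, True, True]
```

This makes the cost model explicit: with the batch axis also looped, the call count grows from q_len * kv_len to batch * q_len * kv_len, matching the slowdown reported above.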
