feat(transformers): add qwen3-omni model #1411
Open

wcrzlh wants to merge 55 commits into mindspore-lab:master from wcrzlh:vllm_patch
Commits (55)
All commits are authored by wcrzlh.

- c9fc2db  feat(transformers): add Qwen2VLImageProcessorFast/Qwen2VLVideoProcessor
- 7290e26  feat(transformers): add Qwen2VLImageProcessorFast/Qwen2VLVideoProcessor
- cf2d71c  feat(transformers): add Qwen2VLImageProcessorFast/Qwen2VLVideoProcessor
- 80d69f7  feat(transformers): add WhisperFeatureExtractor/qwen2vl videoprocesso…
- 7ca3e59  fix bugs
- 8228daf  feat(transformers): add autoprocessor for qwen2audio
- b04d93c  pre-commit
- d3e6689  feat(transformers): support qwen3-omni model
- 653c101  pre-commit
- 7c09e61  pre-commit
- 906a399  fix bugs
- e40e718  fix bugs
- 302b13f  fix bugs
- 99e5b70  fix split ops bugs
- 5cfc0bb  fix pad_sequence bugs
- 509bf1b  fix audio padded_mask bugs / supplement qwen_omni_utils
- ea36b19  fix list += bug / mask_scatter bug
- 6470463  fix linspace bug
- be04e46  fix bugs
- 41a1491  fix repeat bugs
- f9571bd  fix view bugs
- 8d354f7  fix view bugs
- 1ae995c  fix arange bugs
- b038159  fix arange bugs
- 7f5d9dd  fix arange bugs
- 52769e3  fix arange bugs
- 804ed81  fix arange bugs
- 4494189  fix arange bugs
- ba8b103  fix construct wrapper bugs
- 63b9f9c  fix slice index bugs
- 85e5ec8  fix hidden_states return bugs
- efbaa3f  fix hidden_states return bugs
- 308a5f6  fix hidden_states return bugs
- 6b10c90  fix mint.cat dtype bugs
- 1ebee09  fix tensor index bugs
- bf5e2ff  fix scatter bugs
- 6a1b803  fix bugs
- a17e8cc  fix bugs
- 9240ade  fix mint empty bugs
- 824d5dd  fix mint empty bugs
- c66d943  fix mint empty bugs
- 180818c  fix or_mask/and_mask bugs
- 2629a72  fix np.prod bugs
- 265ea64  fix qwen_omni_utils bugs
- 56d2624  fix load weight time
- 1c1d47b  fix load weight time
- dee9236  fix load weight time
- 95158bc  fix load weight time
- 82a6197  fix load weight time
- b7a88d8  add qwen3 omni ut and examples
- c704520  pre-commit
- 5662e36  rebase
- f06a68a  reformat
- 06ce942  reformat
- efbe2f5  supplement ut
# Qwen3-Omni

## Introduction
The Qwen3-Omni-MOE model is a unified multimodal model proposed in the Qwen3-Omni Technical Report by the Qwen team, Alibaba Group.

The abstract from the technical report is the following:

*We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.*

# Get Started

## Requirements:

| mindspore | ascend driver | firmware       | CANN toolkit/kernel |
|-----------|---------------|----------------|---------------------|
| 2.7.0     | 24.1.RC3.b080 | 7.5.T11.0.B088 | 8.1.RC1             |

### Installation:
```bash
git clone https://github.com/mindspore-lab/mindone.git
cd mindone
pip install -e .

pip install transformers==4.57.1

cd examples/transformers/qwen3_omni_moe
```

## **Notice**
Note that adjusting `min_pixels` and `max_pixels` trades off memory usage against accuracy. If you run into an OOM error, reduce `min_pixels` and `max_pixels` on the processor; see the sketch below.
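
As an illustrative sketch (not part of the original README), the pixel budget can be tightened when constructing the processor; the values below are assumptions, not recommendations from this PR, and should be tuned to your memory budget:

```python
from mindone.transformers import Qwen3OmniMoeProcessor

# Hypothetical smaller pixel budget to reduce memory pressure on OOM.
# 32*32 and 14*14*512 are illustrative values only.
min_pixels = 32 * 32
max_pixels = 14 * 14 * 512
processor = Qwen3OmniMoeProcessor.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```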

## Quick Start

Here is a usage example of Qwen3-Omni-30B-A3B-Instruct. You can run it with the following command:

```bash
# For the audio understanding task.
# To return only text, set `return_audio=False` in the script.
msrun --worker_num=2 --local_worker_num=2 --master_port=8118 \
    --log_dir=msrun_log --join=True --cluster_time_out=300 \
    omni_understanding.py
```
Give it a try with various images, audio clips, and prompts 🤗🤗.

Omni understanding sample script (set `return_audio=False` to return only the text result):
```python
from functools import partial

import numpy as np
import soundfile as sf
from qwen_omni_utils import process_mm_info

import mindspore as ms
import mindspore.mint.distributed as dist
from mindspore.communication import GlobalComm

from mindone.trainers.zero import prepare_network
from mindone.transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor

# set up card communication
dist.init_process_group(backend="hccl")
ms.set_auto_parallel_context(parallel_mode="data_parallel")

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    mindspore_dtype=ms.bfloat16,
    attn_implementation="flash_attention_2",
)

# shard the model across cards with ZeRO stage-3 parallelism
shard_fn = partial(prepare_network, zero_stage=3, optimizer_parallel_group=GlobalComm.WORLD_COMM_GROUP)
model = shard_fn(model)

min_pixels = 56 * 56
max_pixels = 14 * 14 * 768
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH, min_pixels=min_pixels, max_pixels=max_pixels)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."},
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="np",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

# Convert numpy inputs to MindSpore tensors: keep integer ids as int32, cast the rest to the model dtype
for key, value in inputs.items():
    if isinstance(value, np.ndarray):
        inputs[key] = ms.tensor(value)
        if inputs[key].dtype == ms.int64:
            inputs[key] = inputs[key].to(ms.int32)
        elif inputs[key].dtype != ms.int32:
            inputs[key] = inputs[key].to(model.dtype)

# Inference: generation of the output text (set return_audio=True to also get speech)
text_ids, audio = model.generate(
    **inputs,
    speaker="Ethan",
    thinker_return_dict_in_generate=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
    return_audio=False,
    talker_do_sample=False,
)

text = processor.batch_decode(
    text_ids.sequences[:, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).asnumpy(),
        samplerate=24000,
    )
```

Text generation output:
```
['The image displays four luxury cars-a Rolls-Royce, a Mercedes-Benz SUV, a Ferrari convertible and a Porsche 911-while the audio captures a person coughing.']
```

If `return_audio=True` is set, then in addition to the text output above, an audio clip describing the image and audio is also generated.
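
For completeness, the audio-returning variant might look like the following sketch; it assumes the same `model` and `inputs` as prepared in the script above, and `output_with_audio.wav` is an arbitrary filename:

```python
import soundfile as sf

# Sketch: request speech output as well, reusing `model`, `inputs`, and USE_AUDIO_IN_VIDEO from above.
text_ids, audio = model.generate(
    **inputs,
    speaker="Ethan",
    thinker_return_dict_in_generate=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
    return_audio=True,
    talker_do_sample=False,
)
if audio is not None:
    # 24 kHz mono waveform, matching the sample script.
    sf.write("output_with_audio.wav", audio.reshape(-1).asnumpy(), samplerate=24000)
```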

## Inference Speed

| model name                  | mindspore version | precision* | cards | model part | attention type | tokens/s |
|:---------------------------:|:-----------------:|:----------:|:-----:|:----------:|:--------------:|:--------:|
| Qwen3-Omni-30B-A3B-Instruct | 2.7.0             | bf16       | 2     | Thinker    | flash_attn     | 0.73     |
| Qwen3-Omni-30B-A3B-Instruct | 2.7.0             | bf16       | 2     | Talker     | flash_attn     | 0.88     |

examples/transformers/qwen3_omni_moe/omni_understanding.py (new file, 115 additions, 0 deletions)

```python
import argparse
from functools import partial

import numpy as np
import soundfile as sf
from qwen_omni_utils import process_mm_info

import mindspore as ms
import mindspore.mint.distributed as dist
from mindspore.communication import GlobalComm

from mindone.trainers.zero import prepare_network
from mindone.transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor


def generate(args):
    model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
        args.model_name,
        mindspore_dtype=ms.bfloat16,
        attn_implementation="flash_attention_2",
    )

    # use zero3 parallel
    shard_fn = partial(prepare_network, zero_stage=3, optimizer_parallel_group=GlobalComm.WORLD_COMM_GROUP)
    model.thinker = shard_fn(model.thinker)
    model.talker = shard_fn(model.talker)

    min_pixels = 56 * 56
    max_pixels = 14 * 14 * 768
    processor = Qwen3OmniMoeProcessor.from_pretrained(args.model_name, min_pixels=min_pixels, max_pixels=max_pixels)

    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": args.image},
                {"type": "audio", "audio": args.audio},
                {"type": "text", "text": args.prompt},
            ],
        },
    ]

    # Set whether to use audio in video
    USE_AUDIO_IN_VIDEO = True

    # Preparation for inference
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
    inputs = processor(
        text=text,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="np",
        padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
    )

    for key, value in inputs.items():
        if isinstance(value, np.ndarray):
            inputs[key] = ms.tensor(value)
            if inputs[key].dtype == ms.int64:
                inputs[key] = inputs[key].to(ms.int32)
            elif inputs[key].dtype != ms.int32:
                inputs[key] = inputs[key].to(model.dtype)

    # Inference: Generation of the output text and audio
    text_ids, audio = model.generate(
        **inputs,
        speaker="Ethan",
        thinker_return_dict_in_generate=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO,
        talker_do_sample=False,
    )

    text = processor.batch_decode(
        text_ids.sequences[:, inputs["input_ids"].shape[1] :],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(text)
    if audio is not None:
        sf.write(
            "output.wav",
            audio.reshape(-1).asnumpy(),
            samplerate=24000,
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Qwen3OmniMoE demo.")

    parser.add_argument("--prompt", type=str, default="What can you see and hear? Answer in one short sentence.")
    parser.add_argument(
        "--image",
        type=str,
        default="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg",
    )
    parser.add_argument(
        "--audio",
        type=str,
        default="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav",
    )
    parser.add_argument(
        "--model_name", type=str, default="Qwen/Qwen3-Omni-30B-A3B-Instruct", help="Path to the pre-trained model."
    )

    # Parse the arguments
    args = parser.parse_args()

    # set up card communication
    dist.init_process_group(backend="hccl")
    ms.set_auto_parallel_context(parallel_mode="data_parallel")

    generate(args)
```

examples/transformers/qwen3_omni_moe/qwen_omni_utils/__init__.py (new file, 8 additions, 0 deletions)

```python
from .audio_process import process_audio_info
from .vision_process import extract_vision_info, fetch_image, fetch_video, process_vision_info, smart_resize


def process_mm_info(conversations, use_audio_in_video, return_video_kwargs=False):
    audios = process_audio_info(conversations, use_audio_in_video)
    vision = process_vision_info(conversations, return_video_kwargs=return_video_kwargs)
    return (audios,) + vision
```
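
For context, here is a minimal usage sketch of `process_mm_info` (not part of the diff); it mirrors the README example and reuses the demo URLs that appear elsewhere in this PR:

```python
from qwen_omni_utils import process_mm_info

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear?"},
        ],
    },
]

# With return_video_kwargs=False (the default), the helper returns a 3-tuple.
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
```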

examples/transformers/qwen3_omni_moe/qwen_omni_utils/audio_process.py (new file, 55 additions, 0 deletions)

```python
import audioread
import av
import librosa
import numpy as np


def _check_if_video_has_audio(video_path):
    container = av.open(video_path)
    audio_streams = [stream for stream in container.streams if stream.type == "audio"]
    if not audio_streams:
        return False
    return True


def process_audio_info(conversations: list[dict], use_audio_in_video: bool):
    audios = []
    if isinstance(conversations[0], dict):
        conversations = [conversations]
    for conversation in conversations:
        for message in conversation:
            if not isinstance(message["content"], list):
                continue
            for ele in message["content"]:
                if ele["type"] == "audio":
                    if "audio" in ele:
                        path = ele["audio"]
                        # Check for in-memory waveforms first: np.ndarray has no startswith().
                        if isinstance(path, np.ndarray):
                            if path.ndim > 1:
                                raise ValueError("Support only mono audio")
                            audios.append(path)
                        elif path.startswith("http://") or path.startswith("https://"):
                            audios.append(librosa.load(audioread.ffdec.FFmpegAudioFile(path), sr=16000)[0])
                        elif path.startswith("file://"):
                            audios.append(librosa.load(path[len("file://") :], sr=16000)[0])
                        else:
                            audios.append(librosa.load(path, sr=16000)[0])
                    else:
                        raise ValueError("Unknown audio {}".format(ele))
                if use_audio_in_video and ele["type"] == "video":
                    if "video" in ele:
                        path = ele["video"]
                        assert _check_if_video_has_audio(
                            path
                        ), "Video must have an audio track when use_audio_in_video=True"
                        if path.startswith("http://") or path.startswith("https://"):
                            audios.append(librosa.load(audioread.ffdec.FFmpegAudioFile(path), sr=16000)[0])
                        elif path.startswith("file://"):
                            audios.append(librosa.load(path[len("file://") :], sr=16000)[0])
                        else:
                            audios.append(librosa.load(path, sr=16000)[0])
                    else:
                        raise ValueError("Unknown video {}".format(ele))
    if len(audios) == 0:
        audios = None
    return audios
```
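
Likewise, a minimal sketch of calling `process_audio_info` on its own (not part of the diff; the audio URL is the demo asset used in the example above):

```python
from qwen_omni_utils import process_audio_info

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What do you hear?"},
        ],
    },
]

# Returns a list of mono 16 kHz waveforms (numpy arrays), or None if no audio was found.
audios = process_audio_info(conversation, use_audio_in_video=False)
print(None if audios is None else [a.shape for a in audios])
```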
how about the audio output? maybe we can attach the audio output as well
The audio output quality is good. It retells the text output and summarizes the audio.
Let me figure out how to attach audio output.