Models: Qwen/Qwen3-ASR-1.7B and Qwen/Qwen3-ASR-0.6B
This document describes the model architecture, weight format, tokenizer layout,
and inference algorithm needed to implement Qwen3-ASR from scratch.
The Python reference implementation (python_simple_implementation.py) is the
executable version of this document.
Qwen3-ASR is a speech-to-text model with two main components:
- Audio Encoder (AuT): Conv2D downsampling + transformer encoder
- LLM Decoder (Qwen3): Standard Qwen3 transformer with Q/K norms and MRoPE
Pipeline:
WAV → 16kHz → Mel Spectrogram → Conv2D ×3 (8× downsample) → Transformer Encoder → Projector → Qwen3 Decoder → Tokens
| Parameter | 1.7B | 0.6B |
|---|---|---|
| Encoder d_model | 1024 | 896 |
| Encoder layers | 24 | 18 |
| Encoder heads | 16 | 14 |
| Encoder FFN dim | 4096 | 3584 |
| Encoder output_dim | 2048 | 1024 |
| Decoder hidden_size | 2048 | 1024 |
| Decoder layers | 28 | 28 |
| Decoder heads | 16 | 16 |
| Decoder KV heads | 8 | 8 |
| Decoder head_dim | 128 | 128 |
| Decoder intermediate | 6144 | 3072 |
| Vocab size | 151,936 | 151,936 |
| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Mel bins | 128 |
| Hop length | 160 samples (10ms) |
| Window size (n_fft) | 400 samples (25ms) |
| Frame rate | 100 Hz (before downsampling) |
| Token rate | 12.5 Hz (after 8× conv downsample) |
Exact mel computation (WhisperFeatureExtractor):
- Window: `hann(window_size=400)`
- STFT: `torch.stft(audio, n_fft=400, hop_length=160, window=window, return_complex=True)`
- Power: `magnitudes = stft[..., :-1].abs() ** 2` (drops last frame)
- Mel filter bank: Slaney-style, 128 bins, 0-8000 Hz; `mel_spec = mel_filters.T @ magnitudes`
- Log: `log_spec = log10(clamp(mel_spec, min=1e-10))`
- Dynamic range: `log_spec = max(log_spec, log_spec.max() - 8.0)`
- Normalize: `log_spec = (log_spec + 4.0) / 4.0`
Note: Unlike Voxtral which uses a fixed global_log_mel_max=1.5, Qwen3-ASR uses
the dynamic maximum of the spectrogram for clamping.
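A minimal PyTorch sketch of the steps above (assuming librosa is used to build the Slaney-style mel filter bank; illustrative, not the official WhisperFeatureExtractor code):

```python
import librosa
import torch

def log_mel_spectrogram(audio: torch.Tensor, sr=16000, n_fft=400, hop=160, n_mels=128):
    """audio: 1-D float waveform at 16 kHz. Returns [128, T] log-mel at 100 frames/s."""
    window = torch.hann_window(n_fft)
    stft = torch.stft(audio, n_fft, hop_length=hop, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2                       # drop last frame
    # Slaney-style mel filter bank, 128 bins, 0-8000 Hz (librosa defaults)
    mel_filters = torch.from_numpy(
        librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    ).float()                                                     # [128, 201]
    mel_spec = mel_filters @ magnitudes                           # [128, frames]
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)      # dynamic max, not fixed
    return (log_spec + 4.0) / 4.0
```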
CRITICAL: Per-chunk convolution. The encoder does NOT process the entire mel
spectrogram at once. It splits the mel into chunks of n_window*2 = 100 frames
and applies Conv2D independently per chunk. Each chunk of 100 frames produces
13 output tokens. The mel is NOT padded to 3000 frames.
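The 100 → 13 mapping follows from the standard Conv2D output-length formula with kernel 3, stride 2, padding 1; a quick illustrative check:

```python
# out_len = floor((in_len + 2*padding - kernel) / stride) + 1 with kernel=3, stride=2, padding=1
def conv_out_len(n: int) -> int:
    return (n - 1) // 2 + 1

frames = 100
for _ in range(3):
    frames = conv_out_len(frames)   # 100 -> 50 -> 25 -> 13
print(frames)                       # 13 tokens per 100-frame chunk
```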
CRITICAL: Windowed attention. The encoder uses windowed attention where tokens
can only attend within windows of tokens_per_chunk * (n_window_infer / chunk_size)
= 13 * (800/100) = 104 tokens. For audio longer than ~8 seconds, this creates
multiple attention windows. Tokens in different windows cannot attend to each other.
Per the paper (arXiv:2601.21337), the encoder uses "dynamic attention windows
ranging from 1s to 8s" during training. At inference, n_window_infer=800
(8 seconds) is used as the fixed window size.
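A sketch of the block-diagonal additive mask this implies (illustrative, assuming the fixed 104-token inference window computed above):

```python
import torch

def encoder_window_mask(n_tokens: int, window: int = 104) -> torch.Tensor:
    """Returns [n_tokens, n_tokens]: 0 within a window, -inf across windows."""
    ids = torch.arange(n_tokens) // window               # window index per token
    same_window = ids[:, None] == ids[None, :]
    mask = torch.zeros(n_tokens, n_tokens)
    mask[~same_window] = float("-inf")                   # added to attention scores
    return mask
```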
CRITICAL: Per-chunk position embeddings. Sinusoidal position embeddings are applied per-chunk (each chunk starts from position 0), not globally.
Three Conv2D layers, each with stride=2 in both frequency and time dimensions:
conv2d1: Conv2d(in=1, out=480, kernel=3×3, stride=2, padding=1) → GELU
conv2d2: Conv2d(in=480, out=480, kernel=3×3, stride=2, padding=1) → GELU
conv2d3: Conv2d(in=480, out=480, kernel=3×3, stride=2, padding=1) → GELU
Input shape: [1, 1, 128, T] (batch, channel, mel_bins, time_frames)
After 3 convolutions, frequency dimension: 128 → 64 → 32 → 16
Output is reshaped: [1, 480, 16, T/8] → permute → [1, T/8, 480×16] = [1, T/8, 7680]
Then projected: conv_out: Linear(7680 → d_model, no bias)
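Putting the per-chunk split, Conv2D stack, reshape, and projection together, a hedged sketch of one chunk's forward pass (the helper and weight-list layout are illustrative, not the actual module names):

```python
import torch
import torch.nn.functional as F

def conv_frontend_chunk(mel_chunk, conv_weights, conv_biases, conv_out_weight):
    """mel_chunk: [128, t] (t <= 100); conv_weights/biases: the three Conv2D params;
    conv_out_weight: [d_model, 7680]. Returns [t/8, d_model]."""
    x = mel_chunk[None, None]                               # [1, 1, 128, t]
    for w, b in zip(conv_weights, conv_biases):
        x = F.gelu(F.conv2d(x, w, b, stride=2, padding=1))  # halves freq and time
    x = x.permute(0, 3, 1, 2)                               # [1, t/8, 480, 16]
    x = x.reshape(x.shape[1], -1)                           # [t/8, 7680]
    return x @ conv_out_weight.T                            # [t/8, d_model]
```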
Standard sinusoidal embeddings (not RoPE) added after conv projection:
log_timescale_increment = log(10000) / (d_model/2 - 1)
inv_timescales = exp(-arange(d_model/2) * log_timescale_increment)
pe = concat(sin(pos * inv_timescales), cos(pos * inv_timescales))  # [seq, d_model]
| Parameter | 1.7B | 0.6B |
|---|---|---|
| d_model | 1024 | 896 |
| n_layers | 24 | 18 |
| n_heads | 16 | 14 |
| head_dim | 64 | 64 |
| FFN dim | 4096 | 3584 |
| Norm | LayerNorm (with bias) | LayerNorm (with bias) |
| Attention | Full (bidirectional) | Full (bidirectional) |
| Biases | YES (all Q,K,V,Out,FC1,FC2 + norms) | YES |
Per-layer computation:
residual = h
h_norm = LayerNorm(h, self_attn_layer_norm)
q = h_norm @ Wq + bq
k = h_norm @ Wk + bk
v = h_norm @ Wv + bv
attn_out = full_attention(q, k, v) # bidirectional, no mask
h = residual + (attn_out @ Wo + bo)
residual = h
h_norm = LayerNorm(h, final_layer_norm)
ffn_out = GELU(h_norm @ W_fc1 + b_fc1) @ W_fc2 + b_fc2
h = residual + ffn_out
After the final encoder LayerNorm (ln_post):
h = LayerNorm(h, ln_post) # with bias
h = GELU(h @ proj1 + b_proj1) # d_model → d_model
h = h @ proj2 + b_proj2 # d_model → output_dim (= decoder hidden_size)
For 1.7B: 1024 → 1024 → 2048. For 0.6B: 896 → 896 → 1024.
| Parameter | 1.7B | 0.6B |
|---|---|---|
| hidden_size | 2048 | 1024 |
| n_layers | 28 | 28 |
| n_heads | 16 | 16 |
| n_kv_heads | 8 (GQA 2:1) | 8 (GQA 2:1) |
| head_dim | 128 | 128 |
| intermediate_size | 6144 | 3072 |
| Norm | RMSNorm (eps=1e-6) | RMSNorm (eps=1e-6) |
| Position | RoPE (theta=1e6, NeoX-style) | RoPE (theta=1e6, NeoX-style) |
| Attention | causal | causal |
| Biases | NO (none in decoder) | NO |
| Vocab size | 151,936 | 151,936 |
| Tied embeddings | yes (embed_tokens == lm_head) | yes |
The decoder applies per-head RMSNorm on Q and K after linear projection but before RoPE:
q = q_proj(h_norm) # [seq, n_heads * head_dim]
q = q.view(seq, n_heads, head_dim) # [seq, 16, 128]
q = RMSNorm_per_head(q, q_norm_weight) # normalize each head independently
# Then apply RoPE
The q_norm and k_norm weights have shape [head_dim] = [128].
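A minimal sketch of that per-head RMSNorm (assuming eps=1e-6, the same value used elsewhere in the decoder):

```python
import torch

def rmsnorm_per_head(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    """x: [seq, n_heads, head_dim]; weight: [head_dim]. Normalizes each head independently."""
    var = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(var + eps)).to(x.dtype) * weight
```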
The decoder uses standard NeoX-style RoPE (rotate_half):
inv_freq = 1.0 / (theta ** (arange(0, head_dim, 2) / head_dim)) # [64]
angles = positions * inv_freq # [seq, 64]
emb = cat(angles, angles) # [seq, 128] (duplicate for full head_dim)
cos, sin = emb.cos(), emb.sin()
# rotate_half: x1 = x[..., :64], x2 = x[..., 64:]
# result = x * cos + cat(-x2, x1) * sin
Note: The config mentions MRoPE with mrope_section=[24,20,20] and interleaved=True.
For ASR (audio-only, no spatial dims), all three position dimensions are identical,
so MRoPE reduces to standard RoPE. The "interleaved" flag refers to how MRoPE
sections are mixed, not the per-pair rotation style.
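A sketch of this NeoX-style rotation, matching the formulas above (the helper name is illustrative):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, theta: float = 1e6):
    """x: [seq, n_heads, head_dim]; positions: [seq] integer positions."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))  # [64]
    angles = positions[:, None].float() * inv_freq[None, :]                        # [seq, 64]
    emb = torch.cat([angles, angles], dim=-1)                                       # [seq, 128]
    cos = emb.cos()[:, None, :]                          # broadcast over heads
    sin = emb.sin()[:, None, :]
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin
```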
Per-layer computation for hidden state h at positions pos..pos+seq-1:
- Input RMSNorm: `x = RMSNorm(h, input_layernorm, eps=1e-6)`
- QKV projections (GQA): `q = x @ Wq^T` → `[seq, n_heads×128]` → reshape `[seq, 16, 128]`; `k = x @ Wk^T` → `[seq, n_kv_heads×128]` → reshape `[seq, 8, 128]`; `v = x @ Wv^T` → `[seq, n_kv_heads×128]`
- Per-head Q/K RMSNorm (eps=1e-6, weight shape [128])
- RoPE on Q and K (NeoX style, theta=1e6)
- KV cache: append K, V to per-layer cache
- Causal attention: scale=1/sqrt(128), GQA repeat 2:1
- Output projection + residual: `h = h + attn_out @ Wo^T`
- Post-attention RMSNorm: `h_norm = RMSNorm(h, post_attention_layernorm, eps=1e-6)`
- SwiGLU MLP + residual: `gate = silu(h_norm @ W_gate^T)`, `up = h_norm @ W_up^T`, `h = h + (gate * up) @ W_down^T`
After last layer: h = RMSNorm(h, norm.weight), then logits = h @ lm_head^T.
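For the attention step specifically, a sketch of causal GQA with a per-layer KV cache (the 2× repeat comes from 16 query heads over 8 KV heads; function and variable names are illustrative):

```python
import math
import torch

def gqa_attention(q, k_new, v_new, k_cache=None, v_cache=None):
    """q: [seq, 16, 128]; k_new/v_new: [seq, 8, 128]; caches: [past, 8, 128] or None."""
    k = k_new if k_cache is None else torch.cat([k_cache, k_new], dim=0)
    v = v_new if v_cache is None else torch.cat([v_cache, v_new], dim=0)
    k_rep = k.repeat_interleave(2, dim=1)                   # [total, 16, 128]
    v_rep = v.repeat_interleave(2, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k_rep) / math.sqrt(128)
    past = k.shape[0] - q.shape[0]
    q_pos = torch.arange(q.shape[0]) + past                  # absolute query positions
    k_pos = torch.arange(k.shape[0])
    causal = k_pos[None, None, :] > q_pos[None, :, None]     # future keys are masked
    scores = scores.masked_fill(causal, float("-inf"))
    out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_rep)
    return out.reshape(q.shape[0], -1), k, v                 # [seq, 2048] + updated cache
```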
<|endoftext|> = 151643 (pad token, EOS)
<|im_start|> = 151644
<|im_end|> = 151645 (EOS)
<|audio_start|> = 151669
<|audio_end|> = 151670
<|audio_pad|> = 151676 (placeholder for audio embeddings)
<asr_text> = 151704 (marks start of transcription text)
EOS token IDs: {151643, 151645}
Uses GPT-2 style byte-level BPE from vocab.json. The vocabulary maps
byte-encoded strings to token IDs. Characters are encoded using the GPT-2
bytes-to-unicode mapping (printable ASCII + extended Latin-1, with remaining
bytes mapped to Unicode chars starting at U+0100).
To decode: look up token string in inverted vocab → convert each character through reverse byte mapping → decode resulting bytes as UTF-8.
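A minimal decode sketch using the standard GPT-2 bytes-to-unicode table (`vocab` here is assumed to be the id → string inverse of vocab.json):

```python
def gpt2_bytes_to_unicode():
    # Printable ASCII and extended Latin-1 map to themselves; remaining bytes start at U+0100.
    bs = (list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs, n = bs[:], 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

def decode_tokens(token_ids, vocab):
    char_to_byte = {c: b for b, c in gpt2_bytes_to_unicode().items()}
    joined = "".join(vocab[t] for t in token_ids)
    return bytes(char_to_byte[c] for c in joined).decode("utf-8", errors="replace")
```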
The prompt template for ASR:
<|im_start|>system\n<|im_end|>\n<|im_start|>user\n<|audio_start|><|audio_pad|>×N<|audio_end|><|im_end|>\n<|im_start|>assistant\n
As token IDs:
PREFIX: [151644, 8948, 198, 151645, 198, 151644, 872, 198, 151669]
AUDIO: [151676] × N_audio_tokens
SUFFIX: [151670, 151645, 198, 151644, 77091, 198]
Where N_audio_tokens equals the number of encoder output tokens (after 8× conv downsampling).
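A small sketch of assembling the prompt from these IDs (the helper name is illustrative):

```python
PREFIX = [151644, 8948, 198, 151645, 198, 151644, 872, 198, 151669]
SUFFIX = [151670, 151645, 198, 151644, 77091, 198]
AUDIO_PAD = 151676

def build_prompt(n_audio_tokens: int) -> list:
    return PREFIX + [AUDIO_PAD] * n_audio_tokens + SUFFIX
```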
1.7B:
- model-00001-of-00002.safetensors + model-00002-of-00002.safetensors: ~4.7 GB total, BF16
- model.safetensors.index.json: weight-to-shard mapping
- vocab.json + merges.txt: BPE tokenizer
- config.json: model configuration
0.6B:
- model.safetensors: ~1.9 GB, BF16 (single file)
- Same tokenizer and config files
Audio Encoder (prefix: thinker.audio_tower.):
conv2d1.weight [480, 1, 3, 3] + bias [480]
conv2d2.weight [480, 480, 3, 3] + bias [480]
conv2d3.weight [480, 480, 3, 3] + bias [480]
conv_out.weight [d_model, 7680] (no bias)
layers.{i}.self_attn.q_proj.weight [d_model, d_model] + bias
layers.{i}.self_attn.k_proj.weight [d_model, d_model] + bias
layers.{i}.self_attn.v_proj.weight [d_model, d_model] + bias
layers.{i}.self_attn.out_proj.weight [d_model, d_model] + bias
layers.{i}.self_attn_layer_norm.weight [d_model] + bias
layers.{i}.fc1.weight [ffn_dim, d_model] + bias
layers.{i}.fc2.weight [d_model, ffn_dim] + bias
layers.{i}.final_layer_norm.weight [d_model] + bias
ln_post.weight [d_model] + bias
proj1.weight [d_model, d_model] + bias
proj2.weight [output_dim, d_model] + bias
Token Embeddings:
thinker.model.embed_tokens.weight [151936, hidden_size]
LM Head (tied with embeddings):
thinker.lm_head.weight [151936, hidden_size]
LLM Decoder (prefix: thinker.model.layers.{i}.):
input_layernorm.weight [hidden_size]
self_attn.q_proj.weight [n_heads×128, hidden_size]
self_attn.k_proj.weight [n_kv_heads×128, hidden_size]
self_attn.v_proj.weight [n_kv_heads×128, hidden_size]
self_attn.o_proj.weight [hidden_size, n_heads×128]
self_attn.q_norm.weight [128] (per-head RMSNorm)
self_attn.k_norm.weight [128] (per-head RMSNorm)
post_attention_layernorm.weight [hidden_size]
mlp.gate_proj.weight [intermediate, hidden_size]
mlp.up_proj.weight [intermediate, hidden_size]
mlp.down_proj.weight [hidden_size, intermediate]
Plus thinker.model.norm.weight [hidden_size] (final norm). NO biases in decoder.
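A sketch of loading these tensors with the safetensors package (the shard paths are whichever files ship with the chosen model size; the float32 cast is an illustrative choice):

```python
import torch
from safetensors.torch import load_file

def load_weights(shard_paths):
    weights = {}
    for path in shard_paths:                  # one file (0.6B) or two shards (1.7B)
        weights.update(load_file(path))
    # e.g. weights["thinker.audio_tower.conv2d1.weight"].shape == (480, 1, 3, 3)
    return {k: v.to(torch.float32) for k, v in weights.items()}
```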
Unlike Voxtral which adds audio+text embeddings at every position, Qwen3-ASR
uses a replacement strategy: audio embeddings replace <|audio_pad|> token
embeddings at their positions.
- Build prompt: Construct input_ids with PREFIX + `<|audio_pad|>`×N + SUFFIX.
- Embed tokens: Look up all token embeddings via `embed_tokens`.
- Replace audio positions: Find positions where `input_ids == 151676` and replace those embeddings with the corresponding audio encoder outputs (see the sketch after this list).
- Prefill: Feed the combined embedding sequence through the decoder to build KV caches. Generate the first token from the last prefill position.
- Autoregressive decode: For each subsequent step, embed the previous token, feed through the decoder, greedy argmax. Stop on EOS.
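A sketch of the replace-then-prefill step described above (`embed_tokens` and `audio_embeds` are assumed to come from the surrounding implementation):

```python
import torch

AUDIO_PAD = 151676

def build_prefill_embeds(input_ids, embed_tokens, audio_embeds):
    """input_ids: [seq] LongTensor; embed_tokens: [vocab, hidden]; audio_embeds: [n_audio, hidden]."""
    h = embed_tokens[input_ids]                              # [seq, hidden] (a copy)
    audio_pos = (input_ids == AUDIO_PAD).nonzero(as_tuple=True)[0]
    assert audio_pos.numel() == audio_embeds.shape[0]
    h[audio_pos] = audio_embeds                              # replace, not add
    return h                                                 # fed to the decoder for prefill
```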
The model generates text in the format:
language English<asr_text>The actual transcription text.<|im_end|>
Parse by splitting on <asr_text> and taking the text after it.
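For example (helper name illustrative):

```python
def parse_transcript(generated: str) -> str:
    # Everything after <asr_text> is the transcription; strip the trailing EOS marker.
    return generated.split("<asr_text>", 1)[-1].replace("<|im_end|>", "").strip()
```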
The official qwen-asr Python package handles long audio via two modes:
non-streaming (batch) and streaming (incremental).
Found in qwen_asr/inference/utils.py and qwen_asr/inference/qwen3_asr.py.
Key constants:
SAMPLE_RATE = 16000
MAX_ASR_INPUT_SECONDS = 1200 # 20 minutes per chunk (!)
MAX_FORCE_ALIGN_INPUT_SECONDS = 180 # 3 minutes when using forced alignment
MIN_ASR_INPUT_SECONDS = 0.5 # minimum 500ms, zero-padded if shorter
Segmentation function:
def split_audio_into_chunks(wav, sr, max_chunk_sec,
                            search_expand_sec=5.0,
                            min_window_ms=100.0):
Algorithm:
- If audio <= `max_chunk_sec`, process as a single chunk.
- Otherwise, iteratively split at `max_chunk_sec` boundaries:
  - Search within ±5 seconds around the cut point.
  - Compute energy in 100ms sliding windows (using np.convolve).
  - Split at the lowest-energy sample (silence/pause).
- Any chunk shorter than 0.5 seconds is zero-padded to 500ms.
- Each chunk is processed independently through the full pipeline (mel → encoder → decoder), and text results are concatenated.
Important: The default max chunk is 1200 seconds (20 minutes), meaning
the official pipeline almost never segments audio. The 30-second chunk_length
in preprocessor_config.json is a legacy Whisper field and is NOT used.
No VAD (Voice Activity Detection) is used. The model handles silence natively.
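A sketch of the energy-based split-point search described above (a re-implementation of the idea in NumPy, not the official utils.py code):

```python
import numpy as np

def find_split_point(wav, sr, cut_sample, search_expand_sec=5.0, min_window_ms=100.0):
    """Return the lowest-energy sample index within ±search_expand_sec of cut_sample."""
    lo = max(0, cut_sample - int(search_expand_sec * sr))
    hi = min(len(wav), cut_sample + int(search_expand_sec * sr))
    win = max(1, int(min_window_ms / 1000 * sr))
    energy = np.convolve(wav[lo:hi] ** 2, np.ones(win), mode="same")  # ~100ms sliding energy
    return lo + int(np.argmin(energy))                                # likely silence/pause
```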
Found in qwen_asr/inference/qwen3_asr.py (lines 584-830).
Key parameters (code defaults):
- chunk_size_sec: 2.0 seconds (configurable)
- unfixed_chunk_num: 2 (first 2 chunks have no text prefix)
- unfixed_token_num: 5 (rollback: drop last 5 tokens from prefix)
Paper parameters (arXiv:2601.21337): "2-second chunk size, a 5-token fallback, and keeping the last four chunks unfixed." The paper says 4 unfixed chunks vs 2 in the code default — this may vary by evaluation setting.
Algorithm:
- Audio arrives in arbitrary-sized pieces, buffered until `chunk_size_sec` seconds of audio accumulate.
- On each trigger, ALL accumulated audio from the start is re-fed through the encoder (not just the new chunk).
- The decoder prompt includes the previous transcription as a text prefix:
  - First `unfixed_chunk_num` chunks: no prefix (cold start).
  - Later chunks: the previous decoded text minus the last `unfixed_token_num` tokens is prepended. This "prefix rollback" reduces boundary jitter (the last few tokens may be unstable and get corrected with more context).
- The decoder generates from where the prefix ends, producing new text.
- Output is the full accumulated transcription.
Critical detail: Streaming re-processes the entire audio through the encoder each time. This is O(n²) in audio length but bounded by the 2-second chunk interval. The prefix rollback strategy is what makes this produce coherent output despite incremental processing.
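A simplified sketch of that streaming loop (`transcribe` is an assumed helper that runs mel → encoder → decoder over all buffered audio with an optional token prefix and returns the decoded token list):

```python
def streaming_transcribe(audio_pieces, sr=16000, chunk_size_sec=2.0,
                         unfixed_chunk_num=2, unfixed_token_num=5):
    buffered, tokens, chunk_idx = [], [], 0
    for piece in audio_pieces:                            # arbitrary-sized audio pieces
        buffered.extend(piece)
        while len(buffered) >= (chunk_idx + 1) * int(chunk_size_sec * sr):
            chunk_idx += 1
            # cold start for the first chunks, then prefix rollback of the last N tokens
            prefix = [] if chunk_idx <= unfixed_chunk_num else tokens[:-unfixed_token_num]
            tokens = transcribe(buffered, prefix)         # re-encodes ALL audio so far
            yield tokens                                  # full transcription so far
```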
Our C pipeline uses a simplified approach: energy-based silence splitting (same algorithm as the official non-streaming mode) with configurable segment size. Each segment is processed independently with a fresh KV cache. Tokens are streamed to a callback as they are decoded.
This is a practical middle ground: shorter segments (e.g. 10 seconds) give lower latency output, while longer segments (e.g. 30+ seconds) give the model more context for better accuracy. The official pipeline's streaming mode with prefix rollback would require significant additional complexity (re-encoding all audio, managing text prefixes, token rollback).
Key facts from the Qwen3-ASR technical report:
Training: 4-stage pipeline:
- AuT encoder pretraining on ~40M hours of pseudo-labeled ASR data
- Omni pretraining with 3 trillion tokens (multi-modal)
- ASR supervised fine-tuning with multilingual + streaming + context biasing data
- Reinforcement learning (GSPO) on ~50k utterances
Languages: 30 languages + 22 Chinese dialects (52 total for ASR).
Encoder: Dynamic attention windows ranging 1s-8s during training.
At inference: 8s window (n_window_infer=800). AuT encoder is pretrained
separately, then integrated with the Qwen3 LLM.
Performance (0.6B): 92ms average time-to-first-token, RTF 0.064 at 128 concurrency. Can process ~2000 seconds of speech per second at scale.
Benchmarks: LibriSpeech WER 1.63-3.38 (1.7B), competitive with GPT-4o-Transcribe and Gemini-2.5-Pro. WenetSpeech CER 4.97-5.88, outperforming commercial APIs.