Gemma 4: add audio and video multimodal support by atlascodesai · Pull Request #343 · ml-explore/mlx-swift-lm

atlascodesai · 2026-06-12T20:56:23Z

hiii, the PR description and code below were written by ai agents with my guidance. i've followed the testing and contribution guidance of this repo and have tested on an M5 MacBook Pro, iPhone 16 and iPhone 17.

we found something interesting: it seems an app can't access more than 6 GB of memory on the iPhone 17, so if an Apple engineer could comment on that it would be great! i'll be sharing this in the dev forums too, since i tried asking in the WWDC AI/ML group lab and was pointed there.

hoping this can be merged so i can retire my fork :)

one other note, i added ~2MB of test media so the integration tests can validate real audio/video perception - per-file sources and licenses are in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md, happy to switch to run-time fetches if you'd rather not carry media in-repo.

thanks!

AI disclosure

This contribution was developed with AI assistance: implemented and reviewed by Anthropic's Claude (Fable 5, Opus 4.8) with an independent review pass by OpenAI's GPT-5.5 (Codex CLI). All changes were human-directed and verified end-to-end on real hardware before submission.

Proposed changes

Adds the two missing Gemma 4 modalities to MLXVLM, plus a checkpoint-loading fix.

This builds directly on prior community work:

- Add Gemma 4 video tower support (#256): the video path here starts from that PR and carries its processor tests forward.
- Add Gemma 4 audio tower support (ASR via Conformer encoder) (#192): the audio path began as a port of that PR, rebased onto main's newer UserInput.Audio API, with the mel extractor and prompt splice corrected against the Python reference.
- VincentGourbin/gemma-4-swift-mlx served as a reference implementation during debugging, and the model port is based on the Python mlx-vlm gemma4 (credited in the file header). Thanks to all three authors.

Audio — Conformer/USM audio tower with a vDSP-based log-mel feature extractor (mel framing and filter bank validated against Python reference output; fixture-backed alignment tests included). Feature-extraction parameters (sampling_rate, num_mel_filters, fft_length, hop_length) and audio_seq_length are decoded from the checkpoint's processor_config.json rather than hardcoded. The prompt splice follows the reference <|audio> … <audio|> block format. LMInput.ProcessedAudio gains an optional padding mask (backward-compatible, defaulted). Token-count mismatches and multiple audio clips throw typed errors rather than crashing; one audio clip per request is supported for now.
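As a rough illustration of reading the extractor parameters from processor_config.json instead of hardcoding them, here is a minimal Codable sketch. The type names, the placement of audio_seq_length at the top level of the JSON, and the fallback to 750 when the key is absent are all assumptions made for this example; the PR's actual declarations may differ.

```swift
import Foundation

// Hypothetical types for illustration only; not the PR's actual declarations.
struct FeatureExtractorConfig: Decodable {
    var samplingRate: Int
    var numMelFilters: Int
    var fftLength: Int
    var hopLength: Int

    enum CodingKeys: String, CodingKey {
        case samplingRate = "sampling_rate"
        case numMelFilters = "num_mel_filters"
        case fftLength = "fft_length"
        case hopLength = "hop_length"
    }
}

struct AudioProcessorConfig: Decodable {
    var featureExtractor: FeatureExtractorConfig
    var audioSeqLength: Int

    enum CodingKeys: String, CodingKey {
        case featureExtractor = "feature_extractor"
        case audioSeqLength = "audio_seq_length"  // assumed to sit at the top level
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        featureExtractor = try container.decode(
            FeatureExtractorConfig.self, forKey: .featureExtractor)
        // Assumption: fall back to 750 only when the checkpoint omits the key.
        audioSeqLength =
            try container.decodeIfPresent(Int.self, forKey: .audioSeqLength) ?? 750
    }
}

/// Decodes the audio-related part of a processor config from a JSON string.
func decodeAudioProcessorConfig(from json: String) throws -> AudioProcessorConfig {
    try JSONDecoder().decode(AudioProcessorConfig.self, from: Data(json.utf8))
}
```

A caller would pass the file contents to decodeAudioProcessorConfig(from:) and hand the decoded values to the mel extractor, so a checkpoint with a different hop length or filter count needs no code change.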

Video: frame sampling via MediaProcessing with per-frame timestamped placeholder expansion. The frame budget comes from the config's video_max_frames (default 16) and can be tightened per request via a new UserInput.Processing.videoMaxFrames override (same pattern as the existing minPixels/maxPixels). Memory-constrained callers pass a lower cap, so the engine itself stays platform-neutral. See the memory note at the end for why this exists.
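The budget rule described above (config value as the upper bound, per-request override may only tighten it) can be sketched as follows. Both function names are hypothetical, and the evenly spaced index selection is an assumption for illustration; the PR does not state how MediaProcessing picks frames.

```swift
/// Resolves the frame budget: the config value is the upper bound, and a
/// per-request override can only lower it (never below one frame).
func effectiveFrameBudget(configMax: Int, requestOverride: Int?) -> Int {
    guard let requested = requestOverride else { return configMax }
    return min(configMax, max(1, requested))
}

/// Picks up to `budget` evenly spaced frame indices from a clip.
/// Assumption: uniform spacing including the first and last frame.
func sampleFrameIndices(totalFrames: Int, budget: Int) -> [Int] {
    guard totalFrames > 0, budget > 0 else { return [] }
    let count = min(totalFrames, budget)
    if count == 1 { return [0] }
    let step = Double(totalFrames - 1) / Double(count - 1)
    return (0..<count).map { Int((Double($0) * step).rounded()) }
}
```

With a config maximum of 16, a request override of 8 yields a budget of 8, while an override of 32 is clamped back to 16.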

Loading fix: for Gemma 3n-style checkpoints that set num_kv_shared_layers, the KV-shared tail layers are constructed without local K/V projections, and sanitize drops the redundant tail K/V weights that non-QAT exports ship (scoped strictly to the text backbone so vision/audio tower weights are untouched).
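The scoping idea behind the sanitize change can be sketched with a plain dictionary filter. The key prefix, the k_proj/v_proj naming, and the function name are assumptions for this example, and the value type is generic so the sketch runs without MLX; the real sanitize operates on the checkpoint's actual key layout.

```swift
/// Drops K/V projection weights for the KV-shared tail layers of the text
/// backbone only. Keys outside `textPrefix` (vision/audio towers) are kept.
/// The prefix and the "k_proj"/"v_proj" names are assumed for illustration.
func dropSharedKVTailWeights<Value>(
    _ weights: [String: Value],
    numLayers: Int,
    numKVSharedLayers: Int,
    textPrefix: String = "language_model.model.layers."
) -> [String: Value] {
    let firstSharedLayer = numLayers - numKVSharedLayers
    return weights.filter { entry in
        // Anything that is not a text-backbone layer weight is untouched.
        guard entry.key.hasPrefix(textPrefix) else { return true }
        let parts = entry.key.dropFirst(textPrefix.count).split(separator: ".")
        guard let first = parts.first, let layer = Int(first),
            layer >= firstSharedLayer
        else { return true }
        // In a shared tail layer, drop only the K/V projections.
        return !(parts.contains("k_proj") || parts.contains("v_proj"))
    }
}
```

Matching on the text-backbone prefix first is what keeps identically named projections in the vision and audio towers out of the filter.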

Tests — hermetic unit tests (sanitize scoping, processor config decode, mel alignment vs Python fixtures) run in MLXLMTests with no downloads; end-to-end audio/video inference tests live in IntegrationTesting following the existing convention. Adds ~2 MB of media fixtures (LibriSpeech CC-BY-4.0, Big Buck Bunny CC-BY-3.0, own CC0 TTS clips — provenance in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md).

Verified end-to-end in Local AI Cat (a shipping app that consumes this branch as its MLX engine) on iPhone 16 Pro Max, iPhone 17 Pro Max, and Apple Silicon macOS, across text / image / audio / video / combined prompts. The throughput figures are from those on-device runs (Gemma 4 E2B QAT 4-bit: ~29 tok/s on the 17 Pro Max, ~19 tok/s on the 16 Pro Max).

On scope: we kept audio and video in one PR because they share the prompt-splice and LMInput plumbing and were verified together (a single combined image + video + audio prompt is one of the test cases) — but happy to split if you'd prefer; the shared-KV loading fix in particular is self-contained and could land separately first.

Checklist

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

Note on the iPhone memory ceiling (why `videoMaxFrames` exists)

We measured the per-app jetsam ceiling at a flat ~6144 MB on both an 8 GB iPhone 16 Pro Max and a 12 GB iPhone 17 Pro Max (with the increased-memory-limit entitlement; the extra RAM only buys compute, not budget). With Gemma 4 E2B loaded (~4.4 GB), that leaves ~1.7 GB of prefill headroom. 8 video frames is robust under sustained thermal stress, while 16 frames gets jetsam-killed once the device is hot, hence the per-request cap. At the debug-only +512 MB ceiling, 16 frames becomes robust on the same hardware, so the limit looks like OS policy rather than silicon. If there's any path for a foreground on-device-ML app to use more of a 12 GB device, we'd love to hear it. We're filing a Feedback as well.
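For reference, the headroom arithmetic behind the ~1.7 GB figure can be written out as below. Only the 6144 MB ceiling and the ~4.4 GB model size come from the measurements above; whether 4.4 GB is meant as decimal or binary units is not stated, so both readings are shown, and the names are hypothetical.

```swift
// Measured per-app ceiling from the note above, in MB.
let jetsamCeilingMB = 6144

/// Memory left for prefill once the model weights are resident.
func prefillHeadroomMB(modelMB: Int) -> Int {
    jetsamCeilingMB - modelMB
}

// ~4.4 GB read as decimal (4400 MB) or binary (about 4506 MB).
let headroomDecimalMB = prefillHeadroomMB(modelMB: 4400)
let headroomBinaryMB = prefillHeadroomMB(modelMB: 4506)
```

Either reading lands between roughly 1.6 and 1.75 GB, consistent with the ~1.7 GB quoted.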

- Replace the iOS-only #if frame cap and GEMMA4_VIDEO_MAX_FRAMES env knob with a per-request UserInput.Processing.videoMaxFrames override (config.videoMaxFrames remains the upper bound)
- Throw Gemma4Error.audioTokenCountMismatch instead of crashing when the prompt's audio placeholder count diverges from the encoder output
- Throw on multiple audio inputs instead of silently dropping clips
- Honor processor_config.json feature_extractor block (sampling_rate, num_mel_filters, fft_length, hop_length) and audio_seq_length instead of hardcoding extractor defaults / 750
- Scope the useClippedLinears weight strip to the vision tower so audio clip parameters survive vision-only configs
- Delete dead Gemma4AudioRelativePositionEmbedding (never instantiated; superseded by the inline rel-pos logic in Gemma4AudioAttention)
- Delete no-op Gemma4AudioTests.swift and unused fixtures (incl. stray personal debugging artifacts); trim Package.swift resources
- Load the mel alignment fixture via Bundle.module, compare all 10x128 reference values, rename to Gemma4AudioAlignmentTests.swift
- Document ProcessedAudio.mask polarity; drop unused processVideoFrame parameter; clean headers, comments, and integration-test prints
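The two "throw instead of crash or drop" changes can be illustrated with a small validation sketch. The enum here is a local stand-in: only the case name audioTokenCountMismatch appears in the commit message, while its associated values, the multipleAudioInputs case, and the validateAudioInputs function are hypothetical.

```swift
// Local stand-in for illustration; the PR's Gemma4Error may be shaped differently.
enum Gemma4Error: Error, Equatable {
    case audioTokenCountMismatch(expected: Int, actual: Int)
    case multipleAudioInputs(count: Int)  // hypothetical case name
}

/// Validates audio inputs before splicing encoder output into the prompt.
/// Throws a typed error rather than trapping or silently dropping clips.
func validateAudioInputs(
    clipCount: Int,
    placeholderCount: Int,
    encoderTokenCount: Int
) throws {
    // One audio clip per request is supported for now.
    guard clipCount <= 1 else {
        throw Gemma4Error.multipleAudioInputs(count: clipCount)
    }
    // With a clip present, placeholders must match the encoder output length.
    guard clipCount == 0 || placeholderCount == encoderTokenCount else {
        throw Gemma4Error.audioTokenCountMismatch(
            expected: placeholderCount, actual: encoderTokenCount)
    }
}
```

Callers can then catch the specific case and surface a readable message instead of losing the whole process to a trap.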

The alignment fixture was generated from the Python reference before the Swift extractor adopted semicausal padding and bin-rounded mel triangles, and the old assertions were too narrow (frame 0, first 10 bins) to notice:
- account for the one-frame offset the semicausal pad introduces (Swift frame f+1 covers the same samples as reference frame f)
- compare every signal-carrying reference value instead of a 10-value corner; exclude only the near-floor low bins where the rounded-bin filter bank deliberately diverges from the generation-era bank
- pin the deterministic collapsed-filter set of the rounded-bin bank exactly instead of asserting the (incorrect) at-most-one-empty rule
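The one-frame offset in the comparison can be sketched as below. This is a simplified illustration with a hypothetical function name: it compares every bin and omits the near-floor low-bin exclusion that the real test applies.

```swift
/// Largest absolute difference between Swift mel frames and reference frames,
/// comparing Swift frame f + frameOffset against reference frame f.
/// Simplified: no bins are excluded, unlike the test described above.
func maxAbsDifference(
    swiftFrames: [[Float]],
    referenceFrames: [[Float]],
    frameOffset: Int = 1
) -> Float {
    var worst: Float = 0
    for (frame, referenceRow) in referenceFrames.enumerated() {
        let swiftIndex = frame + frameOffset
        // Stop once the shifted index runs past the Swift output.
        guard swiftIndex < swiftFrames.count else { break }
        for (a, b) in zip(swiftFrames[swiftIndex], referenceRow) {
            worst = max(worst, abs(a - b))
        }
    }
    return worst
}
```

A test would then assert that the returned value stays under a chosen tolerance across all reference frames.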

atlascodesai and others added 5 commits June 12, 2026 11:00

feat(gemma4): add audio tower and fixtures

fdf2c65

feat(gemma4): add video tower and iOS frame cap

2c409ca

fix(gemma4): load shared KV tail layers without local modules

a1726e6

atlascodesai wants to merge 5 commits into ml-explore:main from atlas-open-sources:gemma4-multimodal-pr-v2
