Gemma 4: add audio and video multimodal support #343

Open
atlascodesai wants to merge 5 commits into ml-explore:main from atlas-open-sources:gemma4-multimodal-pr-v2
atlascodesai commented Jun 12, 2026

hi, the PR description and code below were written by AI agents with my guidance. i've followed the testing and contribution guidance of this repo and have tested on an M5 MacBook Pro, iPhone 16 and iPhone 17.

we found something interesting: it seems that an app can't access more than 6 GB of memory on the iPhone 17, so if an Apple engineer could comment on that it would be great! i'll be sharing this in the dev forums too, since i tried to ask about it in the WWDC AI/ML group lab and was pointed there.

hoping this can be merged so i can retire my fork :)

one other note, i added ~2MB of test media so the integration tests can validate real audio/video perception - per-file sources and licenses are in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md, happy to switch to run-time fetches if you'd rather not carry media in-repo.

thanks!


AI disclosure

This contribution was developed with AI assistance: implemented and reviewed by Anthropic's Claude (Fable 5, Opus 4.8) with an independent review pass by OpenAI's GPT-5.5 (Codex CLI). All changes were human-directed and verified end-to-end on real hardware before submission.

Proposed changes

Adds the two missing Gemma 4 modalities to MLXVLM, plus a checkpoint-loading fix.

This builds directly on prior community work:

Audio — Conformer/USM audio tower with a vDSP-based log-mel feature extractor (mel framing and filter bank validated against Python reference output; fixture-backed alignment tests included). Feature-extraction parameters (sampling_rate, num_mel_filters, fft_length, hop_length) and audio_seq_length are decoded from the checkpoint's processor_config.json rather than hardcoded. The prompt splice follows the reference <|audio><audio|> block format. LMInput.ProcessedAudio gains an optional padding mask (backward-compatible, defaulted). Token-count mismatches and multiple audio clips throw typed errors rather than crashing; one audio clip per request is supported for now.
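As a rough illustration of decoding these fields rather than hardcoding them, the sketch below reads a `feature_extractor` block and `audio_seq_length` with `Codable`. The type names (`FeatureExtractorConfig`, `ProcessorConfigSketch`) are hypothetical and not the types in this PR. Leaving missing extractor fields as `nil` and falling back to 750 for `audio_seq_length` (the previously hardcoded value mentioned in the commit notes) are assumptions made for the example.

```swift
import Foundation

// Hypothetical mirror of the feature_extractor block; not the PR's actual type.
struct FeatureExtractorConfig: Codable, Equatable {
    var samplingRate: Int?
    var numMelFilters: Int?
    var fftLength: Int?
    var hopLength: Int?

    enum CodingKeys: String, CodingKey {
        case samplingRate = "sampling_rate"
        case numMelFilters = "num_mel_filters"
        case fftLength = "fft_length"
        case hopLength = "hop_length"
    }
}

// Hypothetical top-level view of processor_config.json (only the audio fields).
struct ProcessorConfigSketch: Decodable {
    var featureExtractor: FeatureExtractorConfig?
    var audioSeqLength: Int

    enum CodingKeys: String, CodingKey {
        case featureExtractor = "feature_extractor"
        case audioSeqLength = "audio_seq_length"
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        featureExtractor = try container.decodeIfPresent(
            FeatureExtractorConfig.self, forKey: .featureExtractor)
        // Assumption: fall back to 750 when the checkpoint omits the key.
        audioSeqLength = try container.decodeIfPresent(
            Int.self, forKey: .audioSeqLength) ?? 750
    }
}

func decodeProcessorConfig(_ json: String) throws -> ProcessorConfigSketch {
    try JSONDecoder().decode(ProcessorConfigSketch.self, from: Data(json.utf8))
}
```

The JSON values passed to a decoder like this come from the checkpoint; any numbers used when trying the sketch are illustrative only.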

Video — frame sampling via MediaProcessing with per-frame timestamped placeholder expansion. The frame budget comes from the config's video_max_frames (default 16) and can be tightened per request via a new UserInput.Processing.videoMaxFrames override (same pattern as the existing minPixels/maxPixels) — memory-constrained callers pass a lower cap, the engine stays platform-neutral. See the memory note at the end for why this exists.
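The interaction between the config bound and the per-request override can be sketched as below. Both function names are hypothetical. Clamping the override to at least one frame and picking evenly spaced segment midpoints are assumptions made for illustration, since the description does not say how MediaProcessing chooses frames.

```swift
/// Effective frame budget: the config value is the upper bound, and a
/// per-request override can only tighten it (never raise it).
/// Assumption: an override below 1 is clamped to 1.
func effectiveFrameBudget(configMaxFrames: Int, requestOverride: Int?) -> Int {
    if let cap = requestOverride {
        return max(1, min(configMaxFrames, cap))
    }
    return configMaxFrames
}

/// Picks up to `budget` frame indices spread evenly over `totalFrames`.
/// Assumption: uniform sampling at the midpoint of each equal-length segment.
func sampleFrameIndices(totalFrames: Int, budget: Int) -> [Int] {
    guard totalFrames > 0, budget > 0 else { return [] }
    if totalFrames <= budget { return Array(0..<totalFrames) }
    return (0..<budget).map { i in (2 * i + 1) * totalFrames / (2 * budget) }
}

// A memory-constrained caller tightens the default budget of 16 to 8.
let budget = effectiveFrameBudget(configMaxFrames: 16, requestOverride: 8)
let indices = sampleFrameIndices(totalFrames: 240, budget: budget)
```

Keeping the clamp in one small function is what lets the engine stay platform-neutral: the caller decides the cap, and the config value remains the ceiling.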

Loading fix — Gemma 3n-style num_kv_shared_layers checkpoints: KV-shared tail layers are constructed without local K/V projections, and sanitize drops the redundant tail K/V weights non-QAT exports ship (scoped strictly to the text backbone so vision/audio tower weights are untouched).
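A minimal sketch of the scoping idea follows, assuming weight keys of the form `language_model.….layers.N.self_attn.k_proj.…`. That key layout, the `language_model.` prefix, and the function names are assumptions for illustration and may not match the checkpoint or the PR's `sanitize`. The point is only that K/V projections are dropped for the last `num_kv_shared_layers` text layers, and every key outside the text backbone passes through untouched.

```swift
import Foundation

/// Extracts N from a key containing "...layers.N...", if present.
func layerIndex(in key: String) -> Int? {
    let parts = key.split(separator: ".")
    guard let i = parts.firstIndex(of: "layers"), i + 1 < parts.count else {
        return nil
    }
    return Int(parts[i + 1])
}

/// Drops K/V projection weights of the KV-shared tail layers, scoped to keys
/// under `textPrefix` (hypothetical prefix) so other towers are not affected.
func dropSharedKVWeights<T>(
    _ weights: [String: T],
    numHiddenLayers: Int,
    numKVSharedLayers: Int,
    textPrefix: String = "language_model."
) -> [String: T] {
    let firstShared = numHiddenLayers - numKVSharedLayers
    guard numKVSharedLayers > 0, firstShared >= 0 else { return weights }
    return weights.filter { entry in
        let key = entry.key
        // Keep anything outside the text backbone (vision/audio towers).
        guard key.hasPrefix(textPrefix) else { return true }
        // Only K/V projections are candidates for removal.
        guard key.contains(".k_proj.") || key.contains(".v_proj.") else { return true }
        guard let index = layerIndex(in: key) else { return true }
        return index < firstShared
    }
}
```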

Tests — hermetic unit tests (sanitize scoping, processor config decode, mel alignment vs Python fixtures) run in MLXLMTests with no downloads; end-to-end audio/video inference tests live in IntegrationTesting following the existing convention. Adds ~2 MB of media fixtures (LibriSpeech CC-BY-4.0, Big Buck Bunny CC-BY-3.0, own CC0 TTS clips — provenance in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md).

Verified end-to-end in Local AI Cat (a shipping app that consumes this branch as its MLX engine) on iPhone 16 Pro Max, iPhone 17 Pro Max, and Apple Silicon macOS, across text / image / audio / video / combined prompts. The throughput figures are from those on-device runs (Gemma 4 E2B QAT 4-bit: ~29 tok/s on the 17 Pro Max, ~19 tok/s on the 16 Pro Max).

On scope: we kept audio and video in one PR because they share the prompt-splice and LMInput plumbing and were verified together (a single combined image + video + audio prompt is one of the test cases) — but happy to split if you'd prefer; the shared-KV loading fix in particular is self-contained and could land separately first.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Note on the iPhone memory ceiling (why videoMaxFrames exists)

We measured the per-app jetsam ceiling at a flat ~6144 MB on both an 8 GB iPhone 16 Pro Max and a 12 GB iPhone 17 Pro Max (with the increased-memory-limit entitlement; the extra RAM only buys compute, not budget). With Gemma 4 E2B loaded (~4.4 GB), that leaves ~1.7 GB of prefill headroom: 8 video frames is robust under sustained thermal stress, while 16 frames gets jetsam-killed when the device is hot, hence the per-request cap. At the debug-only +512 MB ceiling, 16 frames becomes robust on the same hardware, so the limit looks like OS policy rather than silicon. If there's any path for a foreground on-device-ML app to use more of a 12 GB device, we'd love to hear it. We are filing a Feedback as well.

atlascodesai and others added 5 commits June 12, 2026 11:00
- Replace the iOS-only #if frame cap and GEMMA4_VIDEO_MAX_FRAMES env knob
  with a per-request UserInput.Processing.videoMaxFrames override
  (config.videoMaxFrames remains the upper bound)
- Throw Gemma4Error.audioTokenCountMismatch instead of crashing when the
  prompt's audio placeholder count diverges from the encoder output
- Throw on multiple audio inputs instead of silently dropping clips
- Honor processor_config.json feature_extractor block (sampling_rate,
  num_mel_filters, fft_length, hop_length) and audio_seq_length instead of
  hardcoding extractor defaults / 750
- Scope the useClippedLinears weight strip to the vision tower so audio
  clip parameters survive vision-only configs
- Delete dead Gemma4AudioRelativePositionEmbedding (never instantiated;
  superseded by the inline rel-pos logic in Gemma4AudioAttention)
- Delete no-op Gemma4AudioTests.swift and unused fixtures (incl. stray
  personal debugging artifacts); trim Package.swift resources
- Load the mel alignment fixture via Bundle.module, compare all 10x128
  reference values, rename to Gemma4AudioAlignmentTests.swift
- Document ProcessedAudio.mask polarity; drop unused processVideoFrame
  parameter; clean headers, comments, and integration-test prints
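The "throw instead of crash" behavior for audio inputs could look roughly like the sketch below. The commit names `Gemma4Error.audioTokenCountMismatch`, but its payload is not shown, so this example defines its own hypothetical error enum and validation function. The associated values and the function signature are assumptions.

```swift
// Hypothetical error type for illustration; the PR's Gemma4Error may differ.
enum AudioSpliceErrorSketch: Error, Equatable {
    case multipleAudioInputs(count: Int)
    case audioTokenCountMismatch(placeholders: Int, encoderTokens: Int)
}

/// Validates an audio request before splicing encoder output into the prompt.
/// Throws a typed error instead of trapping on a bad request.
func validateAudioSplice(
    clipCount: Int,
    placeholderCount: Int,
    encoderTokenCount: Int
) throws {
    // Only one audio clip per request is supported for now.
    guard clipCount <= 1 else {
        throw AudioSpliceErrorSketch.multipleAudioInputs(count: clipCount)
    }
    // The prompt's placeholder tokens must match the encoder output 1:1.
    guard placeholderCount == encoderTokenCount else {
        throw AudioSpliceErrorSketch.audioTokenCountMismatch(
            placeholders: placeholderCount, encoderTokens: encoderTokenCount)
    }
}
```

A caller can then surface the failure with `do`/`catch` rather than losing the whole process to a precondition failure.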

The alignment fixture was generated from the Python reference before the
Swift extractor adopted semicausal padding and bin-rounded mel triangles,
and the old assertions were too narrow (frame 0, first 10 bins) to notice:

- account for the one-frame offset the semicausal pad introduces
  (Swift frame f+1 covers the same samples as reference frame f)
- compare every signal-carrying reference value instead of a 10-value
  corner; exclude only the near-floor low bins where the rounded-bin
  filter bank deliberately diverges from the generation-era bank
- pin the deterministic collapsed-filter set of the rounded-bin bank
  exactly instead of asserting the (incorrect) at-most-one-empty rule
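The offset-aware comparison described in this commit can be sketched as follows. The function name and signature are hypothetical. The set of excluded bins is left to the caller, since the commit does not list which low bins diverge, and the one-frame offset is exposed as a defaulted parameter.

```swift
/// Largest absolute difference between extractor output and reference frames,
/// comparing extracted frame f + frameOffset against reference frame f.
/// Bins listed in `excludedBins` are skipped. Returns nil if nothing overlaps.
func maxAlignedDifference(
    extracted: [[Float]],
    reference: [[Float]],
    frameOffset: Int = 1,
    excludedBins: Set<Int> = []
) -> Float? {
    var worst: Float? = nil
    for (frame, referenceFrame) in reference.enumerated() {
        let shifted = frame + frameOffset
        guard shifted >= 0, shifted < extracted.count else { continue }
        for (bin, referenceValue) in referenceFrame.enumerated()
        where !excludedBins.contains(bin) {
            guard bin < extracted[shifted].count else { continue }
            let difference = abs(extracted[shifted][bin] - referenceValue)
            worst = max(worst ?? 0, difference)
        }
    }
    return worst
}
```

A test would then assert that the returned value stays under a chosen tolerance across every compared reference value, not just a small corner of the fixture.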