Gemma 4: add audio and video multimodal support #343
Open
atlascodesai wants to merge 5 commits into
- Replace the iOS-only #if frame cap and GEMMA4_VIDEO_MAX_FRAMES env knob with a per-request UserInput.Processing.videoMaxFrames override (config.videoMaxFrames remains the upper bound)
- Throw Gemma4Error.audioTokenCountMismatch instead of crashing when the prompt's audio placeholder count diverges from the encoder output
- Throw on multiple audio inputs instead of silently dropping clips
- Honor processor_config.json feature_extractor block (sampling_rate, num_mel_filters, fft_length, hop_length) and audio_seq_length instead of hardcoding extractor defaults / 750
- Scope the useClippedLinears weight strip to the vision tower so audio clip parameters survive vision-only configs
- Delete dead Gemma4AudioRelativePositionEmbedding (never instantiated; superseded by the inline rel-pos logic in Gemma4AudioAttention)
- Delete no-op Gemma4AudioTests.swift and unused fixtures (incl. stray personal debugging artifacts); trim Package.swift resources
- Load the mel alignment fixture via Bundle.module, compare all 10x128 reference values, rename to Gemma4AudioAlignmentTests.swift
- Document ProcessedAudio.mask polarity; drop unused processVideoFrame parameter; clean headers, comments, and integration-test prints
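The two "throw instead of crash or drop" items above could look roughly like the following. This is a minimal sketch, not the PR's code: only the name Gemma4Error.audioTokenCountMismatch comes from the commit message, while the associated values, the second case name, and the validateAudio helper are hypothetical.

```swift
// Hypothetical sketch: only Gemma4Error.audioTokenCountMismatch is named in the PR.
// The associated values, the second case, and validateAudio are illustrative assumptions.
enum Gemma4Error: Error, Equatable {
    case audioTokenCountMismatch(expected: Int, actual: Int)
    case multipleAudioInputsUnsupported(count: Int)
}

/// Checks the audio invariants before embeddings are spliced into the prompt.
func validateAudio(clipCount: Int, placeholderCount: Int, encoderTokenCount: Int) throws {
    // Reject extra clips up front instead of silently dropping them.
    guard clipCount <= 1 else {
        throw Gemma4Error.multipleAudioInputsUnsupported(count: clipCount)
    }
    // The prompt must reserve exactly one placeholder per encoder output token.
    guard placeholderCount == encoderTokenCount else {
        throw Gemma4Error.audioTokenCountMismatch(
            expected: placeholderCount, actual: encoderTokenCount)
    }
}

// Example: a prompt with 750 placeholders but an encoder that produced 748 tokens.
do {
    try validateAudio(clipCount: 1, placeholderCount: 750, encoderTokenCount: 748)
} catch {
    print("rejected: \(error)")
}
```

Callers can then match on the error case and surface a recoverable message, which is not possible when the mismatch traps inside the splice.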
The alignment fixture was generated from the Python reference before the Swift extractor adopted semicausal padding and bin-rounded mel triangles, and the old assertions were too narrow (frame 0, first 10 bins) to notice:
- account for the one-frame offset the semicausal pad introduces (Swift frame f+1 covers the same samples as reference frame f)
- compare every signal-carrying reference value instead of a 10-value corner; exclude only the near-floor low bins where the rounded-bin filter bank deliberately diverges from the generation-era bank
- pin the deterministic collapsed-filter set of the rounded-bin bank exactly instead of asserting the (incorrect) at-most-one-empty rule
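The one-frame offset and the low-bin exclusion can be illustrated with a small comparison helper. This is a sketch under stated assumptions: the function name, its parameters, and the number of skipped bins are hypothetical and are not taken from Gemma4AudioAlignmentTests.swift.

```swift
// Hypothetical helper: names and the skipped-bin count are assumptions for illustration.
func maxAlignedDifference(
    extractedFrames: [[Float]],   // frames from the semicausally padded Swift extractor
    referenceFrames: [[Float]],   // frames from the Python reference fixture
    skippingLowBins skipped: Int  // near-floor bins excluded from the comparison
) -> Float {
    var worst: Float = 0
    let start = max(0, skipped)
    for (f, reference) in referenceFrames.enumerated() {
        // Extracted frame f+1 covers the same samples as reference frame f.
        let shifted = f + 1
        guard shifted < extractedFrames.count else { break }
        let candidate = extractedFrames[shifted]
        let upper = min(reference.count, candidate.count)
        guard start < upper else { continue }
        // Compare every remaining bin, not just a small corner of the matrix.
        for bin in start..<upper {
            worst = max(worst, abs(candidate[bin] - reference[bin]))
        }
    }
    return worst
}

// Toy data: the extracted frames carry one extra leading frame and differ only in bin 0.
let reference: [[Float]] = [[0.0, 1.0, 2.0], [0.0, 3.0, 4.0]]
let extracted: [[Float]] = [[9.0, 9.0, 9.0], [5.0, 1.0, 2.0], [5.0, 3.0, 4.0]]
print(maxAlignedDifference(extractedFrames: extracted, referenceFrames: reference, skippingLowBins: 1))
```

A real test would assert that the returned value stays under a tolerance chosen for log-mel features; that tolerance is left out here because the source does not state one.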
hiii, the PR description and code below were written by AI agents with my guidance. i've followed the testing and contribution guidance of this repo and have tested on an M5 MacBook Pro, iPhone 16 and iPhone 17.
we found something interesting: it seems an app can't access more than 6 GB of memory on the iPhone 17, so if an Apple engineer could comment on that it would be great! will be sharing this in the dev forums too, as I tried asking about it in the WWDC AI/ML group lab and was pointed there.
hoping this can be merged so i can retire my fork :)
one other note, i added ~2MB of test media so the integration tests can validate real audio/video perception - per-file sources and licenses are in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md, happy to switch to run-time fetches if you'd rather not carry media in-repo.
thanks!
AI disclosure
This contribution was developed with AI assistance: implemented and reviewed by Anthropic's Claude (Fable 5, Opus 4.8) with an independent review pass by OpenAI's GPT-5.5 (Codex CLI). All changes were human-directed and verified end-to-end on real hardware before submission.
Proposed changes
Adds the two missing Gemma 4 modalities to MLXVLM, plus a checkpoint-loading fix.

This builds directly on prior community work:
main's newer UserInput.Audio API, with the mel extractor and prompt splice corrected against the Python reference. gemma4 (credited in the file header). Thanks to all three authors.

Audio: Conformer/USM audio tower with a vDSP-based log-mel feature extractor (mel framing and filter bank validated against Python reference output; fixture-backed alignment tests included). Feature-extraction parameters (sampling_rate, num_mel_filters, fft_length, hop_length) and audio_seq_length are decoded from the checkpoint's processor_config.json rather than hardcoded. The prompt splice follows the reference <|audio>…<audio|> block format. LMInput.ProcessedAudio gains an optional padding mask (backward-compatible, defaulted). Token-count mismatches and multiple audio clips throw typed errors rather than crashing; one audio clip per request is supported for now.

Video: frame sampling via MediaProcessing with per-frame timestamped placeholder expansion. The frame budget comes from the config's video_max_frames (default 16) and can be tightened per request via a new UserInput.Processing.videoMaxFrames override (same pattern as the existing minPixels/maxPixels): memory-constrained callers pass a lower cap, and the engine stays platform-neutral. See the memory note at the end for why this exists.

Loading fix: Gemma 3n-style num_kv_shared_layers checkpoints. KV-shared tail layers are constructed without local K/V projections, and sanitize drops the redundant tail K/V weights that non-QAT exports ship (scoped strictly to the text backbone so vision/audio tower weights are untouched).

Tests: hermetic unit tests (sanitize scoping, processor config decode, mel alignment vs Python fixtures) run in MLXLMTests with no downloads; end-to-end audio/video inference tests live in IntegrationTesting following the existing convention. Adds ~2 MB of media fixtures (LibriSpeech CC-BY-4.0, Big Buck Bunny CC-BY-3.0, own CC0 TTS clips; provenance in Tests/MLXLMTests/Resources/FIXTURES_LICENSES.md).

Verified end-to-end in Local AI Cat (a shipping app that consumes this branch as its MLX engine) on iPhone 16 Pro Max, iPhone 17 Pro Max, and Apple Silicon macOS, across text / image / audio / video / combined prompts. The throughput figures are from those on-device runs (Gemma 4 E2B QAT 4-bit: ~29 tok/s on the 17 Pro Max, ~19 tok/s on the 16 Pro Max).

On scope: we kept audio and video in one PR because they share the prompt-splice and LMInput plumbing and were verified together (a single combined image + video + audio prompt is one of the test cases), but happy to split if you'd prefer; the shared-KV loading fix in particular is self-contained and could land separately first.

Checklist
pre-commit run --all-files to format my code / installed pre-commit prior to committing changes

Note on the iPhone memory ceiling (why videoMaxFrames exists)
We measured the per-app jetsam ceiling at a flat ~6144 MB on both an 8 GB iPhone 16 Pro Max and a 12 GB iPhone 17 Pro Max (with increased-memory-limit; the extra RAM only buys compute, not budget). With Gemma 4 E2B loaded (~4.4 GB) that leaves ~1.7 GB of prefill headroom: 8 video frames is robust under sustained thermal stress, while 16 frames jetsam-crashes when the device is hot, hence the per-request cap. At the debug-only +512 MB ceiling, 16 frames becomes robust on the same hardware, so the limit looks like OS policy rather than silicon. If there's any path for a foreground on-device-ML app to use more of a 12 GB device, we'd love to hear it; filing a Feedback as well.
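
To make the per-request cap concrete, here is a minimal sketch of how a request override might combine with the config value. The function name and the floor of one frame are assumptions; the PR only states that the config's video_max_frames (default 16) stays the upper bound and that callers can pass a lower UserInput.Processing.videoMaxFrames.

```swift
// Hypothetical sketch: effectiveVideoMaxFrames and the one-frame floor are assumptions.
// The source only states that the config value is the upper bound for any override.
func effectiveVideoMaxFrames(configMax: Int, requestOverride: Int?) -> Int {
    // No override: use the checkpoint's configured frame budget.
    guard let requested = requestOverride else { return configMax }
    // An override can tighten the budget but never exceed the config value.
    return max(1, min(requested, configMax))
}

// A memory-constrained iPhone caller asks for 8 frames; a Mac caller passes nothing.
let phoneFrames = effectiveVideoMaxFrames(configMax: 16, requestOverride: 8)
let macFrames = effectiveVideoMaxFrames(configMax: 16, requestOverride: nil)
print(phoneFrames, macFrames)
```

Keeping the clamp in one place means the engine needs no platform checks: the caller that knows its memory headroom decides the budget.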