(wip) support ultravox audio input #12745

ngxson · 2025-04-03T22:57:50Z

Current status: inference runs, but the output is gibberish

Original model: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_2-1b

Why am I doing this?

Because ultravox seems to be fairly low-hanging fruit:

It uses the whisper encoder, so a lot of the code in this PR is copied from whisper.cpp
It uses a 2-matrix MLP to project audio embeddings to text embeddings --> vision models already do this
It uses the vanilla llama 3.2 1B model without any fine-tuning
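
To illustrate the projection step mentioned above, here is a minimal sketch of a 2-matrix MLP that maps audio embeddings to the text embedding size. This is not the code from this PR: the function names, the tiny dimensions, the random weights, and the GELU activation are all assumptions chosen for illustration, and the real Ultravox projector may include extra steps (for example frame stacking or normalization) that are omitted here.

```python
import math
import random

def matmul(x, w):
    """Multiply x (n x d_in) by w (d_in x d_out) using plain lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def gelu(v):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def project_audio_to_text(audio_embd, w1, w2):
    """Hypothetical projector: audio_embd @ w1 -> activation -> @ w2."""
    hidden = [[gelu(v) for v in row] for row in matmul(audio_embd, w1)]
    return matmul(hidden, w2)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, not the real model dimensions
rng = random.Random(0)
n_frames, n_audio_embd, n_hidden, n_text_embd = 3, 4, 8, 6

audio_embd = random_matrix(n_frames, n_audio_embd, rng)  # stand-in for encoder output
w1 = random_matrix(n_audio_embd, n_hidden, rng)          # first projection matrix
w2 = random_matrix(n_hidden, n_text_embd, rng)           # second projection matrix

text_embd = project_audio_to_text(audio_embd, w1, w2)
print(len(text_embd), len(text_embd[0]))  # 3 6
```

Each audio frame comes out as one vector of the text embedding size, which is the same idea the vision projectors use for image patches.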

Applications of this can be quite useful. Take, for example, an app that summarizes a meeting from audio:

The traditional audio processing pipeline is: audio --> text --> summary. Many acoustic features are lost in the audio --> text step
With multimodal input, the pipeline becomes: audio --> summary. This has a lot less latency and retains all audio features, including pauses, music, tone, pitch, etc

johnbenac · 2025-04-06T02:52:49Z

Do you think that this will work with the CSM GGUF pull request that you already implemented?

#12648

That was missing the encoder. I'm starting to work on the encoder (even though I really have no idea what I'm doing), but was this pull request an attempt to get an encoder suitable for the GGUF CSM model?

ngxson added 2 commits April 3, 2025 16:11

(wip) convert ultravox-enc to gguf

62695aa

output but wrong

d44c721

github-actions bot added the examples and python (python script changes) labels Apr 3, 2025

ngxson mentioned this pull request Apr 4, 2025

llama : add llama_batch_ext #11875

Open

add conv layer

49193e2
