(wip) support ultravox audio input #12745

Draft
wants to merge 3 commits into
base: master

ngxson
Collaborator

@ngxson ngxson commented Apr 3, 2025

Current status: inference runs, but the output is gibberish

Original model: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_2-1b

Why am I doing this?

Because ultravox seems to be fairly low-hanging fruit:

  • It uses the whisper encoder, so a lot of the code in this PR is copied from whisper.cpp
  • It uses a 2-matrix MLP to project from audio embd to text embd, which vision models already do
  • It uses the vanilla llama 3.2 1B model without any fine-tuning
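
To illustrate the second point, here is a rough sketch of what a 2-matrix projector does: each audio encoder frame is multiplied by a first matrix, passed through an activation, then multiplied by a second matrix so it ends up at the text embedding size. This is not the code in this PR. The names (`project_audio`, `w1`, `w2`), the toy sizes, the random weights, and the GELU activation are all assumptions made for illustration; the real projector may use a different activation and extra steps.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gelu(v):
    """tanh approximation of GELU (the activation choice is an assumption)."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * a * (1.0 + math.tanh(c * (a + 0.044715 * a ** 3))) for a in v]


def project_audio(frames, w1, w2):
    """Map each audio-encoder frame to the text embedding size: w2 * act(w1 * x)."""
    return [matvec(w2, gelu(matvec(w1, f))) for f in frames]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained projector weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes, far smaller than any real model
N_AUDIO_EMBD = 8   # size of one audio encoder output frame
N_HIDDEN = 16      # projector hidden size
N_TEXT_EMBD = 4    # text model embedding size

rng = random.Random(0)
w1 = random_matrix(N_HIDDEN, N_AUDIO_EMBD, rng)
w2 = random_matrix(N_TEXT_EMBD, N_HIDDEN, rng)

# Three fake audio frames standing in for encoder output
frames = [[rng.uniform(-1.0, 1.0) for _ in range(N_AUDIO_EMBD)] for _ in range(3)]
projected = project_audio(frames, w1, w2)

print(len(projected), len(projected[0]))  # 3 4
```

The projected vectors have the same width as text token embeddings, so they can be placed in the input sequence of the text model alongside ordinary tokens, the same way vision projectors feed image embeddings.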

Applications of this can be quite useful. Take, for example, an app that summarizes a meeting from its audio:

  • The traditional audio processing pipeline is: audio --> text --> summary. Many acoustic features are lost in the audio --> text step
  • With multimodal input, the pipeline becomes: audio --> summary, with much less latency, and all audio features are retained, including pauses, music, tone, pitch, etc.

@github-actions github-actions bot added examples python python script changes labels Apr 3, 2025
@johnbenac

Do you think that this will work with the CSM GGUF pull request that you already implemented?

#12648

That was missing the encoder. I'm starting to work on the encoder (even though I really have no idea what I'm doing) but was this pull request an attempt to get encoding suitable for the GGUF CSM model?
