feat: add voice cloning support to TTS endpoint by rui8616 · Pull Request #492 · jundot/omlx

rui8616 · 2026-03-31T07:12:41Z

Summary

Add ref_audio (base64 data URI or URL) and ref_text parameters to the /v1/audio/speech endpoint
Pass the voice cloning parameters to mlx-audio's native generate() ICL path instead of requiring a separate generate_voice_clone() method
Auto-truncate reference audio to a maximum of 15 s to prevent ICL context overflow, which produces noise

Test plan

29 unit tests pass (including 7 voice-clone-specific tests)
Verified with the real Qwen3-TTS-12Hz-1.7B-Base model: voice cloning produces usable audio
Normal TTS (without ref_audio) unaffected

Extend /v1/audio/speech to support Qwen3-TTS Base voice cloning by adding ref_audio (base64 data URI or URL) and ref_text parameters. When ref_audio is provided, passes them to model.generate() which natively supports ICL voice cloning. Truncates reference audio to 15 seconds max to prevent ICL context overflow. Includes unit tests.
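For illustration, a client-side sketch of how a request body carrying a base64 data URI might be assembled. Only ref_audio and ref_text come from this PR; the "model" and "input" field names and the audio/wav MIME type are assumptions based on OpenAI-style speech requests, and build_speech_request is a hypothetical helper, not part of omlx.

```python
import base64


def build_speech_request(text, ref_wav_bytes, ref_text,
                         model="Qwen3-TTS-12Hz-1.7B-Base"):
    """Build a JSON-serializable body for POST /v1/audio/speech.

    "model" and "input" are assumed field names; "ref_audio" and
    "ref_text" are the parameters described in this PR.
    """
    encoded = base64.b64encode(ref_wav_bytes).decode("ascii")
    return {
        "model": model,
        "input": text,
        # Reference audio as a base64 data URI (MIME type assumed).
        "ref_audio": f"data:audio/wav;base64,{encoded}",
        # Transcript of the reference audio.
        "ref_text": ref_text,
    }
```

The resulting dict can be passed to json.dumps() and sent with any HTTP client.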

ethannortharc

Thanks for picking this up — the approach of using mlx-audio's native generate(ref_audio=, ref_text=) ICL path is the right call. A few things I noticed:

SSRF risk in `_decode_ref_audio`

urllib.request.urlretrieve will fetch any URL the user provides, including internal network addresses (http://169.254.169.254/, http://localhost:...). If the server is only used locally this is low risk, but worth considering if anyone exposes it on a network. Could restrict to validated external URLs or drop URL support and only accept base64.
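A minimal sketch of what "validated external URLs" could look like, using only the standard library. is_public_http_url is a hypothetical helper, not code from this PR. It resolves the host and rejects anything that maps to a non-global address (loopback, link-local, private ranges).

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_http_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host resolves to global IPs."""
    try:
        parts = urlsplit(url)
        port = parts.port or 80
    except ValueError:
        return False  # malformed URL or port
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, port,
                                   proto=socket.IPPROTO_TCP)
    except (socket.gaierror, UnicodeError):
        return False
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        host = info[4][0].split("%", 1)[0]
        if not ipaddress.ip_address(host).is_global:
            return False
    return bool(infos)
```

This check alone does not cover redirects or DNS rebinding between validation and fetch, so dropping URL support and accepting only base64 remains the simpler option.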

Request body size limit

The audio file upload endpoints (/v1/audio/transcriptions, /v1/audio/process) have MAX_AUDIO_UPLOAD_BYTES protection via _read_upload(), but the JSON-based /v1/audio/speech endpoint has no equivalent body size limit. With ref_audio now carrying full audio files as base64, it might be worth adding a size check before base64.b64decode — something like rejecting payloads over a reasonable threshold. This isn't specific to your PR (the gap existed before), but since you're adding the first large-binary-in-JSON field, it would be a good place to address it.
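A sketch of the size check suggested above, applied before base64.b64decode. MAX_REF_AUDIO_BYTES and decode_ref_audio_data_uri are hypothetical names, and the 10 MiB threshold is an arbitrary placeholder, not a value from the codebase.

```python
import base64
import binascii

MAX_REF_AUDIO_BYTES = 10 * 1024 * 1024  # hypothetical threshold


def decode_ref_audio_data_uri(data_uri, max_bytes=MAX_REF_AUDIO_BYTES):
    """Decode a base64 data URI, rejecting oversized payloads up front."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") \
            or not header.endswith(";base64"):
        raise ValueError("ref_audio must be a base64 data URI")
    # Base64 maps 3 bytes to 4 characters, so this upper-bounds the
    # decoded size without allocating it (conservative by a few bytes).
    if (len(payload) * 3) // 4 > max_bytes:
        raise ValueError("ref_audio exceeds the size limit")
    try:
        return base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError("ref_audio is not valid base64") from exc
```

Note that the JSON body has already been read into memory by the time this runs, so a request-level body limit would still be needed to bound memory use.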

15s truncation only works for WAV

_truncate_ref_audio catches wave.Error for non-WAV formats and skips truncation. If someone sends a long mp3/ogg as ref_audio, it would still overflow the ICL context. Could use mlx_audio.codec.load_audio (or similar) to handle more formats, or at least document the WAV-only limitation.
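To make the WAV-only limitation concrete, here is a standard-library sketch of truncation with the wave module. It is not the PR's _truncate_ref_audio; truncate_wav_bytes is a hypothetical helper. It raises on non-WAV input instead of silently skipping truncation, so the caller can reject the request or fall back to a decoder that handles more formats.

```python
import io
import wave

MAX_REF_SECONDS = 15.0


def truncate_wav_bytes(data: bytes,
                       max_seconds: float = MAX_REF_SECONDS) -> bytes:
    """Return WAV bytes cut to at most max_seconds of audio."""
    try:
        with wave.open(io.BytesIO(data), "rb") as src:
            params = src.getparams()
            max_frames = int(max_seconds * params.framerate)
            if params.nframes <= max_frames:
                return data  # already short enough
            frames = src.readframes(max_frames)
    except (wave.Error, EOFError) as exc:
        # mp3/ogg/etc. end up here: the length limit cannot be enforced.
        raise ValueError("ref_audio is not a PCM WAV file") from exc
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        # The frame count in the header is corrected when dst is closed.
        dst.writeframes(frames)
    return out.getvalue()
```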

Minor

ref_text or "" converts None to empty string — some models might treat these differently
Temp file suffix is always .wav regardless of actual format — probably fine since load_audio detects from content, but worth noting
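One way to avoid the None versus empty string ambiguity is to forward ref_text only when the caller supplied it. A small sketch, assuming generate() accepts ref_audio and ref_text as keyword arguments (as in the generate(ref_audio=, ref_text=) call mentioned above); whether it tolerates a missing ref_text is not verified here, and build_clone_kwargs is a hypothetical helper.

```python
def build_clone_kwargs(ref_audio_path, ref_text):
    """Collect voice cloning kwargs, omitting values the caller left unset."""
    kwargs = {}
    if ref_audio_path is not None:
        kwargs["ref_audio"] = ref_audio_path
        # Keep None and "" distinct: only pass ref_text if it was provided.
        if ref_text is not None:
            kwargs["ref_text"] = ref_text
    return kwargs
```

The result would then be splatted into the call, e.g. model.generate(text, **build_clone_kwargs(path, ref_text)).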

NetLops · 2026-04-06T08:58:16Z

Hi! Thanks for the great implementation. I'm really excited to try out this voice cloning feature. Any updates on when this PR might be merged?

jundot force-pushed the main branch 2 times, most recently from 2d46d30 to d0f5a38 · April 2, 2026 02:13

ethannortharc reviewed Apr 2, 2026


jundot mentioned this pull request Apr 4, 2026

Qwen3 TTS voice cloning is not working #566 (Open)

Merge branch 'main' into pr/voice-clone-upstream

c3483fb

rui8616 closed this Apr 6, 2026

rui8616 wants to merge 2 commits into jundot:main from rui8616:pr/voice-clone-upstream
