
feat: add voice cloning support to TTS endpoint #492

Closed
rui8616 wants to merge 2 commits into jundot:main from rui8616:pr/voice-clone-upstream

rui8616 commented Mar 31, 2026

Summary

  • Add ref_audio (base64 data URI or URL) and ref_text parameters to the /v1/audio/speech endpoint
  • Pass voice cloning params to mlx-audio's native generate() ICL path instead of requiring a separate generate_voice_clone() method
  • Auto-truncate reference audio to a maximum of 15 s to prevent ICL context overflow, which produces noise

Test plan

  • 29 unit tests pass (including 7 voice-clone-specific tests)
  • Verified with the real Qwen3-TTS-12Hz-1.7B-Base model: voice cloning produces usable audio
  • Normal TTS (without ref_audio) is unaffected

Extend /v1/audio/speech to support Qwen3-TTS Base voice cloning by
adding ref_audio (base64 data URI or URL) and ref_text parameters.
When ref_audio is provided, passes them to model.generate() which
natively supports ICL voice cloning. Truncates reference audio to
15 seconds max to prevent ICL context overflow. Includes unit tests.
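
For illustration, a request body for the extended endpoint could be assembled roughly as follows. This is a minimal sketch, not the PR's code: only `ref_audio` and `ref_text` are named in the description above, while the `model` and `input` field names and the helper `build_voice_clone_request` are assumptions modeled on OpenAI-style speech requests.

```python
import base64


def build_voice_clone_request(wav_bytes, ref_text, text,
                              model="Qwen3-TTS-12Hz-1.7B-Base"):
    """Build a JSON-serializable body for POST /v1/audio/speech.

    Assumption: "model" and "input" follow the OpenAI-style schema;
    only "ref_audio" and "ref_text" come from the PR description.
    """
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "input": text,
        # Reference clip sent inline as a base64 data URI.
        "ref_audio": f"data:audio/wav;base64,{encoded}",
        # Transcript of the reference clip.
        "ref_text": ref_text,
    }
```

The resulting dict can be sent with any HTTP client as the JSON body of the request.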
jundot force-pushed the main branch 2 times, most recently from 2d46d30 to d0f5a38 on April 2, 2026 02:13
Contributor

ethannortharc left a comment

Thanks for picking this up — the approach of using mlx-audio's native generate(ref_audio=, ref_text=) ICL path is the right call. A few things I noticed:

SSRF risk in _decode_ref_audio

urllib.request.urlretrieve will fetch any URL the user provides, including internal network addresses (http://169.254.169.254/, http://localhost:...). If the server is only used locally this is low risk, but worth considering if anyone exposes it on a network. Could restrict to validated external URLs or drop URL support and only accept base64.
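
One way to restrict URL fetching, sketched with the standard library only. The helper name `is_public_http_url` is hypothetical and not part of the PR. It accepts only http(s) URLs whose host resolves exclusively to globally routable addresses. Note the assumption that the later fetch connects to the same address that was checked; a hostname that re-resolves between check and fetch (DNS rebinding) would need the connection pinned to the validated IP.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_http_url(url, resolve=socket.getaddrinfo):
    """Return True only for http(s) URLs resolving solely to global addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port, proto=socket.IPPROTO_TCP)
    except (OSError, UnicodeError, ValueError):
        # Resolution failure or malformed port: treat as unsafe.
        return False
    if not infos:
        return False
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        host = info[4][0].split("%", 1)[0]
        if not ipaddress.ip_address(host).is_global:
            # Loopback, private, link-local (e.g. 169.254.169.254), etc.
            return False
    return True
```

The `resolve` parameter exists only so the check can be exercised without real DNS lookups.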

Request body size limit

The audio file upload endpoints (/v1/audio/transcriptions, /v1/audio/process) have MAX_AUDIO_UPLOAD_BYTES protection via _read_upload(), but the JSON-based /v1/audio/speech endpoint has no equivalent body size limit. With ref_audio now carrying full audio files as base64, it might be worth adding a size check before base64.b64decode — something like rejecting payloads over a reasonable threshold. This isn't specific to your PR (the gap existed before), but since you're adding the first large-binary-in-JSON field, it would be a good place to address it.
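
A size check before decoding could look something like the sketch below. The helper `decode_ref_audio_data_uri` and the 10 MiB value of `MAX_REF_AUDIO_BYTES` are assumptions for illustration; the existing `MAX_AUDIO_UPLOAD_BYTES` constant could be reused instead. Since every 4 base64 characters encode at most 3 bytes, the encoded length gives an upper bound on the decoded size without allocating the decoded buffer.

```python
import base64

# Hypothetical limit; the project's MAX_AUDIO_UPLOAD_BYTES could be reused.
MAX_REF_AUDIO_BYTES = 10 * 1024 * 1024


def decode_ref_audio_data_uri(data_uri, max_bytes=MAX_REF_AUDIO_BYTES):
    """Decode a base64 data URI, rejecting oversized payloads before decoding."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("ref_audio must be a base64 data URI")
    # n bytes encode to 4 * ceil(n / 3) characters (with padding).
    max_encoded_len = 4 * ((max_bytes + 2) // 3)
    if len(payload) > max_encoded_len:
        raise ValueError("ref_audio exceeds the size limit")
    # validate=True rejects non-alphabet characters (binascii.Error is a ValueError).
    audio = base64.b64decode(payload, validate=True)
    if len(audio) > max_bytes:
        raise ValueError("ref_audio exceeds the size limit")
    return audio
```

This only bounds the `ref_audio` field; a cap on the overall request body would still need to be enforced at the server or framework level.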

15s truncation only works for WAV

_truncate_ref_audio catches wave.Error for non-WAV formats and skips truncation. If someone sends a long mp3/ogg as ref_audio, it would still overflow the ICL context. Could use mlx_audio.codec.load_audio (or similar) to handle more formats, or at least document the WAV-only limitation.
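
As a stopgap short of multi-format decoding, the truncation step could reject non-WAV input explicitly instead of silently skipping it. The sketch below uses only the standard-library `wave` module; `truncate_wav_bytes` is a hypothetical stand-in for `_truncate_ref_audio`, whose actual signature is not shown in this thread.

```python
import io
import wave


def truncate_wav_bytes(data, max_seconds=15.0):
    """Return WAV bytes cut to max_seconds; raise ValueError for non-WAV input."""
    try:
        src = wave.open(io.BytesIO(data), "rb")
    except (wave.Error, EOFError) as exc:
        # Fail loudly so long mp3/ogg clips cannot bypass the limit.
        raise ValueError("ref_audio is not a PCM WAV file") from exc
    with src:
        params = src.getparams()
        max_frames = int(max_seconds * params.framerate)
        if params.nframes <= max_frames:
            return data  # already short enough
        frames = src.readframes(max_frames)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        # The header's frame count is patched to match the data written.
        dst.writeframes(frames)
    return out.getvalue()
```

Rejecting unknown formats trades convenience for a guarantee that the 15-second limit is always enforced; decoding other formats first would remove that trade-off.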

Minor

  • ref_text or "" converts None to empty string — some models might treat these differently
  • Temp file suffix is always .wav regardless of actual format — probably fine since load_audio detects from content, but worth noting
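
On the first point, one option is to forward `ref_text` only when the caller supplied it, so that `None` and `""` stay distinguishable. This is a minimal sketch with a hypothetical helper, `build_clone_kwargs`; whether generate() behaves sensibly when `ref_text` is omitted is an assumption that would need checking against mlx-audio.

```python
def build_clone_kwargs(ref_audio_path=None, ref_text=None):
    """Collect voice-cloning keyword arguments, omitting values that were not given."""
    kwargs = {}
    if ref_audio_path is not None:
        kwargs["ref_audio"] = ref_audio_path
        # Keep "" (explicitly empty) distinct from None (not provided).
        if ref_text is not None:
            kwargs["ref_text"] = ref_text
    return kwargs
```

The result would then be splatted into the call, e.g. `model.generate(text, **build_clone_kwargs(path, ref_text))`.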

NetLops commented Apr 6, 2026

Hi! Thanks for the great implementation. I'm really excited to try out this voice cloning feature. Any updates on when this PR might be merged?

rui8616 closed this Apr 6, 2026