feat: add voice cloning support to TTS endpoint #492
rui8616 wants to merge 2 commits into jundot:main
Extend /v1/audio/speech to support Qwen3-TTS Base voice cloning by adding ref_audio (base64 data URI or URL) and ref_text parameters. When ref_audio is provided, passes them to model.generate() which natively supports ICL voice cloning. Truncates reference audio to 15 seconds max to prevent ICL context overflow. Includes unit tests.
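For illustration, a request body carrying an inline reference clip might be assembled as in the sketch below. Only `ref_audio` and `ref_text` come from this PR; the `model` and `input` field names and the model id are assumptions modeled on OpenAI-style speech endpoints and may not match the server's actual schema.

```python
import base64
import json


def build_speech_request(text: str, ref_wav_bytes: bytes, ref_text: str) -> str:
    """Build a JSON body for /v1/audio/speech with an inline reference clip."""
    # Encode the reference clip as a base64 data URI, one of the two accepted forms.
    encoded = base64.b64encode(ref_wav_bytes).decode("ascii")
    payload = {
        "model": "qwen3-tts-base",  # hypothetical model id, not taken from the PR
        "input": text,              # assumed OpenAI-style field name
        "ref_audio": f"data:audio/wav;base64,{encoded}",
        "ref_text": ref_text,       # transcript of the reference clip
    }
    return json.dumps(payload)
```

The resulting string would be sent as the POST body with `Content-Type: application/json`.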
2d46d30 to d0f5a38
ethannortharc left a comment
Thanks for picking this up — the approach of using mlx-audio's native generate(ref_audio=, ref_text=) ICL path is the right call. A few things I noticed:
SSRF risk in _decode_ref_audio
urllib.request.urlretrieve will fetch any URL the user provides, including internal network addresses (http://169.254.169.254/, http://localhost:...). If the server is only used locally this is low risk, but worth considering if anyone exposes it on a network. Could restrict to validated external URLs or drop URL support and only accept base64.
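One way to restrict URL fetching to external hosts is sketched below. The helper name `is_public_http_url` is hypothetical and not part of the PR; it uses only the standard library. It rejects non-HTTP schemes and any host that resolves to a loopback, link-local, private, or otherwise non-global address. This is a minimal sketch: it does not defend against DNS rebinding, because the later fetch resolves the hostname a second time.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_http_url(url: str) -> bool:
    """Return True only if url is http(s) and every resolved address is global."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # parsed.port raises ValueError on a malformed port, so keep it in the try.
        infos = socket.getaddrinfo(
            parsed.hostname, parsed.port or 80, proto=socket.IPPROTO_TCP
        )
    except (socket.gaierror, ValueError):
        return False
    for info in infos:
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing the address.
        address = info[4][0].split("%", 1)[0]
        if not ipaddress.ip_address(address).is_global:
            return False
    return bool(infos)
```

A caller would run this check before fetching `ref_audio` and return a 4xx response when it fails. Dropping URL support entirely, as suggested above, remains the simpler option.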
Request body size limit
The audio file upload endpoints (/v1/audio/transcriptions, /v1/audio/process) have MAX_AUDIO_UPLOAD_BYTES protection via _read_upload(), but the JSON-based /v1/audio/speech endpoint has no equivalent body size limit. With ref_audio now carrying full audio files as base64, it might be worth adding a size check before base64.b64decode — something like rejecting payloads over a reasonable threshold. This isn't specific to your PR (the gap existed before), but since you're adding the first large-binary-in-JSON field, it would be a good place to address it.
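A size check before decoding could look like the following sketch. The function name and the 10 MiB threshold are assumptions for illustration; the real limit would presumably reuse or mirror `MAX_AUDIO_UPLOAD_BYTES`. The encoded length is checked first so that an oversized payload is rejected without allocating the decoded buffer, and the exact decoded size is checked afterwards.

```python
import base64
import binascii

MAX_REF_AUDIO_BYTES = 10 * 1024 * 1024  # hypothetical threshold, not from the PR


def decode_ref_audio_data_uri(data_uri: str, max_bytes: int = MAX_REF_AUDIO_BYTES) -> bytes:
    """Decode a base64 data URI, rejecting payloads larger than max_bytes."""
    header, sep, encoded = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("ref_audio must be a base64 data URI")
    # Base64 maps every 3 bytes to 4 characters, so max_bytes of audio
    # never needs more than 4 * ceil(max_bytes / 3) encoded characters.
    if len(encoded) > 4 * ((max_bytes + 2) // 3):
        raise ValueError("ref_audio exceeds the size limit")
    try:
        decoded = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("ref_audio is not valid base64") from exc
    # The pre-check is an upper bound, so confirm the exact decoded size too.
    if len(decoded) > max_bytes:
        raise ValueError("ref_audio exceeds the size limit")
    return decoded
```

The endpoint would translate the `ValueError` into an HTTP 400 or 413 response.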
15s truncation only works for WAV
_truncate_ref_audio catches wave.Error for non-WAV formats and skips truncation. If someone sends a long mp3/ogg as ref_audio, it would still overflow the ICL context. Could use mlx_audio.codec.load_audio (or similar) to handle more formats, or at least document the WAV-only limitation.
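To make the WAV-only behaviour concrete, here is a standard-library sketch of frame-based truncation. It is not the PR's `_truncate_ref_audio`; the name `truncate_wav` is hypothetical. Unlike the behaviour described above, it lets `wave.Error` propagate for non-WAV input, so the caller can reject the clip or route it through a multi-format loader instead of silently skipping truncation.

```python
import io
import wave

MAX_REF_SECONDS = 15.0  # matches the 15-second cap described in the PR


def truncate_wav(wav_bytes: bytes, max_seconds: float = MAX_REF_SECONDS) -> bytes:
    """Trim a WAV clip to max_seconds; raises wave.Error if the input is not WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        params = reader.getparams()
        max_frames = int(max_seconds * params.framerate)
        if params.nframes <= max_frames:
            return wav_bytes  # already short enough, return unchanged
        frames = reader.readframes(max_frames)
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(params.nchannels)
        writer.setsampwidth(params.sampwidth)
        writer.setframerate(params.framerate)
        writer.writeframes(frames)  # header sizes are patched when the writer closes
    return out.getvalue()
```

Handling mp3 or ogg would still require decoding through an audio library, which this sketch deliberately leaves out.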
Minor
- `ref_text or ""` converts `None` to an empty string; some models might treat these differently
- Temp file suffix is always `.wav` regardless of actual format; probably fine since `load_audio` detects from content, but worth noting
Hi! Thanks for the great implementation. I'm really excited to try out this voice cloning feature. Any updates on when this PR might be merged?
Summary
- Add `ref_audio` (base64 data URI / URL) and `ref_text` parameters to `/v1/audio/speech` endpoint
- Use the native `generate()` ICL path instead of requiring a separate `generate_voice_clone()` method
Test plan