🚀 Feature Description

Hi @eginhard. Hope you are keeping well!

It's erew123 from AllTalk.

Someone has pointed this out to me: https://www.astramind.ai/post/auralis and I think this is the GitHub repo: https://github.com/astramind-ai/Auralis

It's a little beyond my pay grade, but maybe it's of interest to the Coqui scripts. I don't know if you have seen this, or if the author is posting on here with you, but I thought you might like to see it.
I fired their document into an AI for a quick "here is what they claim" summary:
The author claims to have optimized XTTS-v2, a text-to-speech model, making it faster, more resource-efficient, asynchronous, and safer for production environments. Here are the key points and the performance gains:
**What They Did**

**Understanding the Code and Challenges**
- Overcame a lack of prior experience in audio tech.
- Debugged and worked around outdated dependencies and repos.
**Tokenizer Optimization**
- Replaced the custom tokenizer with a Hugging Face-compatible `FastPreTrainedTokenizer`.
- Improved the token-splitting logic to maintain audio quality while handling memory-efficient truncation (rough sketch below).
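To make the tokenizer claim concrete, here's a minimal sketch of what a Hugging Face fast-tokenizer swap generally looks like. This is not Auralis code: the file path, special tokens, and token budget are placeholders, and the standard HF class is actually named `PreTrainedTokenizerFast`.

```python
# Generic sketch of an HF fast-tokenizer swap; not taken from Auralis.
from transformers import PreTrainedTokenizerFast

# "tokenizer.json" is a placeholder path to a serialized `tokenizers` model.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="[START]",  # special tokens here are assumptions, not XTTS's real ones
    eos_token="[STOP]",
    pad_token="[PAD]",
)

# Truncation bounds memory instead of letting it grow with input length.
batch = tokenizer(
    ["A long passage of text to synthesise..."],
    truncation=True,
    max_length=400,  # the real XTTS token budget may differ
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```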
**Model Reorganization**
- Refactored the original architecture, which used GPT-2-like models and a HiFi-GAN vocoder, to eliminate unnecessary computations during inference.
- Optimized the HiFi-GAN component to use in-place operations, drastically reducing memory usage (illustrated below).
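For the in-place memory point, a toy PyTorch example (my own illustration, not their HiFi-GAN code): in a residual block, the skip connection can accumulate into existing storage instead of allocating fresh tensors, which adds up quickly at vocoder sample rates.

```python
import torch
import torch.nn.functional as F

conv = torch.nn.Conv1d(4, 4, kernel_size=3, padding=1)
x = torch.randn(1, 4, 22050)  # ~1 s of activations at a vocoder-like rate

def res_block(x):
    # out-of-place baseline: both the activation and the add allocate new tensors
    return x + conv(F.leaky_relu(x, 0.1))

@torch.inference_mode()
def res_block_inplace(x):
    # inference-only: safe to reuse x's storage for the skip connection
    y = conv(F.leaky_relu(x, 0.1))  # activation stays out-of-place, x still feeds the skip
    return x.add_(y)                # accumulate into x instead of allocating another tensor

print(res_block_inplace(x).shape)  # torch.Size([1, 4, 22050])
```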
**Integration of vLLM for GPT-2**
- Overcame challenges in adapting vLLM for multimodal GPT-2, including token cache management and continuous batching.
- Worked around vLLM's limitations on repetition penalties and hidden-state collection, customizing its behavior for audio-specific tasks (baseline usage shown below).
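For context on the vLLM piece, this is what stock vLLM usage with a repetition penalty looks like. To be clear, this is the plain text-only API with `gpt2` as a stand-in model; the multimodal conditioning and hidden-state collection the author describes required custom engine changes that a snippet like this does not capture.

```python
# Baseline vLLM usage; NOT the Auralis multimodal adaptation.
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")  # vLLM provides the paged KV cache + continuous batching
params = SamplingParams(
    temperature=0.75,
    repetition_penalty=1.3,  # the write-up says this knob's semantics needed customizing for audio
    max_tokens=256,
)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```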
**Asynchronous Execution**
- Made components non-blocking using `asyncio` (see the sketch below).
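The asyncio point is the standard pattern of pushing a blocking model call off the event loop so a server keeps accepting requests while audio renders. A minimal sketch, where `synthesize` is a hypothetical stand-in for the heavy TTS call:

```python
import asyncio
import time

def synthesize(text: str) -> bytes:
    time.sleep(0.5)  # stands in for the blocking tokenize -> GPT-2 -> HiFi-GAN pass
    return b"fake-wav-bytes"

async def handle_request(text: str) -> bytes:
    # to_thread (Python 3.9+) keeps the event loop free while the model runs
    return await asyncio.to_thread(synthesize, text)

async def main():
    # four requests overlap instead of queueing behind one blocking call
    results = await asyncio.gather(*(handle_request(f"line {i}") for i in range(4)))
    print([len(r) for r in results])

asyncio.run(main())
```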
**Optimized Workflow**
- Avoided redundant token and embedding calculations during iterative decoding (toy caching example below).
- Adapted position-ID tracking to align with unique conditioning inputs for multimodal tasks.
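And the "avoid redundant calculations" idea, reduced to a toy example (again my own sketch, not their code): cache the per-voice conditioning work so repeated requests for the same reference voice skip the expensive encoder pass.

```python
from functools import lru_cache
import torch

@lru_cache(maxsize=32)
def conditioning_latents(voice_path: str) -> torch.Tensor:
    # stands in for the expensive conditioning-encoder forward pass
    print(f"computing latents for {voice_path}")
    return torch.randn(1, 32, 1024)  # shape is made up for illustration

conditioning_latents("alice.wav")  # computed once
conditioning_latents("alice.wav")  # cache hit: no recompute
```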
**Performance Gains**
**Speed**
- Leveraging vLLM and deduplicating computations significantly reduced inference time.
**Resource Efficiency**
- Memory consumption was slashed by optimizing HiFi-GAN for inference.
- Reduced overhead by restructuring the GPT-2 and conditioning modules.
**Production Suitability**
- Ensured asynchronous, non-blocking execution for smoother integration into UI frameworks like Pulsar.
- Increased safety by moving from `.pth` to safer formats and handling positional encoding appropriately (see the loading sketch below).
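On the `.pth` safety point, the usual comparison (generic, not their code): `torch.load` on a pickle-based checkpoint can execute arbitrary code unless restricted, while a safetensors file is a plain tensor container.

```python
import torch
from safetensors.torch import load_file, save_file

model = torch.nn.Linear(4, 4)

# .pth checkpoints are pickles; torch.load can run arbitrary code unless
# weights_only=True (older torch versions default to full unpickling).
save_file(model.state_dict(), "model.safetensors")

state = load_file("model.safetensors")  # pure tensor data, no code execution
model.load_state_dict(state)
print("loaded:", list(state))
```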
**Accessibility**
- Made the enhancements available to the open-source community for broader adoption.
The overall result is a production-ready, optimized XTTS-v2 that is significantly faster and more memory-efficient, with asynchronous capabilities enabling smoother integration into applications.
Thanks Erew123
@eginhard FYI, their requirements bump PyTorch to 2.5.1. I have no idea if they are actually using something from that version of PyTorch; just so you are aware. Thought it interesting though!
Thanks
*eginhard changed the title "[Feature request]" → "[Feature request] Auralis optimisations of XTTS" on Dec 6, 2024*