I tried to run the VibeVoice 1.5B multi-speaker model on a 12 GB GPU with a script whose generated audio exceeds 10 minutes.
Expected behavior:
- The model generates audio for the entire script.
Actual behavior:
- The process crashes with a CUDA Out of Memory (OOM) error after a few minutes.
Steps to reproduce:
- Clone the VibeVoice repository.
- Install dependencies as per the instructions.
- Run inference with the 1.5B multi-speaker model on a script longer than 10 minutes.
Environment:
- OS: Ubuntu 22.04
- Python: 3.11
- CUDA: 12.1
- GPU: NVIDIA RTX 3060 12GB
- VibeVoice model: 1.5B multi-speaker
Additional notes:
- Reducing the script length allows the inference to succeed.
- Suggestion: consider memory optimizations for long-form audio generation, e.g. generating the script in chunks and releasing cached GPU memory between chunks.
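For illustration, the chunked-generation idea could look roughly like the sketch below. `generate_audio` is a placeholder callable standing in for the actual VibeVoice inference call, not its real API; chunk size and the memory-release step are assumptions.

```python
def split_script(script: str, max_lines: int = 20) -> list[str]:
    """Split a multi-speaker script into chunks of at most max_lines
    non-empty lines, so each inference call stays within GPU memory."""
    lines = [ln for ln in script.splitlines() if ln.strip()]
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def generate_long_form(script: str, generate_audio, max_lines: int = 20) -> list:
    """Run inference chunk by chunk instead of in one pass.

    generate_audio is a stand-in for the real model call. In a real run
    you would also release cached GPU memory between chunks, e.g. with
    torch.cuda.empty_cache(), before starting the next one.
    """
    outputs = []
    for chunk in split_script(script, max_lines):
        outputs.append(generate_audio(chunk))
    return outputs
```

The chunk outputs would then be concatenated into the final audio; exact crossfade/joining logic is out of scope here.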