Has anyone trained this model and used it to train an LLM-based TTS? How is the performance?
I mean the performance in terms of waveform quality, as well as the performance in zero-shot TTS.
Liujingxiu23 changed the title from "performace in LLM-based-TTS" to "Performance in LLM-based-TTS" on Sep 26, 2024
We found that, under fair comparison conditions, the speech synthesis quality of a single-layer WavTokenizer outperforms that of the 9-layer DAC in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.
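For concreteness, here is a minimal sketch of the kind of downstream setup this comparison refers to: speech is tokenized into a single WavTokenizer code stream, an autoregressive LM predicts those codes after the text tokens, and the generated codes are decoded back to a waveform. The encode_infer / codes_to_features / decode calls follow my reading of the repo README; the text/audio token layout, file paths, and the ar_lm step are illustrative placeholders, not the authors' recipe.

```python
# Rough sketch: WavTokenizer codes as the audio vocabulary of an AR TTS LM.
# Codec calls follow the repo README (please correct me if signatures differ);
# everything around the LM is a placeholder.
import torch
import torchaudio
from encoder.utils import convert_audio
from decoder.pretrained import WavTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = WavTokenizer.from_pretrained0802("configs/wavtokenizer.yaml",
                                             "wavtokenizer.ckpt").to(device)

# 1) Speech -> single-layer discrete codes (one token stream per utterance).
wav, sr = torchaudio.load("train_utt.wav")          # hypothetical training clip
wav = convert_audio(wav, sr, 24000, 1).to(device)   # resample to 24 kHz mono
bandwidth_id = torch.tensor([0], device=device)
_, codes = tokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)

# 2) Build an AR training sequence [text ids ; offset audio ids] (illustrative).
text_ids = torch.tensor([[12, 87, 301, 5]], device=device)  # placeholder phone/BPE ids
audio_ids = codes.reshape(1, -1) + 1024                      # shift codes past the text vocab
sequence = torch.cat([text_ids, audio_ids], dim=1)
# loss = cross_entropy(ar_lm(sequence[:, :-1]), sequence[:, 1:])  # hypothetical LM step

# 3) Inference: LM-generated ids -> codec codes -> waveform.
#    Here the ground-truth `codes` stand in for sampled LM output.
features = tokenizer.codes_to_features(codes)
audio = tokenizer.decode(features, bandwidth_id=bandwidth_id)
torchaudio.save("out.wav", audio.cpu(), 24000, encoding="PCM_S", bits_per_sample=16)
```

With a single quantizer layer there is only one code stream per frame, so the LM can do plain next-token prediction without the multi-codebook flattening or delay patterns needed for a 9-layer codec like DAC.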
But I also ran into the same problem as #34: there are mispronunciations in the reconstructed waveform, where phones can sound like similar phones. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this problem. Do you have any other ideas for solving it?
And do you think HiFi-GAN might be a better model for the decoder part?