
CER Performance of Reconstructed Audio #34

Open
howitry opened this issue Sep 18, 2024 · 6 comments

Comments

@howitry

howitry commented Sep 18, 2024

With the 40 tokens/s configuration, the quality of the reconstructed audio is very good, but there are often some mispronunciations. Have you measured the CER of the reconstructed audio?

@jishengpeng
Owner

In our experiments, the CER and WER results were satisfactory. Could you kindly provide further experimental details, such as whether you used the WavTokenizer-small or WavTokenizer-medium version? Additionally, on which test set were the evaluations conducted? Please note that the WavTokenizer-small version has very limited generalization capability.

@howitry
Author

howitry commented Sep 18, 2024

I trained WavTokenizer on about 60,000 hours of data, with a 1:1 ratio of English to Chinese. I have trained for 3 epochs so far, and when checking the reconstructed Chinese audio, I found some incorrect pronunciations.
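
For illustration, here is a minimal sketch of how a 1:1 English/Chinese mix could be sampled in PyTorch. This is an assumption for clarity only, not WavTokenizer's actual data pipeline; the dataset objects are placeholders.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, WeightedRandomSampler, DataLoader

# Placeholder corpora standing in for the English and Chinese audio sets
# (hypothetical; a real pipeline would load waveform clips from disk).
en_ds = TensorDataset(torch.randn(1000, 24000))  # e.g. one-second clips at 24 kHz
zh_ds = TensorDataset(torch.randn(4000, 24000))  # corpus sizes need not match

mixed = ConcatDataset([en_ds, zh_ds])

# Weight each clip so English and Chinese are drawn with equal probability,
# regardless of how many clips each corpus contains.
weights = torch.cat([
    torch.full((len(en_ds),), 0.5 / len(en_ds)),
    torch.full((len(zh_ds),), 0.5 / len(zh_ds)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler)
```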

@jishengpeng
Owner

Training for only three epochs seems insufficient. Since the data is randomly sampled during training, a full pass through the dataset may not yet have been completed. Extending the training to 12-24 epochs could yield better results.
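
As a back-of-the-envelope check of that reasoning, assuming each "epoch" draws as many random clips (with replacement) as the corpus contains, which is an assumption about the sampler rather than a description of the actual training code:

```python
import math

# Expected fraction of clips never sampled after k epochs of uniform
# sampling with replacement is roughly e^(-k). Purely illustrative.
for k in (3, 12, 24):
    print(f"{k:>2} epochs -> ~{math.exp(-k):.4%} of clips never sampled")
```

Under that assumption, roughly 5% of clips would remain unseen after 3 epochs, versus a negligible fraction after 12-24.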

@YoungloLee

After reconstructing our own Korean speech data with the WavTokenizer-medium-speech-75token checkpoint and measuring the CER, we observed a significant drop in performance. Could you share the CER or WER comparison results you obtained?

In our experiment, we obtained the following results:

  • GroundTruth CER: 4.3978%
  • Reconstructed CER: 11.8222%
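
For context, a minimal sketch of how such a CER comparison might be scripted, assuming Whisper as the ASR model and jiwer for scoring; these tool choices and file names are assumptions, not necessarily what was used in this thread.

```python
import whisper
from jiwer import cer

asr = whisper.load_model("large-v3")

def transcribe(path: str) -> str:
    # language="ko" for Korean; drop it to let Whisper auto-detect.
    return asr.transcribe(path, language="ko")["text"].strip()

reference_text = "..."  # ground-truth transcript of the utterance
hyp_original = transcribe("original.wav")            # ground-truth audio
hyp_reconstructed = transcribe("reconstructed.wav")  # codec output

print("GroundTruth CER  :", cer(reference_text, hyp_original))
print("Reconstructed CER:", cer(reference_text, hyp_reconstructed))
```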

@jishengpeng
Owner

The WavTokenizer-medium-speech model was trained on very little Korean data, so this phenomenon is reasonable. You may consider testing the WER or CER on the English test set (LibriTTS test-clean). Additionally, retraining a version of WavTokenizer on Korean data is likely to yield significantly improved performance.
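
If it helps, one way to pull LibriTTS test-clean for such an evaluation is via torchaudio's built-in dataset class; this is a sketch of the evaluation loop, not the script used for the reported results.

```python
# Iterate LibriTTS test-clean for a WER/CER evaluation
# (assumes torchaudio is installed; the split is downloaded on first use).
from torchaudio.datasets import LIBRITTS

dataset = LIBRITTS(root="./data", url="test-clean", download=True)

for waveform, sample_rate, _original_text, normalized_text, *_ in dataset:
    # 1) reconstruct `waveform` with the codec under test,
    # 2) run ASR on the reconstruction,
    # 3) score the ASR hypothesis against `normalized_text`.
    pass
```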

@hertz-pj

@YoungloLee @jishengpeng Could you please share the loss curves for your model trained with 60,000 and 80,000 hours of data?
