worse performance of large model compared to small model? #54

Open
XiaoshanHsj opened this issue Nov 15, 2024 · 4 comments

@XiaoshanHsj

Thank you for doing such great work and open-sourcing it.

I use the large model (WavTokenizer-large-320-24k-4096) to reconstruct audio of LibriTTS.
However, the results are worse than those reported in the paper, which were obtained with the small model.

The results are:
UTMOS_raw 19604.11721920967 4.056303997353543
UTMOS_encodec 19604.11721920967 3.8397375189096272
PESQ: 9956.64894938469 2.060138412866685
F1_score: 4432.935466635334 0.917602042358794 2
STOI: 0.8924008398453133

While in the paper, they are:
UTMOS_encodec 4.0486
PESQ 2.3730
STOI 0.9139

Is it expected for the performance to degrade?

Thanks~
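For context, the two numbers printed for each metric above appear to be a running sum over utterances followed by the mean. Below is a minimal sketch of how such per-utterance PESQ/STOI averages can be computed; the directory paths, file layout, and 16 kHz resampling are assumptions for illustration, not the repo's official evaluation script (it relies on the `pesq` and `pystoi` packages).

```python
# Sketch of a PESQ/STOI evaluation loop over reconstructed LibriTTS files.
# Paths and layout are placeholders; adjust to your own output structure.
import glob
import os

import torchaudio
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

REF_DIR = "LibriTTS/test-clean"         # assumed location of reference wavs
REC_DIR = "outputs/wavtokenizer_large"  # assumed location of reconstructions
EVAL_SR = 16000                         # PESQ wideband mode requires 16 kHz input

pesq_scores, stoi_scores = [], []
for rec_path in sorted(glob.glob(os.path.join(REC_DIR, "**/*.wav"), recursive=True)):
    ref_path = os.path.join(REF_DIR, os.path.relpath(rec_path, REC_DIR))
    ref, sr_ref = torchaudio.load(ref_path)
    rec, sr_rec = torchaudio.load(rec_path)
    # Resample both signals to a common rate before scoring.
    ref = torchaudio.functional.resample(ref, sr_ref, EVAL_SR)[0].numpy()
    rec = torchaudio.functional.resample(rec, sr_rec, EVAL_SR)[0].numpy()
    # Codec output length can differ slightly from the reference; trim to match.
    n = min(len(ref), len(rec))
    ref, rec = ref[:n], rec[:n]
    pesq_scores.append(pesq(EVAL_SR, ref, rec, "wb"))
    stoi_scores.append(stoi(ref, rec, EVAL_SR, extended=False))

print("PESQ:", sum(pesq_scores), sum(pesq_scores) / len(pesq_scores))
print("STOI:", sum(stoi_scores) / len(stoi_scores))
```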

@XiaoshanHsj
Author

The test set is the test-clean subset of LibriTTS, and the number of samples is 4833.

@jishengpeng
Owner


Due to the large model's significantly increased generalization capability, I also observed a slight performance drop on the LibriTTS test-clean set (though the difference is minimal). However, your results may also be affected by other factors, such as the CUDA version, and it seems that four entries are missing from your test set. Moreover, subjective evaluation may also be important. Thank you~
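One quick way to see whether any utterances were skipped during reconstruction is to compare the reference and output file lists. The directory names below are placeholders, not paths from the repo:

```python
# Compare reference and reconstructed file sets to find skipped utterances.
import glob
import os

ref_ids = {os.path.splitext(os.path.basename(p))[0]
           for p in glob.glob("LibriTTS/test-clean/**/*.wav", recursive=True)}
rec_ids = {os.path.splitext(os.path.basename(p))[0]
           for p in glob.glob("outputs/wavtokenizer_large/**/*.wav", recursive=True)}

print(len(ref_ids), "reference files,", len(rec_ids), "reconstructions")
print("missing from reconstructions:", sorted(ref_ids - rec_ids)[:10])
```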

@XiaoshanHsj
Author

Thanks for your reply. I am now using the small model to reconstruct the waveforms.
The results are:

UTMOS_raw 19604.11721920967 4.056303997353543
UTMOS_encodec 19604.11721920967 3.9794073770832084
PESQ: 11974.47469329834 2.477648395054488
F1_score: 4487.17120589376 0.9290209536011925 3
STOI: 0.9199737990446866

@jishengpeng
Owner


OK, it appears that the results exhibit some variation across the different metrics.
