Questions about more detailed experimental results #2

Open
hbwu-ntu opened this issue Aug 30, 2024 · 2 comments
Labels
good question

Comments

@hbwu-ntu

Hi @jishengpeng, thank you for the amazing work. May I ask several questions:

  1. What are the results for the large and medium models? Currently, only small-model results appear in the paper.
  2. Do you have an ablation study showing the performance gain from incorporating the attention block?
  3. Do you have an ablation study showing the performance gain from changing the decoder to one similar to Vocos?
  4. Will you compare your codec model with Single-Codec or Ti-Codec? It is hard to compare with Single-Codec since it is not open source, but Ti-Codec is; will you include it in the comparison?
  5. Would you consider a human evaluation, given that the current trends between UTMOS and PESQ (STOI) are not consistent? UTMOS is a proxy for human listening, much like DNSMOS, but such proxies are not accurate enough; PESQ and STOI are also only proxies for human listening.
@jishengpeng
Owner

Thank you very much for your interest!

  1. The experiments for the medium and large versions have not been completed due to resource constraints; we are still in the training phase. However, based on the current results, it appears that the medium and large versions will exhibit significantly better generalization in codec reconstruction.
  2. We conducted numerous ablation studies regarding the attention blocks (adding attention blocks at various positions in the encoder and decoder). Negative effects were observed in certain positions. The approach presented in the paper has proven beneficial across hundreds of test samples, but we have yet to rigorously validate it on thousands. Thus, we will include additional experiments in future versions.
  3. Regarding the Vocos decoder, we attempted to replace it with an inverted upsampling structure, but the results were poor. Similar to the previous point, we have established the effectiveness of Vocos, yet we plan to supplement our findings with stricter experiments on thousands of test samples.
  4. According to the results presented in the paper, the UTMOS score for Single-Codec is only 3.0, which is why we did not perform a comparison. The two-encoder approach of Ti-Codec is not particularly elegant, and its performance appears inferior to DAC. We may supplement the Ti-Codec results in a later version.
  5. After listening to a large number of samples, we found that UTMOS results are closer to human auditory perception. PESQ and STOI are good testing metrics; however, they are not sensitive to certain noise artifacts. It would be ideal to include a subjective metric as well. Although we did not conduct a crowdsourced evaluation, based on my personal listening experience, the results from WavTokenizer are satisfactory.
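
As a reference for the metric discussion above, below is a minimal sketch of how PESQ and STOI can be computed for a reference/reconstruction pair using the open-source `pesq` and `pystoi` packages. This is not the evaluation code used for the paper; the file names and the 16 kHz sampling rate are placeholder assumptions.

```python
# Minimal sketch of objective metric computation (not the paper's evaluation code).
# Assumes the open-source `pesq` and `pystoi` packages and 16 kHz mono audio;
# the file paths below are placeholders.
import librosa
from pesq import pesq
from pystoi import stoi

SR = 16000  # wideband PESQ operates at 16 kHz

ref, _ = librosa.load("reference.wav", sr=SR)        # original waveform
deg, _ = librosa.load("reconstruction.wav", sr=SR)   # codec reconstruction

# Trim to a common length; codecs may pad or drop a few samples.
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]

print("PESQ (wb):", pesq(SR, ref, deg, "wb"))
print("STOI:", stoi(ref, deg, SR, extended=False))
```

UTMOS and DNSMOS, being reference-free proxies for human listening, would instead be scored with their own pretrained predictors on the reconstructed audio alone.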

Best regards.

@jishengpeng pinned this issue Aug 30, 2024
@jishengpeng added the good question label and removed the good first issue label Aug 30, 2024
@hbwu-ntu
Author

Thank you for the answers! Glad to see more numbers in the upcoming arXiv version.
