About ASR #6

Open
wntg opened this issue Sep 2, 2024 · 4 comments
Labels: good question (the valuable question)

Comments

wntg commented Sep 2, 2024

Thanks for your excellent work!
I want to ask how discrete tokenizers perform on ASR. Could you share your understanding? Thanks!

jishengpeng (Owner) commented Sep 3, 2024

The most relevant evidence on how discrete tokenizers perform in Automatic Speech Recognition (ASR) tasks is the set of experiments on the ARCH benchmark with WavTokenizer presented in the paper.

  1. In ASR tasks, the primary focus is the textual content of the speech modality, whereas the inherent acoustic information is not emphasized. Semantic tokenizers like Whisper are therefore sufficient for ASR; if discretization is required, directly discretizing semantic tokens is the better fit (a minimal sketch of this route follows the list below). Employing acoustic tokenizers solely for ASR tasks is not ideal. Notably, in end-to-end speech dialogue systems like GPT-4o, the input is not merely the text output of an ASR model; it also requires information inherent to the speech modality, such as the speaker's emotion, tone, and style. Consequently, acoustic tokens have a broader application scope in such multi-task large models than semantic tokens. Furthermore, in future multi-modal unified large models, acoustic tokenizers can better distinguish speech from the other modalities, representing the speech modality itself.

  2. Regarding semantic tokens in ASR tasks: although numerous efforts have been made to enhance the semantic information in acoustic tokenizers, some even sacrificing audio and music modeling capability for semantic modeling capability, their semantic content still remains inferior to that of the best semantic tokenizers. This holds even for more elegant semantic-enhancement methods such as WavTokenizer, which is still weaker than encoder-based models like HuBERT.

  3. However, we believe that acoustic tokenizers have potential and may eventually match the semantic information content of encoder-based models like HuBERT. The key lies in elegantly enhancing the encoder's capabilities. Notably, in WavTokenizer, we significantly strengthened the decoder, but the encoder is also crucial. One of our objectives is to substantially enhance the encoder's capabilities in WavTokenizer 2, thereby further improving semantic modeling capabilities.
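
To make the "directly discretize semantic tokens" route from point 1 concrete, below is a minimal sketch. It assumes the Hugging Face `transformers` and scikit-learn APIs; the HuBERT checkpoint name and the cluster count are illustrative choices, not something prescribed by the paper.

```python
# Sketch: quantize continuous semantic-encoder features into discrete units.
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

MODEL_NAME = "facebook/hubert-base-ls960"  # assumed public checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = HubertModel.from_pretrained(MODEL_NAME).eval()

def semantic_units(waveform_16k, kmeans):
    """Map a 16 kHz waveform (1-D float array) to discrete semantic units."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        feats = encoder(**inputs).last_hidden_state.squeeze(0)  # (T, 768)
    return kmeans.predict(feats.numpy())  # (T,) cluster ids = semantic tokens

# The k-means codebook would be fit offline on features pooled from a corpus:
#   kmeans = KMeans(n_clusters=500, n_init="auto").fit(corpus_features)
```

These cluster ids are the kind of discretized semantic tokens that suit ASR, in contrast to codec tokens optimized for waveform reconstruction.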

jishengpeng added the good question label Sep 3, 2024
jishengpeng pinned this issue Sep 3, 2024
wntg (Author) commented Sep 3, 2024

Thank you for your detailed answer. I feel the same way. I hope that one day tokenizers will have good expressiveness both acoustically and semantically.
In addition, I have other questions. In current work, discrete coding can generate realistic audio. These speech outputs carry textual content, but why can't they express semantics well? I suspect this is biased learning, but that's not necessarily a bad thing; for example, a child may learn to speak before learning to write. Regarding deep learning tasks, can we use discrete coding to conduct voice dialogues directly, skipping the conversion to text?

jishengpeng (Owner) commented

Regarding the two new questions raised, our perspectives are as follows:

  1. On the question of why the reconstruction quality is good but the semantic content is not particularly rich, there are a few possible explanations. One is that the encoder of an acoustic codec may have multiple training objectives (semantic and acoustic), so even if it contains semantic information, there can be fusion and interference between the two. Another factor is that Whisper uses a single encoder structure, whereas codec models derive their strong reconstruction capability from the combined encoder-decoder architecture. The current strong reconstruction performance is thus most likely attributable to the powerful decoder, so future efforts should focus on further strengthening the encoder.
  2. For the GPT-4o dialogue system and subsequent multimodal large language models, my personal view is that the developmental trajectory will progress from a text-only ASR + LLM + TTS cascade, to an implicit cascade using latent embeddings, and finally to a direct end-to-end approach that takes codec representations extracted by a tokenizer such as WavTokenizer as input and generates target codec outputs (see the sketch after this list). The ultimate goal is to leverage the inherent tokenizers of the various modalities to enable truly end-to-end training across arbitrary modalities.
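
To illustrate the final stage in point 2, here is a hypothetical sketch of a speech-to-speech dialogue turn that stays entirely in codec-token space. The `encode`, `generate`, and `decode` interfaces are assumptions for illustration, not the actual WavTokenizer API.

```python
# Sketch: end-to-end spoken dialogue directly over codec tokens,
# with no ASR transcript or text LLM in the loop.
import torch

def speech_to_speech_turn(user_waveform, codec, dialogue_lm):
    # 1. Discretize user speech into codec token ids; unlike an ASR
    #    transcript, these retain prosody, emotion, and speaker style.
    input_ids = codec.encode(user_waveform)             # (T_in,)
    # 2. Generate the response autoregressively in the same token space.
    with torch.no_grad():
        response_ids = dialogue_lm.generate(input_ids)  # (T_out,)
    # 3. Decode the generated ids back into a response waveform.
    return codec.decode(response_ids)
```

The design choice that matters here is that input and output live in one discrete space, which is what would make truly end-to-end training across modalities possible.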

wntg (Author) commented Sep 4, 2024

Thanks for your answer; I learned a lot, and I agree with you very much. Regarding the second point, I would like to try end-to-end research using tokenizers such as WavTokenizer. I also look forward to your follow-up work!
