diff --git a/404/index.html b/404/index.html
index b80de1bc..38a33f01 100644
--- a/404/index.html
+++ b/404/index.html
@@ -291,7 +291,7 @@
       "id": 37,
       "url": "/speech-conversational-llms/",
       "title": "Speech LLMs for Conversations",
-      "body": "2024/05/09 - With LLMs, making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards, as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from which the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would not be bridged by LLMs purely working on transcriptions. In one of our recent experiments we build an efficient multi-modal LLM that takes speech directly to provide a better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM-based conversational system. Notice that because of the extra information in speech, some micro-personalizations can happen, like the usage of gendered pronouns1. You also get a lower impact of transcription errors and, in general, better responses to non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course, concerns around paralinguistic prediction accuracies are extremely important to take something like this to production. ↩ "
+      "body": "2024/05/09 - With LLMs, making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards, as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from which the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would not be bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide a better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM-based conversational system. Notice that because of the extra information in speech, some micro-personalizations can happen, like the usage of gendered pronouns1. You also get a lower impact of transcription errors and, in general, better responses to non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course, concerns around paralinguistic prediction accuracies are extremely important to take something like this to production. ↩ "
     }, {
       "id": 38,
       "url": "/confidence-calibration/",
↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/Code-Mixing-Seminar/index.html b/Code-Mixing-Seminar/index.html index 00475f15..7ef788b5 100644 --- a/Code-Mixing-Seminar/index.html +++ b/Code-Mixing-Seminar/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/Turn_Taking_Dynamics_in_Voice_Bots/index.html b/Turn_Taking_Dynamics_in_Voice_Bots/index.html index 5be4cb8c..3a1a263d 100644 --- a/Turn_Taking_Dynamics_in_Voice_Bots/index.html +++ b/Turn_Taking_Dynamics_in_Voice_Bots/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/about/index.html b/about/index.html index 2beef03e..84e4b1e7 100644 --- a/about/index.html +++ b/about/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authentication-in-grpc/index.html b/authentication-in-grpc/index.html index 773a8063..d1a6e226 100644 --- a/authentication-in-grpc/index.html +++ b/authentication-in-grpc/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. 
In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. 
Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors-list/index.html b/authors-list/index.html index b806271a..e44120dd 100644 --- a/authors-list/index.html +++ b/authors-list/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/Shahid/index.html b/authors/Shahid/index.html index bedf6752..47f2eac0 100644 --- a/authors/Shahid/index.html +++ b/authors/Shahid/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/Shangeth/index.html b/authors/Shangeth/index.html index 061db775..7ee70f80 100644 --- a/authors/Shangeth/index.html +++ b/authors/Shangeth/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. 
Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/Shashank/index.html b/authors/Shashank/index.html index 7ba40f35..0e00de8c 100644 --- a/authors/Shashank/index.html +++ b/authors/Shashank/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. 
In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/anirudhdagar/index.html b/authors/anirudhdagar/index.html index efbb4995..5ea23a90 100644 --- a/authors/anirudhdagar/index.html +++ b/authors/anirudhdagar/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/deepankar/index.html b/authors/deepankar/index.html index 89060a6a..11ddeb63 100644 --- a/authors/deepankar/index.html +++ b/authors/deepankar/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. 
Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/greed2411/index.html b/authors/greed2411/index.html index 3f40d56c..b08b2887 100644 --- a/authors/greed2411/index.html +++ b/authors/greed2411/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. 
In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/janaab11/index.html b/authors/janaab11/index.html index 36dcf36f..750d8465 100644 --- a/authors/janaab11/index.html +++ b/authors/janaab11/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/kritianandan/index.html b/authors/kritianandan/index.html index 7d430a34..7e262ea4 100644 --- a/authors/kritianandan/index.html +++ b/authors/kritianandan/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. 
Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/lepisma/index.html b/authors/lepisma/index.html index 9c2d14dc..b467f8ba 100644 --- a/authors/lepisma/index.html +++ b/authors/lepisma/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/mithun/index.html b/authors/mithun/index.html index afe2642b..832082be 100644 --- a/authors/mithun/index.html +++ b/authors/mithun/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/ojus1/index.html b/authors/ojus1/index.html index ea2ada59..683ff687 100644 --- a/authors/ojus1/index.html +++ b/authors/ojus1/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. 
Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/prabhsimran/index.html b/authors/prabhsimran/index.html index be51f353..be8c7b44 100644 --- a/authors/prabhsimran/index.html +++ b/authors/prabhsimran/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. 
In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/sanchit-ahuja/index.html b/authors/sanchit-ahuja/index.html index a4d47a0f..fe7621d3 100644 --- a/authors/sanchit-ahuja/index.html +++ b/authors/sanchit-ahuja/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/shantanu28sharma/index.html b/authors/shantanu28sharma/index.html index d8cd538d..bfbf2427 100644 --- a/authors/shantanu28sharma/index.html +++ b/authors/shantanu28sharma/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. 
While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. 
The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/shikharmn/index.html b/authors/shikharmn/index.html index 1466f48c..4b2ae41e 100644 --- a/authors/shikharmn/index.html +++ b/authors/shikharmn/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. 
Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/authors/swarajdalmia/index.html b/authors/swarajdalmia/index.html index 0f0c40f5..ef4c3082 100644 --- a/authors/swarajdalmia/index.html +++ b/authors/swarajdalmia/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/bad-audio-detection/index.html b/bad-audio-detection/index.html index 1306f0cc..c2da443a 100644 --- a/bad-audio-detection/index.html +++ b/bad-audio-detection/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/buy-me-a-coffee/index.html b/buy-me-a-coffee/index.html index aa3e5024..150bac92 100644 --- a/buy-me-a-coffee/index.html +++ b/buy-me-a-coffee/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/careers/index.html b/careers/index.html index 0205b491..edd47b35 100644 --- a/careers/index.html +++ b/careers/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. 
You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/categories/index.html b/categories/index.html index a272c066..80d7d61f 100644 --- a/categories/index.html +++ b/categories/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. 
Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/complexity-of-conversations/index.html b/complexity-of-conversations/index.html index 2525700f..d419c4c3 100644 --- a/complexity-of-conversations/index.html +++ b/complexity-of-conversations/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/confidence-calibration/index.html b/confidence-calibration/index.html index 4a235a34..57e97e42 100644 --- a/confidence-calibration/index.html +++ b/confidence-calibration/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. 
You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/contact/index.html b/contact/index.html index b7aafbfe..551fa98d 100644 --- a/contact/index.html +++ b/contact/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. 
Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/contextual-slu/index.html b/contextual-slu/index.html index d830cdde..483f9228 100644 --- a/contextual-slu/index.html +++ b/contextual-slu/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. 
While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/emnlp/index.html b/emnlp/index.html index 9f002b0a..ca1784d9 100644 --- a/emnlp/index.html +++ b/emnlp/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/end-of-utterance-detection/index.html b/end-of-utterance-detection/index.html index 9924b427..0096bead 100644 --- a/end-of-utterance-detection/index.html +++ b/end-of-utterance-detection/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/engineering/index.html b/engineering/index.html index e3850962..ef22fa5b 100644 --- a/engineering/index.html +++ b/engineering/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. 
While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/evaluating-an-asr-in-a-spoken-dialogue-system/index.html b/evaluating-an-asr-in-a-spoken-dialogue-system/index.html index c8ca279a..313c97b3 100644 --- a/evaluating-an-asr-in-a-spoken-dialogue-system/index.html +++ b/evaluating-an-asr-in-a-spoken-dialogue-system/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. 
With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/explore/emotional-tts/index.html b/explore/emotional-tts/index.html index 6dac47b1..f3589430 100644 --- a/explore/emotional-tts/index.html +++ b/explore/emotional-tts/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/explore/index.html b/explore/index.html index 04d6c1ea..842974e5 100644 --- a/explore/index.html +++ b/explore/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. 
While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/explore/natural-tts/index.html b/explore/natural-tts/index.html index 80cc9565..3e025e10 100644 --- a/explore/natural-tts/index.html +++ b/explore/natural-tts/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. 
With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/explore/speaker-entrainment/index.html b/explore/speaker-entrainment/index.html index 991ebb54..733b80b0 100644 --- a/explore/speaker-entrainment/index.html +++ b/explore/speaker-entrainment/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. 
Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. 
In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/explore/voice-cloning/index.html b/explore/voice-cloning/index.html index 7e95393b..59dc32e0 100644 --- a/explore/voice-cloning/index.html +++ b/explore/voice-cloning/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. 
This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/fast-microservices-with-grpc/index.html b/fast-microservices-with-grpc/index.html index 2e8c2d65..9d683f4d 100644 --- a/fast-microservices-with-grpc/index.html +++ b/fast-microservices-with-grpc/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. 
You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/feature-disentanglement1/index.html b/feature-disentanglement1/index.html index dc7aeeb8..7388f63e 100644 --- a/feature-disentanglement1/index.html +++ b/feature-disentanglement1/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. 
You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. 
With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/feed.xml b/feed.xml index d76a0b67..b38b583c 100644 --- a/feed.xml +++ b/feed.xml @@ -1,4 +1,4 @@ -Jekyll2024-05-09T13:16:40+00:00/feed.xmlSkit TechSpeech Technology from SkitSpeech LLMs for Conversations2024-05-09T00:00:00+00:002024-05-09T00:00:00+00:00/speech-conversational-llms<p>With LLMs making conversational systems has become easier. You no longer need to +Jekyll2024-05-09T17:53:21+00:00/feed.xmlSkit TechSpeech Technology from SkitSpeech LLMs for Conversations2024-05-09T00:00:00+00:002024-05-09T00:00:00+00:00/speech-conversational-llms<p>With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as @@ -9,7 +9,7 @@ come.</p> <p><a href="/speech-first-conversational-ai-revisited/">Earlier</a> we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by -LLMs purely working on transcriptions. In one of our recent experiments we build +LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only diff --git a/gsoc-2022/index.html b/gsoc-2022/index.html index 84b73648..77dd61f7 100644 --- a/gsoc-2022/index.html +++ b/gsoc-2022/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/index.html b/index.html index ea507967..0a64b9a8 100644 --- a/index.html +++ b/index.html @@ -292,7 +292,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/interspeech/index.html b/interspeech/index.html index 3d41225a..1e5a1a57 100644 --- a/interspeech/index.html +++ b/interspeech/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. 
Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/label-noise-intro/index.html b/label-noise-intro/index.html index bce0e1df..90c5db37 100644 --- a/label-noise-intro/index.html +++ b/label-noise-intro/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/ml/index.html b/ml/index.html index 0d9226d7..1e99b4ae 100644 --- a/ml/index.html +++ b/ml/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/new-blog/index.html b/new-blog/index.html index b5d2872b..ea355407 100644 --- a/new-blog/index.html +++ b/new-blog/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/normalizing-flows-part-2/index.html b/normalizing-flows-part-2/index.html index 5b49e904..c2b41ccb 100644 --- a/normalizing-flows-part-2/index.html +++ b/normalizing-flows-part-2/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/normalizing-flows/index.html b/normalizing-flows/index.html index d80952d0..5757eedc 100644 --- a/normalizing-flows/index.html +++ b/normalizing-flows/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/on-using-asr-alternatives-for-a-better-slu/index.html b/on-using-asr-alternatives-for-a-better-slu/index.html index b892e1e2..4b4b16d1 100644 --- a/on-using-asr-alternatives-for-a-better-slu/index.html +++ b/on-using-asr-alternatives-for-a-better-slu/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). 
Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/page2/index.html b/page2/index.html index 3fdcffda..77523530 100644 --- a/page2/index.html +++ b/page2/index.html @@ -293,7 +293,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/page3/index.html b/page3/index.html index bfca3bcf..139393e9 100644 --- a/page3/index.html +++ b/page3/index.html @@ -292,7 +292,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/privacy-policy/index.html b/privacy-policy/index.html index 5a8a2f38..3d606cf6 100644 --- a/privacy-policy/index.html +++ b/privacy-policy/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. 
Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/reading-sessions/index.html b/reading-sessions/index.html index 04a2fe0c..5f1997a9 100644 --- a/reading-sessions/index.html +++ b/reading-sessions/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. 
For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/repl-conversations/index.html b/repl-conversations/index.html index d857a228..2b853445 100644 --- a/repl-conversations/index.html +++ b/repl-conversations/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/resources/index.html b/resources/index.html index 335292b4..8cdd12a2 100644 --- a/resources/index.html +++ b/resources/index.html @@ -291,7 +291,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/speaker-diarization/index.html b/speaker-diarization/index.html index ff07fcf3..badd44ce 100644 --- a/speaker-diarization/index.html +++ b/speaker-diarization/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  
↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/speaker-entrainment/index.html b/speaker-entrainment/index.html index 4385d7df..0f0fbaa2 100644 --- a/speaker-entrainment/index.html +++ b/speaker-entrainment/index.html @@ -296,7 +296,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. 
Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/speech-conversational-llms/index.html b/speech-conversational-llms/index.html index b7877ef5..8c2e41a5 100644 --- a/speech-conversational-llms/index.html +++ b/speech-conversational-llms/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. 
Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs, making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas where the next set of quality improvements will come from. Earlier we discussed how spoken conversations are richer than pure text and how the gap would not be bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide a better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech, some micro personalizations can happen, like the usage of gendered pronouns1. You also get a lower impact of transcription errors and, in general, better responses to non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though this is not demonstrated in the current conversation. In addition, our approach reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above, since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course, concerns around paralinguistic prediction accuracies are extremely important to take something like this into production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", @@ -574,7 +574,7 @@

Speech LLMs for Conversations

Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by -LLMs purely working on transcriptions. In one of our recent experiments we build +LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only diff --git a/speech-first-conversational-ai-revisited/index.html b/speech-first-conversational-ai-revisited/index.html index b3c3b749..d43c7f5c 100644 --- a/speech-first-conversational-ai-revisited/index.html +++ b/speech-first-conversational-ai-revisited/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " + "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. 
While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production.  ↩ " }, { "id": 38, "url": "/confidence-calibration/", diff --git a/speech-first-conversational-ai/index.html b/speech-first-conversational-ai/index.html index 27eeed26..85b03e34 100644 --- a/speech-first-conversational-ai/index.html +++ b/speech-first-conversational-ai/index.html @@ -294,7 +294,7 @@ "id": 37, "url": "/speech-conversational-llms/", "title": "Speech LLMs for Conversations", - "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. 
With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
 }, {
 "id": 38,
 "url": "/confidence-calibration/",
diff --git a/tags/index.html b/tags/index.html
index 08af385d..6c10decd 100644
--- a/tags/index.html
+++ b/tags/index.html
@@ -291,7 +291,7 @@
 "id": 37,
 "url": "/speech-conversational-llms/",
 "title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we build an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
 }, {
 "id": 38,
 "url": "/confidence-calibration/",
diff --git a/theory-of-mind/index.html b/theory-of-mind/index.html
index cac94c48..9b2a7cf3 100644
--- a/theory-of-mind/index.html
+++ b/theory-of-mind/index.html
@@ -294,7 +294,7 @@
 "id": 37,
 "url": "/speech-conversational-llms/",
 "title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we build an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
 }, {
 "id": 38,
 "url": "/confidence-calibration/",
diff --git a/whats-new-kaldi-serve-10/index.html b/whats-new-kaldi-serve-10/index.html
index f048e25e..36be6109 100644
--- a/whats-new-kaldi-serve-10/index.html
+++ b/whats-new-kaldi-serve-10/index.html
@@ -296,7 +296,7 @@
 "id": 37,
 "url": "/speech-conversational-llms/",
 "title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we build an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
 }, {
 "id": 38,
 "url": "/confidence-calibration/",
diff --git a/woc/index.html b/woc/index.html
index 9dcf6684..58c70b85 100644
--- a/woc/index.html
+++ b/woc/index.html
@@ -293,7 +293,7 @@
 "id": 37,
 "url": "/speech-conversational-llms/",
 "title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we build an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need to focus on the low-level details of categorizing semantics and designing responses. Instead, you can concentrate on controlling high-level behaviors via an LLM. This is the trend that we see most of the world moving towards as products are using vendor combinations of ASR, LLM, and TTS with some dialog management stitched in between. While this is going to be the norm soon, we want to keep exploring areas from where the next set of quality improvements will come. Earlier we discussed how spoken conversations are richer than pure text and how the gap would be not bridged by LLMs purely working on transcriptions. In one of our recent experiments we built an efficient multi-modal LLM that takes speech directly to provide better conversational experience. For production usage, the constraint here is that this should happen without losing the flexibility that you get in a text-only LLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversational system. Notice that because of the extra information in speech some micro personalizations can happen like usage of gendered pronouns1. You also get lower impact of transcription errors and in general better responses in non-speech signals. With access to both speech and text domains, the model allows for more fluent turn-taking, though not demonstrated in the current conversation. In addition, our approach also reduces the combined model size (<2B) for taking speech to response, leading to lower compute latency as compared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markers it can generate, but that’s something to be added soon (you might have noticed erratic pitch shifts in the call above since TTS vendors don’t contextualize based on past conversations). Stay tuned for more details on how we take this and similar research areas forward. Of course concerns around paralinguistic prediction accuracies are extremely important to take something like this in production.  ↩ "
 }, {
 "id": 38,
 "url": "/confidence-calibration/",