<p>With LLMs, building conversational systems has become easier. You no longer need to
focus on the low-level details of categorizing semantics and designing
responses. Instead, you can concentrate on controlling high-level behaviors via
an LLM. This is the trend that we see most of the world moving towards as
@@ -9,7 +9,7 @@ come.</p>
<p><a href="/speech-first-conversational-ai-revisited/">Earlier</a> we discussed how spoken
conversations are richer than pure text and how the gap would not be bridged by
-LLMs purely working on transcriptions. In one of our recent experiments we build
+LLMs purely working on transcriptions. In one of our recent experiments we built
an efficient multi-modal LLM that takes speech directly to provide a better
conversational experience. For production usage, the constraint here is that
this should happen without losing the flexibility that you get in a text-only
diff --git a/gsoc-2022/index.html b/gsoc-2022/index.html
index 84b73648..77dd61f7 100644
--- a/gsoc-2022/index.html
+++ b/gsoc-2022/index.html
@@ -294,7 +294,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/resources/index.html b/resources/index.html
index 335292b4..8cdd12a2 100644
--- a/resources/index.html
+++ b/resources/index.html
@@ -291,7 +291,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/speaker-diarization/index.html b/speaker-diarization/index.html
index ff07fcf3..badd44ce 100644
--- a/speaker-diarization/index.html
+++ b/speaker-diarization/index.html
@@ -296,7 +296,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/speaker-entrainment/index.html b/speaker-entrainment/index.html
index 4385d7df..0f0fbaa2 100644
--- a/speaker-entrainment/index.html
+++ b/speaker-entrainment/index.html
@@ -296,7 +296,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/speech-conversational-llms/index.html b/speech-conversational-llms/index.html
index b7877ef5..8c2e41a5 100644
--- a/speech-conversational-llms/index.html
+++ b/speech-conversational-llms/index.html
@@ -294,7 +294,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
@@ -574,7 +574,7 @@ Speech LLMs for Conversations
Earlier we discussed how spoken
conversations are richer than pure text and how the gap would be not bridged by
-LLMs purely working on transcriptions. In one of our recent experiments we build
+LLMs purely working on transcriptions. In one of our recent experiments we built
an efficient multi-modal LLM that takes speech directly to provide better
conversational experience. For production usage, the constraint here is that
this should happen without losing the flexibility that you get in a text-only
diff --git a/speech-first-conversational-ai-revisited/index.html b/speech-first-conversational-ai-revisited/index.html
index b3c3b749..d43c7f5c 100644
--- a/speech-first-conversational-ai-revisited/index.html
+++ b/speech-first-conversational-ai-revisited/index.html
@@ -294,7 +294,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/speech-first-conversational-ai/index.html b/speech-first-conversational-ai/index.html
index 27eeed26..85b03e34 100644
--- a/speech-first-conversational-ai/index.html
+++ b/speech-first-conversational-ai/index.html
@@ -294,7 +294,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/tags/index.html b/tags/index.html
index 08af385d..6c10decd 100644
--- a/tags/index.html
+++ b/tags/index.html
@@ -291,7 +291,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/theory-of-mind/index.html b/theory-of-mind/index.html
index cac94c48..9b2a7cf3 100644
--- a/theory-of-mind/index.html
+++ b/theory-of-mind/index.html
@@ -294,7 +294,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/whats-new-kaldi-serve-10/index.html b/whats-new-kaldi-serve-10/index.html
index f048e25e..36be6109 100644
--- a/whats-new-kaldi-serve-10/index.html
+++ b/whats-new-kaldi-serve-10/index.html
@@ -296,7 +296,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",
diff --git a/woc/index.html b/woc/index.html
index 9dcf6684..58c70b85 100644
--- a/woc/index.html
+++ b/woc/index.html
@@ -293,7 +293,7 @@
"id": 37,
"url": "/speech-conversational-llms/",
"title": "Speech LLMs for Conversations",
- "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we buildan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
+ "body": "2024/05/09 - With LLMs making conversational systems has become easier. You no longer need tofocus on the low-level details of categorizing semantics and designingresponses. Instead, you can concentrate on controlling high-level behaviors viaan LLM. This is the trend that we see most of the world moving towards asproducts are using vendor combinations of ASR, LLM, and TTS with some dialogmanagement stitched in between. While this is going to be the norm soon, we wantto keep exploring areas from where the next set of quality improvements willcome. Earlier we discussed how spokenconversations are richer than pure text and how the gap would be not bridged byLLMs purely working on transcriptions. In one of our recent experiments we builtan efficient multi-modal LLM that takes speech directly to provide betterconversational experience. For production usage, the constraint here is thatthis should happen without losing the flexibility that you get in a text-onlyLLM around writing prompts, making changes, evaluating, and debugging. Below is a conversation with our recent in-house Speech LLM based conversationalsystem. Notice that because of the extra information in speech some micropersonalizations can happen like usage of gendered pronouns1. You also getlower impact of transcription errors and in general better responses innon-speech signals. With access to both speech and text domains, the modelallows for more fluent turn-taking, though not demonstrated in the currentconversation. In addition, our approach also reduces the combined model size(<2B) for taking speech to response, leading to lower compute latency ascompared to larger systems. The model above doesn’t yet control speech synthesis beyond the textual markersit can generate, but that’s something to be added soon (you might have noticederratic pitch shifts in the call above since TTS vendors don’t contextualizebased on past conversations). Stay tuned for more details on how we take thisand similar research areas forward. Of course concerns around paralinguistic prediction accuracies areextremely important to take something like this in production. ↩ "
}, {
"id": 38,
"url": "/confidence-calibration/",