Add inference bargein and examples #4032
Conversation
theomonnom left a comment:
Nice work!
Let's move this file to `other`
I will remove it before merging. I don't think the user will need it.
examples/voice_agents/basic_agent.py (Outdated)
```python
    metrics,
    room_io,
)
from livekit.agents.inference.bargein import BargeinDetector
```
Can it just be:

```diff
-from livekit.agents.inference.bargein import BargeinDetector
+from livekit.agents.inference import BargeinDetector
```
examples/voice_agents/basic_agent.py (Outdated)
```python
    # See more at https://docs.livekit.io/agents/build/turns
    turn_detection=MultilingualModel(),
    vad=ctx.proc.userdata["vad"],
    bargein_detector=BargeinDetector(),
```
I like this pattern (it follows how we use STT/TTS in the inference gateway in our docs).

```diff
-bargein_detector=BargeinDetector(),
+bargein_detector=inference.BargeinDetector(),
```
Tho I'm also wondering if it should always be there by default.
```python
# emit the preceding sentinel event immediately before this event
# assuming *only one* sentinel event could precede the current event
# ignore if the previous event is not a sentinel event
logger.debug(
```
I see that we're only re-emitting events inside `_transcript_buffer` when we receive a new STT event. Shouldn't we also re-emit them if we detect it was a barge-in? The STT may not send us any additional events, so there is a case where we just ignore this buffered event.
Yes, there is no `return` in this `elif` branch, so the after-the-barge-in event will be processed normally.
What I mean is what if this scenario happens:
- should_hold_event is True
- we buffer the transcript
- we detected it was indeed a barge-in
- we never receive new transcripts from the STT
In this case it seems like we just lose the buffered transcript and never trigger a new generation?
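To make the concern concrete, here is a minimal toy model of the flow described above (the names `HoldingBuffer`, `should_hold`, etc. are hypothetical, not the actual livekit internals): if buffered transcripts are only flushed when a *new* STT event arrives, a confirmed barge-in with no follow-up transcript leaves them stuck.

```python
class HoldingBuffer:
    """Toy model: transcripts are held while should_hold is True and are
    only re-emitted when a *new* STT event arrives afterwards."""

    def __init__(self) -> None:
        self.held: list[str] = []
        self.emitted: list[str] = []

    def on_stt_event(self, text: str, should_hold: bool) -> None:
        if should_hold:
            # potential barge-in in progress: buffer instead of emitting
            self.held.append(text)
            return
        # a new non-held event flushes anything buffered, then emits itself
        self.emitted.extend(self.held)
        self.held.clear()
        self.emitted.append(text)


buf = HoldingBuffer()
buf.on_stt_event("stop for a second", should_hold=True)  # buffered
# ... barge-in is confirmed, but STT never sends another event ...
# the transcript stays stuck in `held` and never reaches the pipeline
```
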
Great question! Here is a breakdown of what I am thinking when there is no new transcript:
Scenario 1: If there is no actual user speech/barge-in
This is a barge-in false positive. If resume_false_interruption is enabled, the audio will pause and then resume. If not enabled, it will interrupt with no new transcript.
Scenario 2: If there is actual user speech/barge-in but no new transcript
Case 1: This is an STT failure. But it should behave similarly to Scenario 1.
Case 2: The transcript for the barge-in comes before the inference is done
The typical audio the model needs for a barge-in detection is around 300ms, which means we can lose the transcript for that 300ms window. I think it is very rare to have a true barge-in of just 300ms of speech, and rarer still for the STT to finish transcribing it before the barge-in is detected.
But there is a non-zero probability of barge-in detection being late, in which case, we might lose more transcript. What I can do here is to re-emit transcripts back until the last speaking point (if we have the timing information), but it might include non-barge-in transcript if the user says something like "Right, right [bc], we should ...[barge-in]".
Okay, I have updated it so that the buffer is flushed either at the end of agent speech or on new STT events, going back as far as the last overlapping speech start.
A detailed spec can be found in the Notion page.