
Conversation

@chenghao-mou (Contributor) commented Nov 20, 2025:

  • New barge-in detector under `inference`
  • Two stream implementations:
    • HTTP endpoints for working with the hosted model
    • WS for working with the gateway proxy

A detailed spec can be found in the Notion page.

@chenghao-mou force-pushed the chenghaomou/agt-2182-barge-in-detector-interface branch 2 times, most recently from 1fe7b1b to 783b91a on November 21, 2025 at 15:48
@chenghao-mou changed the title from "Add barge plugin and examples" to "Add inference bargein and examples" on Nov 21, 2025
@chenghao-mou force-pushed the chenghaomou/agt-2182-barge-in-detector-interface branch from 783b91a to 3d9d0af on November 24, 2025 at 11:51
@theomonnom (Member) left a comment:

Nice work!

Member:

Let's move this file to `other`.

Contributor Author:

I will remove it before merging. I don't think the user will need it.

```python
    metrics,
    room_io,
)
from livekit.agents.inference.bargein import BargeinDetector
```
Member:

Can it just be:

Suggested change:

```diff
-from livekit.agents.inference.bargein import BargeinDetector
+from livekit.agents.inference import BargeinDetector
```

```python
    # See more at https://docs.livekit.io/agents/build/turns
    turn_detection=MultilingualModel(),
    vad=ctx.proc.userdata["vad"],
    bargein_detector=BargeinDetector(),
```
Member:

I like this pattern (it follows how we use STT/TTS in the inference gateway in our docs)

Suggested change:

```diff
-    bargein_detector=BargeinDetector(),
+    bargein_detector=inference.BargeinDetector(),
```

Though I'm also wondering if it should always be there by default.

```python
    # emit the preceding sentinel event immediately before this event
    # assuming *only one* sentinel event could precede the current event
    # ignore if the previous event is not a sentinel event
    logger.debug(
```
Member:

I see that we're only re-emitting events inside `_transcript_buffer` when we receive a new STT event. Shouldn't we also re-emit it if we detect it was a barge-in? The STT may not send us any additional events, so there is a case where we just ignore this buffered event.

Contributor Author:

Yes, I don't have a `return` in this `elif` branch, so the event that arrives after the barge-in will be processed normally.
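For readers following along, the fall-through behavior described here can be sketched in a self-contained way (hypothetical names and shapes, not the actual livekit-agents code): the confirmed-barge-in branch intentionally has no `return`, so the event that triggered the flush is also handled by the normal path below it.

```python
# Minimal sketch of "no return in the elif branch" (hypothetical names):
# a held event is buffered; once a barge-in is confirmed, the buffer is
# flushed AND the current event falls through to normal processing.

def handle_stt_event(event: str, state: dict) -> list[str]:
    processed: list[str] = []
    if state.get("should_hold_event"):
        # candidate barge-in: buffer instead of processing
        state.setdefault("buffer", []).append(event)
        return processed
    elif state.pop("bargein_confirmed", False):
        # flush anything buffered while the detector was deciding;
        # note: no `return` here, so `event` is also processed below
        processed.extend(state.pop("buffer", []))
    processed.append(event)  # normal processing path
    return processed

state = {"buffer": ["hey"], "bargein_confirmed": True}
print(handle_stt_event("stop for a second", state))
# -> ['hey', 'stop for a second']
```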

Member:

What I mean is what if this scenario happens:

  1. `should_hold_event` is `True`
  2. we buffer the transcript
  3. we detected it was indeed a barge-in
  4. we never receive new transcripts from the STT

In this case it seems like we just lose the buffered transcript and never trigger a new generation?

Contributor Author:

Great question! Here is a breakdown of what I am thinking when there is no new transcript:

Scenario 1: If there is no actual user speech/barge-in

This is a barge-in false positive. If `resume_false_interruption` is enabled, the audio will pause and then resume; if not, it will interrupt with no new transcript.

Scenario 2: If there is actual user speech/barge-in but no new transcript

Case 1: This is an STT failure. But it should behave similarly to Scenario 1.

Case 2: The transcript for the barge-in comes before the inference is done

The model typically needs around 300 ms of audio for barge-in detection, which means we can lose the transcript for that 300 ms window. I think it is very rare for a true barge-in to consist of just 300 ms of speech and for the STT to finish transcribing it before the detection completes.

But there is a non-zero probability of the barge-in detection being late, in which case we might lose more transcript. What I can do here is re-emit transcripts back to the last speaking point (if we have the timing information), but that might include non-barge-in transcript if the user says something like "Right, right [bc], we should ...[barge-in]".

Contributor Author commented Nov 27, 2025:

Okay, I have updated it so that it flushes either at the end of agent speech or at new STT events, up to the last overlap speech start.
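The flush rule described above can be sketched roughly as follows (hypothetical names and event shapes, not the actual implementation): buffered transcript events carry their start times, and a flush, triggered either by the end of agent speech or by a new STT event, re-emits only those events that start at or after the last overlapping-speech start.

```python
# Rough sketch of the flush-up-to-last-overlap rule (hypothetical names):
# held transcript events are timestamped; flushing keeps only the events
# from the last point where user speech overlapped agent speech onward.

from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    # each held event is (start_time_seconds, text)
    events: list[tuple[float, str]] = field(default_factory=list)

    def hold(self, start_time: float, text: str) -> None:
        self.events.append((start_time, text))

    def flush(self, last_overlap_start: float) -> list[str]:
        """Re-emit buffered events starting at or after the last overlap.

        Called either when the agent finishes speaking or when a new
        STT event arrives; earlier events (e.g. backchannels) are dropped.
        """
        kept = [text for ts, text in self.events if ts >= last_overlap_start]
        self.events.clear()
        return kept


buf = TranscriptBuffer()
buf.hold(1.0, "right, right")    # backchannel before the overlap
buf.hold(2.5, "we should stop")  # actual barge-in speech
print(buf.flush(last_overlap_start=2.0))
# -> ['we should stop']
```

This avoids replaying pre-barge-in backchannels while still recovering the speech that was buffered during a late detection.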

@chenghao-mou chenghao-mou marked this pull request as ready for review November 24, 2025 21:10
@chenghao-mou chenghao-mou requested a review from a team November 24, 2025 21:11
@chenghao-mou force-pushed the chenghaomou/agt-2182-barge-in-detector-interface branch 2 times, most recently from a2c1e7e to 1326c24 on November 27, 2025 at 11:19
@chenghao-mou force-pushed the chenghaomou/agt-2182-barge-in-detector-interface branch from 2bc9cf5 to bd0115d on November 27, 2025 at 12:35