@preciselyV

As discussed in #3918, some STTs may misfire and produce false speech recognition. To fix this, @chenghao-mou suggested rewriting StreamAdapter so that it can work with stream-capable STTs and only forward STT events once VADEvent.START_OF_SPEECH has been generated.

The implementation differs a bit from the suggestion:

  1. Input chunks are always sent via push_frame() to both the STT and VAD streams, instead of starting to send to the STT only after the corresponding VAD event. I went this route because some STTs may cache previously received frames to improve prediction, and starting to send only after the VAD event is guaranteed to "eat" at least one chunk of user speech. I don't want to hurt accuracy at all.
  2. Instead of calling stt._recognize_impl() and relying on NotImplementedError to detect stream-only STTs, we check stt.capabilities and a new StreamAdapter parameter called force_stream. Calling stt._recognize_impl(), when it is implemented, would introduce an unnecessary API call, and since that would happen during initialization, it would affect performance. It's also better to let users decide which mode they'd like to use.

By default, force_stream=False in both StreamAdapter and StreamAdapterWrapper for backward compatibility. Unless it is set to True, the adapter keeps using the old logic, so nothing changes for anyone already relying on it.
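To make the two points above concrete, here is a minimal, self-contained sketch of the idea, not the code in this PR. `FakeStream`, `GatedAdapter`, `use_streaming_path`, and the callback names are all hypothetical stand-ins, and the rule that streaming mode needs both `force_stream=True` and a streaming-capable STT is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStream:
    """Hypothetical stand-in for an STT or VAD input stream."""

    frames: list = field(default_factory=list)

    def push_frame(self, frame: bytes) -> None:
        self.frames.append(frame)


def use_streaming_path(supports_streaming: bool, force_stream: bool) -> bool:
    # Assumption: the new path is opt-in and also needs a streaming-capable STT.
    # With force_stream=False (the default) the old logic is always used.
    return force_stream and supports_streaming


class GatedAdapter:
    """Feeds every frame to both streams, but gates STT events on VAD."""

    def __init__(self, stt_stream: FakeStream, vad_stream: FakeStream) -> None:
        self.stt_stream = stt_stream
        self.vad_stream = vad_stream
        self.speech_started = False
        self.emitted: list = []

    def push_frame(self, frame: bytes) -> None:
        # Both streams always see the audio, so the STT never misses
        # the chunk in which speech begins.
        self.stt_stream.push_frame(frame)
        self.vad_stream.push_frame(frame)

    def on_vad_start_of_speech(self) -> None:
        self.speech_started = True

    def on_stt_event(self, event: str) -> None:
        # STT events that arrive before any VAD START_OF_SPEECH are dropped.
        if self.speech_started:
            self.emitted.append(event)
```

In this sketch an STT event produced from background noise before any VAD start is silently dropped, while the audio itself still reaches the STT. How and when the gate is closed again is deliberately left out here.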

@CLAassistant

CLAassistant commented Nov 17, 2025

CLA assistant check
All committers have signed the CLA.

):
continue

if event.type == SpeechEventType.FINAL_TRANSCRIPT and status:
You can receive FINAL_TRANSCRIPT while the user is still speaking, especially when there is a short pause:

[User speech 1]............[pause].....[User speech 2]
..........................................[F1]................[F2]

Here the first FINAL_TRANSCRIPT arrives after VAD has already emitted START_OF_SPEECH for speech 2, so it turns off start_of_speech_received. The second FINAL_TRANSCRIPT is then ignored.

Compared to VAD events, STT events can also be delayed, and their reliability varies across vendors.
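The ordering problem can be reproduced with a small replay of the timeline. This is a simplified model, not the adapter code under review: it assumes the flag is set on a VAD START_OF_SPEECH and cleared whenever a FINAL_TRANSCRIPT is forwarded, and the `replay` helper and the tuple-based event encoding are made up for the illustration.

```python
def replay(events: list[tuple[str, str, str]]) -> list[str]:
    """Replay (source, type, label) events and return forwarded labels."""
    start_of_speech_received = False
    forwarded: list[str] = []
    for source, event_type, label in events:
        if source == "vad" and event_type == "START_OF_SPEECH":
            start_of_speech_received = True
        elif source == "stt" and event_type == "FINAL_TRANSCRIPT":
            if start_of_speech_received:
                forwarded.append(label)
                # Assumed behaviour: a forwarded final transcript closes the gate.
                start_of_speech_received = False
    return forwarded


# The first final transcript is delayed past the start of speech 2.
timeline = [
    ("vad", "START_OF_SPEECH", "speech 1"),
    ("vad", "END_OF_SPEECH", "speech 1"),
    ("vad", "START_OF_SPEECH", "speech 2"),
    ("stt", "FINAL_TRANSCRIPT", "F1"),
    ("vad", "END_OF_SPEECH", "speech 2"),
    ("stt", "FINAL_TRANSCRIPT", "F2"),
]

print(replay(timeline))  # ['F1']
```

Under these assumptions F2 is lost because F1 consumed the flag that was set for speech 2. One possible direction is to close the gate from VAD events rather than from STT events, but that is a design choice for the PR author.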
