Bidi streaming proposal end of utterance detection #13

Open
seyuf opened this issue Jan 28, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@seyuf

seyuf commented Jan 28, 2020

Hi,

Many thanks for this awesome work!
I have a use case that came out of my use of the project, and I thought it was worth raising here, as I believe it could be implemented directly on the main branch.

I've already implemented a kind of PoC or v1 here.

The idea would be to add silence / end-of-utterance detection to the server.
Today, what I observe is that in bidi streaming the server transcribes the stream of messages sent by the client indefinitely, appending the results at each iteration.
So if one wants to reset the result, one is forced to kill the connection from the client.

What I did in the above link is somewhat similar: from the client side I send an end_of_utterance value in the audio config message, which tells the server "I'm done, send me the last result and close the connection." I also set an is_final value in the last result message, signalling to the client that this is the last result from the server and that the connection has been closed.
Although this works, it is not very satisfying. To me, the right thing would be to keep the connection alive and just reset the results when an utterance has ended. I also believe the server could do the end-of-utterance detection itself, using silence detection.
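
To make the client-signalled flow above concrete, here is a minimal Python sketch. It is not the project's actual API: `Request`, `Response`, `serve_stream` and the `decode` callback are hypothetical names, and only the `end_of_utterance` and `is_final` fields come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Request:
    audio: bytes = b""
    end_of_utterance: bool = False  # flag the client sets when it is done


@dataclass
class Response:
    transcript: str
    is_final: bool = False  # marks the last result before the stream closes


def serve_stream(requests: Iterable[Request],
                 decode: Callable[[bytes], str]) -> Iterator[Response]:
    """Accumulate transcripts until the client flags the end of utterance."""
    words: List[str] = []
    for request in requests:
        if request.audio:
            text = decode(request.audio)  # decode is a stand-in for the real decoder
            if text:
                words.append(text)
        if request.end_of_utterance:
            # Last result: the stream ends here, so later requests are ignored.
            yield Response(" ".join(words), is_final=True)
            return
        yield Response(" ".join(words))
```

In this sketch the stream simply ends once the flag is seen, which mirrors the "send me the last result and close the connection" behaviour and also shows why it is limiting: a new utterance needs a new connection.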

The idea would be to consider that we are at the end of an utterance if we receive silent audio for some amount of time or number of iterations (the code seems to be already in place here).
So:

  1. The client specifies in the audio config message whether it would like the server to detect the end of utterances (if not, we keep the current behaviour).
  2. The client sends a stream of messages.
  3. After multiple consecutive empty audio decodings, the server decides we're at the end of an utterance.
  4. The server sends back the result with is_final set to true in the response message.
  5. The server resets its data but keeps the connection alive (or maybe kills it? This could be optional), waiting for new input from the client.
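
The steps above could be sketched roughly as follows in Python. This is only an illustration under assumptions: `transcribe_with_eou`, `decode`, `detect_eou` and `max_empty` are hypothetical names, an "empty decoding" is modelled as `decode` returning an empty string, and silence before any speech is assumed not to count as an end of utterance.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def transcribe_with_eou(chunks: Iterable[bytes],
                        decode: Callable[[bytes], str],
                        detect_eou: bool = True,
                        max_empty: int = 3) -> Iterator[Tuple[str, bool]]:
    """Yield (transcript, is_final) pairs for a stream of audio chunks."""
    words: List[str] = []
    empty_run = 0  # number of consecutive empty decodings seen so far
    for chunk in chunks:
        text = decode(chunk)
        if text:
            words.append(text)
            empty_run = 0
        else:
            empty_run += 1
        # Step 3: enough consecutive empty decodings after some speech.
        at_end = detect_eou and bool(words) and empty_run >= max_empty
        # Step 4: the result carries is_final=True at the end of an utterance.
        yield " ".join(words), at_end
        if at_end:
            # Step 5: reset the accumulated result but keep consuming the stream.
            words = []
            empty_run = 0
```

With `detect_eou=False` the loop never resets, which matches the current behaviour of appending results for as long as the connection stays open.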

I hope this is understandable enough. If so, I would like some feedback, if possible.

Regards

@seyuf seyuf closed this as completed Feb 16, 2020
@lepisma
Contributor

lepisma commented Feb 17, 2020

Hey @seyuf, can we reopen this? The feature is something we haven't considered yet, but we would like to have some discussion before closing.

Not guaranteeing a discussion now, but let's keep this open :)

@seyuf
Author

seyuf commented Feb 17, 2020

Hi, sure np.

@seyuf seyuf reopened this Feb 17, 2020
@pskrunner14 pskrunner14 added the enhancement New feature or request label Jun 23, 2020

3 participants