Text streaming support #5

Open
Shulyaka opened this issue Jan 17, 2024 · 7 comments
Labels: enhancement (New feature or request)
@Shulyaka

It would be good to support chunked text the same way we support chunked audio. The reason is that LLMs produce text token by token, and when the text is long, we would like to start producing the audio via TTS right away instead of waiting for the whole response.

@sdetweil

sdetweil commented Jan 17, 2024

well, i fiddled with mine to do that..

and it needs a redesign. you can't wait for the response from a chunk, so you have to spin off an async task to handle the waits, and if there is text, send it some place to consolidate with the prior text and maybe signal done.
and then on the send side, I don't know what happens if the handler blocks while transcribing. does it hold up the next block arriving? buffering.. so one would have to spin off another async thread with a queue to handle the transcribes.. and sends..
and figure out how to align the audio data with all the interim transcribes.

@sdetweil

so I modified my asr to do interim results, on the fly.. but as suspected it will take some work to figure out what to do with the audio data..

currently, for testing, if the transcriber returns text (not '') then I send that back and drop the audio input saved up to that point,
effectively starting over... BUT.. this truncates some of the text response..

it should be testing testing testing testing

but I got test test Washington testing test, with a lot of empty text responses in between:

returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=Washington
returned text=
returned text=
returned text=testing
returned text=
returned text=
returned text=test
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=
returned text=

I don't know what my transcriber does under the covers..

@synesthesiam
Contributor

I think this could be done with appropriate start/stop/chunk events. So for the ASR/STT response, it could be TranscriptStart, TranscriptChunk, TranscriptStop. This way, the server would be able to differentiate it cleanly from the original Transcript, which is the whole thing at once.
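The proposed events could look something like this. A hedged sketch: TranscriptStart/TranscriptChunk/TranscriptStop are not part of the protocol at this point, and the `"type"` strings and JSON shape below are assumptions modeled on Wyoming's JSON-header event style.

```python
import json
from dataclasses import dataclass

# Hypothetical streaming-transcript events, per the proposal above.
# Each serializes to a JSON header of the form {"type": ..., "data": ...}.

@dataclass
class TranscriptStart:
    def event(self) -> str:
        return json.dumps({"type": "transcript-start", "data": {}})

@dataclass
class TranscriptChunk:
    text: str
    def event(self) -> str:
        return json.dumps({"type": "transcript-chunk",
                           "data": {"text": self.text}})

@dataclass
class TranscriptStop:
    def event(self) -> str:
        return json.dumps({"type": "transcript-stop", "data": {}})

# A streaming server would emit TranscriptStart, then one TranscriptChunk
# per piece of interim text, then TranscriptStop; the original Transcript
# event could still carry the full text at the end for old clients.
events = [TranscriptStart().event(),
          TranscriptChunk(text="testing").event(),
          TranscriptStop().event()]
for e in events:
    print(e)
```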

@synesthesiam synesthesiam added the enhancement New feature or request label Jan 18, 2024
@synesthesiam synesthesiam self-assigned this Jan 18, 2024
@sdetweil

Transcript is at the end;
Transcribe is at the start.

I think another param on Transcribe would indicate that the client is enabled for interim results.

TranscriptChunk implies the client is processing the chunks somehow,
but it's streaming from the mic non-stop.

it's unlikely that every client would change.

the current whisper sends the results on AudioStop, not Transcript, anyhow.

@sdetweil

but that doesn't tell the client whether the server will send interim results. currently, Transcribe doesn't have a response.

@sdetweil

sdetweil commented Jan 21, 2024

Maybe we could use the Describe/Info response to indicate whether the asr supports intermediate responses.

then I suppose a new TranscriptChunk event out from the asr would inform the client
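That Describe/Info capability check might look roughly like this. The `supports_partial_transcripts` field is entirely hypothetical (the real Info response has no such flag yet), and the surrounding JSON shape is a simplified assumption.

```python
import json

# Hypothetical Info response: the asr service advertises partial-
# transcript support via a made-up capability flag.
info_response = {
    "type": "info",
    "data": {
        "asr": [{
            "name": "example-stt",
            "supports_partial_transcripts": True,  # hypothetical flag
        }],
    },
}

def client_wants_partials(info: dict) -> bool:
    # The client checks the flag before asking for interim results,
    # so it doesn't request something the server can't do.
    programs = info.get("data", {}).get("asr", [])
    return any(p.get("supports_partial_transcripts") for p in programs)

print(json.dumps(info_response))
print(client_wants_partials(info_response))
```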

@sdetweil

sdetweil commented Nov 26, 2024

I did it with a new param on Transcript...
#33

that doesn't help at the start, though: the sender can't know whether the event receiver can handle partials, or if it's wasted energy.
so Transcribe needs a param to allow requesting partials.. even without knowing IF the stt can do that (which is back on Describe)

I'll add that to #33 for test.
adding a sendPartials:bool (default False) property to Transcribe supports this if the target service can do it and it is enabled; it's ignored if not.

one has to be sure the receiver can recover from not receiving the 'end' event (AudioStop in asr), as it will have sent the final Transcript unsolicited.

but streaming text to TTS will take a couple of changes. Synthesize will need some id/timestamp to sync with the others, and, if not another event, then a continued:bool (default false) to indicate, with its id, that it is more text.
both optional

so: repeat Synthesize, id=same, continued=true
until the last block, id=same, continued=false
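The proposed streaming-Synthesize flow can be sketched as below. Both `id` and `continued` are the suggested optional additions, not existing protocol fields, and the JSON shape is an assumption: every block of one utterance shares an id, with continued=true on all but the last.

```python
import json

# Hypothetical streaming Synthesize event: id groups the blocks of one
# utterance, continued=False marks the final block (both fields are the
# proposed additions, not part of the protocol yet).
def synthesize_event(text: str, id: str, continued: bool) -> str:
    return json.dumps({
        "type": "synthesize",
        "data": {"text": text, "id": id, "continued": continued},
    })

blocks = ["The quick brown fox ", "jumps over ", "the lazy dog."]
events = [
    # continued=True for every block except the last one
    synthesize_event(text, id="utt-1", continued=(i < len(blocks) - 1))
    for i, text in enumerate(blocks)
]
for e in events:
    print(e)
```

The TTS side would start synthesizing as soon as the first event arrives and treat continued=False (or, per the recovery note above, a timeout) as end of utterance.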
