whisper : mark speakers/voices (diarization) #64

abelbabel · 2022-10-18T09:33:00Z

Hi,

I'm not very familiar with the details of whisper or whisper.cpp, and I don't know if it is even possible with the current foundation, but it would be nice if speakers, or at least speaker/voice changes, could be marked.

This would be very handy when processing interviews, radio/tv shows, films, etc.

Kind regards,
abelbabel

ArtyomZemlyak · 2022-10-19T11:56:27Z

I think this is a hard task to do well.
I recommend using a separate model for it. From my research in this field, there is no very good open-source solution for this yet.
But you can check out pyannote. Some people have already combined it with whisper:
https://github.com/Majdoddin/nlp

abelbabel · 2022-10-19T12:29:13Z

yeah, also saw this

openai/whisper#264

Seems as if they do it with two runs: one for the spoken text, one for the speakers and then merging the results.
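
A minimal sketch of that merge step, assuming each run yields plain time intervals: transcript segments with text from the first run, speaker turns with a label from the second. The `Segment` and `Turn` types and the `assign_speakers` helper are hypothetical names for illustration, not part of whisper or pyannote. Each transcript segment gets the speaker whose turns overlap it the longest.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


@dataclass
class Turn:
    start: float  # seconds
    end: float    # seconds
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="?"):
    """Pair each transcript segment with the speaker whose turns overlap it most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > 0.0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + ov
        # Segments that no turn overlaps keep the "unknown" label.
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg))
    return labeled


if __name__ == "__main__":
    # Made-up data standing in for the output of the two runs.
    segments = [Segment(0.0, 4.0, "Hello there."), Segment(4.0, 8.0, "Hi, how are you?")]
    turns = [Turn(0.0, 3.5, "SPEAKER_00"), Turn(3.5, 8.0, "SPEAKER_01")]
    for speaker, seg in assign_speakers(segments, turns):
        print(f"[{seg.start:.1f} -> {seg.end:.1f}] {speaker}: {seg.text}")
```

A segment that spans a speaker change still gets a single label here, which is one reason the two-run approach can mislabel text around turn boundaries.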

jaybinks · 2022-11-05T11:08:05Z

Personally, I'd be more than happy for whisper to just do speaker detection based on left & right channels on a stereo audio file. But I can achieve this by just running it twice.

ggerganov · 2022-11-05T20:50:12Z

@jaybinks
This can be added very easily as a built-in option.
A naive algorithm would be for each transcribed segment to measure the signal energy during the time interval for that segment in the 2 channels and predict the speaker based on which one is bigger.
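
A rough Python sketch of that naive idea, not the whisper.cpp implementation: it assumes the two channels are already decoded into equal-rate sample sequences and that segment times are given in seconds. The function names are hypothetical.

```python
def channel_energy(samples, start_idx, end_idx):
    """Sum of squared samples over [start_idx, end_idx)."""
    return sum(s * s for s in samples[start_idx:end_idx])


def guess_speakers(left, right, sample_rate, segments):
    """For each (t0, t1) segment in seconds, return 0 if the left channel
    carries more energy, 1 if the right one does, or None on a tie."""
    labels = []
    for t0, t1 in segments:
        i0 = max(0, int(t0 * sample_rate))
        i1 = min(len(left), len(right), int(t1 * sample_rate))
        e_left = channel_energy(left, i0, i1)
        e_right = channel_energy(right, i0, i1)
        if e_left > e_right:
            labels.append(0)
        elif e_right > e_left:
            labels.append(1)
        else:
            labels.append(None)  # equal energy, or the segment lies outside the audio
    return labels


if __name__ == "__main__":
    rate = 8  # toy sample rate to keep the example small
    left = [0.5] * rate + [0.0] * rate   # speaker 0 talks during the first second
    right = [0.0] * rate + [0.5] * rate  # speaker 1 talks during the second second
    print(guess_speakers(left, right, rate, [(0.0, 1.0), (1.0, 2.0)]))
```

In practice a margin between the two energies (rather than a strict comparison) would help avoid flipping labels on near-silent segments; that margin is a design choice left out of this sketch.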

R4ZZ3 · 2022-11-23T11:51:43Z

One option would be to use pyannote.audio to diarize first --> then run whisper on each recognized section @abelbabel

ggerganov · 2022-11-25T20:10:57Z

@jaybinks
Added support for stereo-channel diarization - add the --diarize argument to main.
Not sure if it works, because I don't have any data to test with

abelbabel · 2022-11-26T23:51:05Z

> Personally, I'd be more than happy for whisper to just do speaker detection based on left & right channels on a stereo audio file. But I can achieve this by just running it twice.

Does this approach assume that you have only two speakers, each well separated on a single channel? From my point of view, this is a special case that only applies to recordings made in an audio studio. Or am I wrong?

jaybinks · 2022-11-26T23:55:50Z

This absolutely is a special case, but it's also simple to implement and allows the problem to be broken up. I'm lucky that in my scenario, I have a separate mic per speaker in the conversation so it's perfectly isolated.

savchenko · 2022-12-05T07:05:48Z

I've done some limited testing and was able to achieve reasonable split via pyannote.
Bolting it all together is a different story though.

chris-english · 2022-12-07T00:58:17Z

Interestingly, in a mono-channel recording with two speakers, where the first speaker says three words and the second speaker repeats those three words, the transcript result is just three words, stretched over the time of both speakers, as though a kind of DTW were in operation. Sigh, WAV is an unsupported file type, so mp4.
https://user-images.githubusercontent.com/2199766/206061513-9afff328-ef22-40a8-9d80-727e65cf6dbc.mp4

WEBVTT

00:00:00.000 --> 00:00:04.000
No ifs ands or

00:00:04.000 --> 00:00:08.000
buts.

The above doesn't use --diarize, of course.

ggerganov · 2022-12-10T10:15:41Z

@chris-english
I tried running the original PyTorch implementation with and without beam search and sometimes it gets the second phrase, but sometimes it does not, so I think it is a limitation of the model (or the decoding strategy) and not whisper.cpp:

Results with OpenAI Whisper

 12:04:18  $  time whisper --model base.en --best_of None --beam_size None ~/Downloads/repit_12.wav 
[00:00.000 --> 00:08.000]  No ifs ands or buts.

real	0m1.713s
user	0m4.271s
sys	0m0.527s

 12:04:23  $  time whisper --model base.en ~/Downloads/repit_12.wav 
[00:00.000 --> 00:05.000]  No ifs ands or buts.
[00:05.000 --> 00:07.000]  No ifs ands or buts.
[00:07.000 --> 00:34.000]  Okay.

real	0m3.834s
user	0m8.992s
sys	0m3.402s

 12:04:32  $  time whisper --model medium.en --best_of None --beam_size None ~/Downloads/repit_12.wav 
[00:00.000 --> 00:08.000]  No ifs, ands or buts.

real	0m8.247s
user	0m15.943s
sys	0m2.499s

 12:04:56  $  time whisper --model medium.en --beam_size None ~/Downloads/repit_12.wav 
[00:00.000 --> 00:08.000]  No ifs, ands or buts.

real	0m8.280s
user	0m14.941s
sys	0m3.509s

 12:05:17  $  time whisper --model medium.en ~/Downloads/repit_12.wav 
[00:00.000 --> 00:08.000]  No ifs, ands or buts.

real	0m18.790s
user	0m44.693s
sys	0m16.823s

 12:05:39  $  time whisper --model large ~/Downloads/repit_12.wav 
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:08.000]  No ifs, ands or buts.

jaybinks · 2022-12-13T10:15:19Z

I'm so sorry this took ages for me to test for you... but the detection seems to work PERFECTLY! Sorry, I can't comment on the output file formats for multi-speaker (srt, vtt, etc.) as I don't know these file formats. I'm assuming that the speaker is available in the segment callback?

ggerganov · 2022-12-13T21:23:39Z

Great to hear! Btw, a failure case has been identified earlier when multiple speakers end up in the same segment: #216 (comment)

Overall, this is a pretty basic approach and probably not worth investing too much time in it.
I have some ideas for a more general speaker detection approach at the audio embedding level, but not sure if I'll get to that anytime soon. Will see

abelbabel · 2022-12-14T10:46:00Z

> I've done some limited testing and was able to achieve reasonable split via pyannote. Bolting it all together is a different story though.

@savchenko Could you give a small how-to on how you used pyannote? By the way: does pyannote require a GPU, or can it run on CPU only, like whisper.cpp?

SageRalph · 2022-12-14T10:54:18Z

In my testing pyannote.audio is extremely slow on CPU. Very interested if anyone finds a way to make it work.

savchenko · 2022-12-14T11:27:18Z

@abelbabel , https://gist.github.com/savchenko/f009a01bba39e8cd5c7f53267071130a

aldo-roman · 2023-03-29T11:58:40Z

@ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this.

Is there a way to show the speaker information in the JSON format?

SpusellaLo · 2023-05-25T11:57:59Z

I am not into technical specifics, just a user of an AI transcription tool that uses this library. For me it would be perfect if the system could detect different speakers and just label the lines where a new speaker starts, similar to the time stamps. Fingers crossed that it will work sometime soon :-)

akashmjn · 2023-05-27T14:23:50Z

Hi @ggerganov (and other maintainers of this awesome project!) - you might be interested in an early prototype that covers @SpusellaLo's use case over at https://github.com/akashmjn/tinydiarize

This was designed keeping in mind ease of integration into whisper.cpp as the model structure is exactly the same, inference requires no extra dependencies (beyond the original repo), and it has marginal extra runtime cost.

It can be run as whisper --model small.en-tdrz AUDIO, the only change is the small.en-tdrz model instead of small.en.

Let me know what you think!

Note that this is an early prototype, so while it has quite decent quality, there are still some rough edges. However it should be functionally complete enough to start testing an integration.

pratikmohanty · 2023-06-05T15:42:30Z

@akashmjn Great work!! I converted the small.en-tdrz.pt to ggml using the whisper.cpp python script. I used the newly generated ggml model with whisper.cpp using the -m option but it doesn't seem to work. Maybe there is something else that I am missing besides converting it to ggml?

akashmjn · 2023-06-08T23:36:20Z

Thanks for the effort @pratikmohanty. The small.en-tdrz checkpoint has the same structure, so it should convert and decode as normal.

However, to surface <|speakerturn|> tokens, edits are required to the inference code so that they are decoded and rendered appropriately.

Here's a high-level implementation plan:

1. Configurable remap of the unused vocab.solm token (that has been repurposed for speaker turns)

   whisper.cpp/whisper.cpp, line 382 in 57543c1:

       id token_solm = 50361; // ??

2. Update all places where this token is suppressed, and add another rule to timestamp logit filtering (see "Force decode timestamp after speaker turn", akashmjn/tinydiarize#11)

   whisper.cpp/whisper.cpp, line 3548 in 57543c1:

       logits[vocab.token_solm] = -INFINITY;

3. Update rendering of token ids to text as appropriate

   whisper.cpp/whisper.cpp, lines 4539 to 4542 in 57543c1:

       if (params.print_special == false && tokens_cur[i].id >= whisper_token_eot(ctx)) {
       } else {
           text += whisper_token_to_str(ctx, tokens_cur[i].id);
       }
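
To make the plan concrete, here is a toy Python sketch of its last two steps: leaving the speaker-turn token out of the suppression list, and rendering it as a visible marker while other special tokens stay hidden. This is an illustration under assumptions, not the whisper.cpp code: the `suppress` and `render` helpers, the tiny vocabulary, the token ids used in the demo and the " [SPEAKER_TURN]" marker text are all made up for the example (the thread quotes 50361 as the real id of the repurposed token).

```python
import math


def suppress(logits, token_ids):
    """Return a copy of logits with the given token ids made unselectable."""
    out = list(logits)
    for tid in token_ids:
        out[tid] = -math.inf
    return out


def render(token_ids, id_to_text, eot_id, speaker_turn_id=None, marker=" [SPEAKER_TURN]"):
    """Join token texts, hiding special tokens (id >= eot_id) except the
    speaker-turn token, which is shown as a marker when an id is given."""
    parts = []
    for tid in token_ids:
        if speaker_turn_id is not None and tid == speaker_turn_id:
            parts.append(marker)
        elif tid >= eot_id:
            continue  # other special tokens stay hidden
        else:
            parts.append(id_to_text[tid])
    return "".join(parts)


if __name__ == "__main__":
    # Toy vocabulary: ids 0-1 are text, 10 is end-of-text, 12 plays the speaker-turn role.
    vocab = {0: " Hello", 1: " world"}
    decoded = [0, 12, 1, 10]
    print(render(decoded, vocab, eot_id=10))                      # speaker turns hidden
    print(render(decoded, vocab, eot_id=10, speaker_turn_id=12))  # speaker turns shown

    # With speaker turns enabled, the turn token is simply not passed to suppress().
    logits = [0.1, 0.2, 0.3]
    print(suppress(logits, [1]))
```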

I'm wrapping up some things on my original repo after which I'll have a draft PR open shortly.

In the meantime @ggerganov - how does this sound? Feel free to add any other code pointers in case there's something I've missed!

jordibruin · 2023-06-15T21:29:16Z

@akashmjn that looks amazing! Can't wait to see how this performs!

akashmjn · 2023-06-20T18:35:00Z

For anyone keen to give it a spin, I have an early hack over at https://github.com/akashmjn/whisper.cpp/tree/tdrz-hack-1

make
./models/download-ggml-model.sh small.en-tdrz

make samples
./main -m models/ggml-small.en-tdrz.bin -f samples/a13.wav

After running the above, you should see this:

(tried to pick a sample keeping with the historical vibe of the others 😉 )

Will open a PR after some cleanup. In the meantime if you have any suggestions - feel free to drop comments directly on the branch!

ggerganov · 2023-06-25T12:07:31Z

Awesome stuff! Looked at the branch - seems super clean

crohr · 2023-06-27T14:57:34Z

> @ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this.
>
> Is there a way to show the speaker information in the JSON format?

+1, it would be great if the speaker details would be present in the JSON output. Currently it's hard to make use of them.

akashmjn · 2023-06-27T16:12:25Z

> > @ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this.
> > Is there a way to show the speaker information in the JSON format?
>
> +1, it would be great if the speaker details would be present in the JSON output. Currently it's hard to make use of them.

I assume you are referring to the previous comment pertaining to the --diarize flag, which currently preserves speaker/channel tags when processing a stereo audio file? If so, I believe it was fixed recently in #1031.

For tinydiarize (which handles a mono audio file) I'm implementing something similar so speaker turns are marked in the output file. I'm adding a field to each JSON segment as below.

Example

		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:03,820"
			},
			"offsets": {
				"from": 0,
				"to": 3820
			},
			"text": " Then these neural nets take on pretty surprising magical",
			"speaker_turn_next": true
		},

For the rest of the output types (txt/vtt/srt/lrc/wts/csv) - it will only be present in the text transcription as you saw in the apollo example above. Hope that works.
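
As an illustration of how a consumer might use that field, here is a small Python sketch that groups consecutive segments into speaker turns. It assumes the segments sit in a top-level "transcription" array of the JSON file; only the first segment's text comes from the example above, and the other segments and the `group_turns` helper are hypothetical.

```python
import json

# Stand-in for the contents of a JSON output file; the top-level key is an assumption.
SAMPLE = """
{
  "transcription": [
    {"text": " Then these neural nets take on pretty surprising magical", "speaker_turn_next": true},
    {"text": " (hypothetical reply from a second speaker)", "speaker_turn_next": false},
    {"text": " (hypothetical continuation)", "speaker_turn_next": false}
  ]
}
"""


def group_turns(segments):
    """Join consecutive segment texts; a segment with speaker_turn_next set
    closes the current turn, so the next segment starts a new one."""
    turns, current = [], []
    for seg in segments:
        current.append(seg["text"].strip())
        if seg.get("speaker_turn_next", False):
            turns.append(" ".join(current))
            current = []
    if current:
        turns.append(" ".join(current))
    return turns


if __name__ == "__main__":
    data = json.loads(SAMPLE)
    for i, turn in enumerate(group_turns(data["transcription"])):
        print(f"turn {i}: {turn}")
```

Note that this only separates turns; it does not say which speaker each turn belongs to.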

crohr · 2023-06-27T16:39:25Z

@akashmjn Yes indeed, thanks for the pointer!

akashmjn · 2023-06-27T18:22:02Z

> Awesome stuff! Looked at the branch - seems super clean

@ggerganov - just opened an initial PR at #1058. Need some comments on how best to expose / integrate this.

bachittle · 2023-09-19T19:15:37Z

should this issue be closed now?

carljmosca · 2023-09-19T20:16:30Z

Are there plans to include speaker number instead of "speaker turn"? One use case could be audio files with more than two speakers.

bachittle · 2023-09-19T20:51:31Z

https://github.com/akashmjn/tinydiarize#gotchas indicates that tinydiarize does not support speaker clustering, which is what you are referring to. A different diarization implementation would be needed to solve that problem, or to wait for this feature to be added to tinydiarize.

carljmosca · 2023-09-19T22:05:08Z

I noticed that but I believe I also saw speaker followed by a number in the docs. Thank you

bachittle · 2023-09-20T14:47:32Z

There are two strategies for diarization that are implemented so far. One is stereo diarization, which allows for speaker numbers: #1031. You enable that with --diarize. It requires stereo audio because it essentially determines the location of the speakers' voices.

Tiny diarize is a different approach, and is enabled with -tdrz. It allows for mono audio, because it uses a different strategy of fine-tuning the whisper model to determine speakers by their voice timbre, not just location.

Both strategies have their flaws and have different purposes, but are available in the master branch.

carljmosca · 2023-09-20T14:55:52Z

Yes, @bachittle I get that tinydiarize is more recently added and different from separated audio tracks. I was referring to this when I made the comment about the speaker identification. I do see where it may be added later as you previously stated. I probably should have asked this in the tinydiarize project also. I appreciate your time and explanations.

wzxu · 2023-12-02T04:32:06Z

Hi. I saw earlier discussions mentioning pyannote.audio, but my understanding is that this is not integrated, right?
I tried insanely-fast-whisper on a short YouTube clip in Chinese and it works quite well (obviously not perfect; also I'm on a Mac but it ran pretty fast so I'm not sure if it's CPU or mps), but I currently have no way to do so directly with whisper.cpp.

--diarize: depends on stereo channels
--tinydiarize: only works with English

So I suppose this ticket could remain open, since there's still room to improve for the multilingual use case?

bachittle · 2023-12-04T18:39:40Z

@wzxu yes, insanely-fast-whisper uses pyannote.audio, as do lots of other libraries for whisper diarization, like WhisperX. The ticket can remain open until we get quality as good as pyannote.audio for the multilingual use case, or we can make that a separate issue.

Guthman · 2023-12-18T19:29:18Z

Thanks for the great work on this. Is it straightforward to use tinydiarize with the larger models, not just the tiny one?

Ace-myu · 2024-01-10T15:04:04Z

> @jaybinks This can be added very easily as a built-in option. A naive algorithm would be for each transcribed segment to measure the signal energy during the time interval for that segment in the 2 channels and predict the speaker based on which one is bigger.

Is there a python version of this?

thewh1teagle · 2024-05-29T22:13:48Z

Looking at github.com/akashmjn/tinydiarize
Looks like the Python version supports speaker labeling, not just speaker turns.
Any chance we can get speaker labeling in whisper.cpp too?

clort81 · 2024-06-02T04:05:07Z

It would also be of some help if the diarization info appeared in the subtitle output when -osrt is given. Currently I have to parse the stdout data.
And 'speaker change' is not diarization since the program is not assigning text to individual speakers.
Are there any true diarization options that don't require (shudder) python-AI?
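
As a stopgap for the stdout parsing mentioned above, here is a small Python sketch that turns captured stdout into SRT cues. It rests on assumptions about the line format: each transcript line looks like "[HH:MM:SS.mmm --> HH:MM:SS.mmm]  text", and any speaker tag (written here as "(speaker 0)") is simply part of the text. The sample lines and the `stdout_to_srt` helper are made up for the example.

```python
import re

# Assumed shape of a transcript line on stdout.
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\.(\d{3}) --> (\d{2}:\d{2}:\d{2})\.(\d{3})\]\s*(.*)$"
)


def stdout_to_srt(lines):
    """Convert '[t0 --> t1]  text' lines to SRT, keeping any speaker tag in the text."""
    cues = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip log lines and blanks
        t0, ms0, t1, ms1, text = m.groups()
        # SRT uses a comma before the milliseconds and numbers cues from 1.
        cues.append(f"{len(cues) + 1}\n{t0},{ms0} --> {t1},{ms1}\n{text.strip()}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    captured = [
        "some log line that is not a transcript segment",
        "[00:00:00.000 --> 00:00:04.000]  (speaker 0) Hello there.",
        "[00:00:04.000 --> 00:00:08.000]  (speaker 1) Hi, how are you?",
    ]
    print(stdout_to_srt(captured))
```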

dweidenfeld · 2024-11-23T10:03:04Z

Any updates on this? I know that e.g. Krisp.ai already has it integrated.

ggerganov added the enhancement New feature or request label Oct 18, 2022

ggerganov added the low priority label Nov 1, 2022

ggerganov added the good first issue Good for newcomers label Nov 7, 2022

ggerganov changed the title ~~[Feature] mark speakers/voices~~ [Feature] mark speakers/voices (diarization) Nov 22, 2022

ggerganov mentioned this issue Nov 22, 2022

Any way to add speaker diarization feature to this code base? #169

Closed

ggerganov added a commit that referenced this issue Nov 25, 2022

main : add stereo-channel-based diarization (#64)

0f619b5

Not tested - I don't have stereo dialog audio

ggerganov removed the good first issue Good for newcomers label Nov 27, 2022

This was referenced Dec 1, 2022

Label different speakers #202

Closed

--diarize flag is unreliable #216

Closed

FlakM mentioned this issue Feb 19, 2023

Initial transcription support JupiterBroadcasting/jupiterbroadcasting.com#494

Closed

anandijain pushed a commit to anandijain/whisper.cpp that referenced this issue Apr 28, 2023

main : add stereo-channel-based diarization (ggerganov#64)

39bc66a

Not tested - I don't have stereo dialog audio

ggerganov changed the title ~~[Feature] mark speakers/voices (diarization)~~ whisper : mark speakers/voices (diarization) Jun 25, 2023

ggerganov added this to ggml : roadmap Jun 25, 2023

ggerganov assigned akashmjn Jun 25, 2023

ggerganov moved this to In Progress in ggml : roadmap Jun 25, 2023

akashmjn mentioned this issue Jun 27, 2023

whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058

Merged

7 tasks

ggerganov moved this from In Progress to Done in ggml : roadmap Jul 23, 2023

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this issue Oct 24, 2023

main : add stereo-channel-based diarization (ggerganov#64)

fb163e6

Not tested - I don't have stereo dialog audio

kultivator-consulting pushed a commit to KultivatorConsulting/whisper.cpp that referenced this issue Feb 12, 2024

Merge pull request ggerganov#64 from chrisrude/master

f1436c0

don't remove bindings/javascript/package.json during build
