AI-174: Evaluate multilingual support for Whisper Small variants#109

Open
ibhoomi16 wants to merge 1 commit into openMF:dev from ibhoomi16:feature/whisper-benchmark

Conversation

@ibhoomi16

This PR introduces the benchmarking setup to evaluate the performance of openai/whisper-small models across multiple languages.

  • Built an automated evaluation pipeline to test openai/whisper-small and whisper-small.en base models.
  • Tested the models across 5 target languages/demographics: English (en), Hindi (hi), Spanish (es), French (fr), and Portuguese (pt).
  • Engineered a metric tracking system within the benchmarking_whisper/ directory that successfully calculates and logs:
    • Word Error Rate (WER) (for measuring transcription accuracy).
    • Latency / TTFA (for measuring STT speed and responsiveness).
    • Memory Footprint (for checking on-device low-end mobile constraints).
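
As a rough illustration of how the latency and memory metrics listed above could be collected, here is a minimal sketch. It is not the code in benchmarking_whisper/: `measure_call`, `rss_mb`, and `fake_transcribe` are hypothetical names, the audio filename is made up, and the memory reading assumes psutil is installed (the sketch returns `None` for memory otherwise). RSS measured this way describes the Python process on the benchmarking machine only.

```python
import os
import time

try:
    import psutil  # third-party; used only for the resident-set-size reading
except ImportError:  # keep the sketch runnable without psutil
    psutil = None


def rss_mb():
    """Return the current process RSS in MiB, or None if psutil is missing."""
    if psutil is None:
        return None
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)


def measure_call(fn, *args, **kwargs):
    """Run fn once and report wall-clock latency and the change in RSS."""
    rss_before = rss_mb()
    start = time.perf_counter()
    output = fn(*args, **kwargs)
    latency_s = time.perf_counter() - start
    rss_after = rss_mb()
    rss_delta = None if rss_before is None else rss_after - rss_before
    return {"output": output, "latency_s": latency_s, "rss_delta_mb": rss_delta}


def fake_transcribe(path):
    """Hypothetical stand-in for a Whisper transcription call."""
    time.sleep(0.01)
    return "check my balance"


result = measure_call(fake_transcribe, "check_my_balance.wav")
print(result)
```

A single call is noisy, so a real run would likely repeat the measurement and report a median. An RSS delta taken around one call also does not include memory already allocated when the model was loaded.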

ibhoomi16 requested a review from a team on March 23, 2026 04:29
ibhoomi16 force-pushed the feature/whisper-benchmark branch from 4dabc36 to 937f32b on March 23, 2026 04:31
@staru09

staru09 commented Mar 25, 2026

Member

The WER looks a little too good. Can you elaborate on which dataset you used and whether any post-processing is involved?

@DavidH-1

Collaborator

CLA Check = Passed

@ibhoomi16

Author

The dataset was created by translating common banking commands into 5 languages and generating their audio using TTS. There's no heavy post-processing: the reference text is taken from the audio filename, underscores are replaced with spaces, the text is converted to lowercase, and the result is compared with Whisper's output using jiwer.
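
The scoring steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `reference_from_filename` and `word_error_rate` are hypothetical helpers, the file path is made up, and the WER here is a plain word-level edit distance written out by hand to keep the example dependency-free, whereas the PR uses jiwer.

```python
from pathlib import Path


def reference_from_filename(audio_path):
    """Derive reference text: drop extension, underscores -> spaces, lowercase."""
    return Path(audio_path).stem.replace("_", " ").lower()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical file path, used only for illustration.
reference = reference_from_filename("audio/en/Check_My_Balance.wav")
print(reference)
print(word_error_rate(reference, "check my balance"))
print(word_error_rate(reference, "Check my balance."))
```

In this sketch the exact-match hypothesis scores 0.0, while "Check my balance." scores 2/3 because the capital letter and the trailing period each count as a substitution. Whether the PR also normalizes Whisper's output before comparison is not stated in this thread.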

return []

results = []
pid = psutil.Process(os.getpid())

Member

Since our final goal is to deploy the model in the Mifos mobile app, we should consider the memory constraints of a mobile device rather than those of the machine we are currently using, as they differ significantly.

Member

as well as the latency

Member

@itsPronay as of now we haven't decided whether we want to host a model on the client side or not.

@itsPronay itsPronay Apr 2, 2026

Member

@staru09, the ticket mentions that for local models we should measure metrics like memory usage and latency. Could you please clarify which device we should base these measurements on?

If the intention is to run these models locally on a mobile device, the measurements would differ significantly compared to running them on a server (self-hosted). The approach to evaluating memory usage and latency would vary depending on the deployment environment.

So, when we are talking about 'Benchmark local-models (memory and latency)', what are we evaluating against?

  1. A server?
  2. A physical mobile device?
  3. Or is it not decided yet?
