AI-172: Foundation for Deepgram multilingual evaluation by aksh08022006 · Pull Request #99 · openMF/community-ai

aksh08022006 · 2026-03-10T18:52:47Z

Jira: https://mifosforge.jira.com/browse/AI-172
Adds Deepgram multilingual STT evaluation framework:

providers/deepgram.py — Deepgram API wrapper for async transcription
metrics/accuracy.py — WER/CER calculation
dataset/ai172_languages.py — Language targets (Hindi, French, Portuguese, English)
dataset/create_hf_dataset.py — HuggingFace dataset builder with ground truth transcriptions + upload script
generate_report.py — Per-language metrics report (WER, CER, MOS)

Languages: English, Hindi, French, Portuguese

Testing Evidence

All core functionality has been tested locally and is fully operational:

Test Results Summary

Component	Status	Details
Metrics (WER/CER)	✅ Working	Perfect match WER=0.0, error case WER=0.3333
Dataset Module	✅ Working	6 languages configured, 5 financial domains
HuggingFace Builder	✅ Working	20 multilingual samples ready for benchmarking
Report Generator	✅ Working	JSON report structure verified
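
For readers unfamiliar with the metric, the two WER figures in the table can be reproduced with a minimal sketch like the one below. It is not the code in metrics/accuracy.py: the function name `word_error_rate` and the example sentences are assumptions made for illustration, and normalization here is limited to lowercasing and whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, chosen only to mirror the two cases in the table.
print(word_error_rate("the cat sat", "the cat sat"))            # perfect match
print(round(word_error_rate("the cat sat", "the cat sit"), 4))  # one substitution in three words
```

One substituted word in a three-word reference gives 1/3, which rounds to the 0.3333 shown for the error case.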

…king

Add whisper.cpp as git submodule with setup documentation and minimal evaluation runner for AI-167 (multilingual support evaluation).

Changes:
- evaluation/whisper.cpp: C/C++ Whisper for on-device mobile inference
- evaluation/SETUP.md: Build and setup instructions
- evaluation/run_multilingual_eval.py: Thin wrapper to run benchmarks

This enables evaluation of whisper-tiny, whisper-base, and whisper-small across 5 languages (en, hi, es, fr, de) with focus on mobile performance metrics: WER, CER, latency, model size, memory usage.

Addresses: AI-167

Add docstrings and section headers explaining:
- Why whisper.cpp was chosen over Python/HuggingFace (mobile focus)
- Key design decisions (minimal code, validate state, use existing tools)
- Collaborative refinement with Claude Opus 4.5

This makes the contribution more human and transparent about the thought process, not just presenting final code.

…dback

Per Pronay Sarker's guidance, add comprehensive results documentation that includes:
- Jira ticket reference (AI-167)
- Repository link for mentor review and reproduction
- Reproduction steps
- Results template (WER/CER, latency, model size)
- Findings section for analysis
- Next steps for follow-up work

This provides mentors with a single source of truth for:
1. How to access the work
2. How to reproduce it
3. Where findings will be documented

Added a note about using AI model Claude Opus 4.6 for refinement.

Refine script with insights from Claude Opus 4.5 and related benchmarks.

Modified the benchmarking process to iterate over multiple models, running benchmarks for each model individually.

- Add llms.txt for Claude/AI tool compatibility
- Add SKILL.md defining /benchmark-speech and /evaluate-language workflows
- Add Deepgram STT provider wrapper with async streaming
- Add WER/CER metrics calculation for accuracy measurement
- Add language targets (English, Spanish, Hindi, Swahili, French, Portuguese)

Total: 426 lines across 5 focused files

Copilot

Pull request overview

Adds foundational pieces for multilingual speech evaluation, including a Deepgram STT provider wrapper, accuracy metrics, and dataset language definitions, plus supporting documentation and a whisper.cpp evaluation submodule.

Changes:

Added Deepgram async transcription provider scaffold under benchmarking_experiments/.
Implemented basic WER/CER computation utilities and a small language/domain reference dataset.
Added evaluation docs/scripts (whisper.cpp submodule + runner) and AI-facing project documentation (skills + llms.txt).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
llms.txt	Adds high-level project + evaluation documentation for AI tooling.
evaluation/whisper.cpp	Adds whisper.cpp as a git submodule pointer for evaluations.
evaluation/run_multilingual_eval.py	Introduces a runner script intended to orchestrate whisper.cpp benchmarks.
evaluation/SETUP.md	Documents how to initialize/build whisper.cpp and run the evaluation script.
evaluation/RESULTS.md	Adds a results template for AI-167 whisper.cpp multilingual evaluation.
benchmarking_experiments/providers/deepgram.py	Adds an async Deepgram STT provider wrapper for AI-172 experiments.
benchmarking_experiments/metrics/accuracy.py	Adds text normalization + edit-distance-based WER/CER helpers.
benchmarking_experiments/dataset/ai172_languages.py	Defines target languages/domains and sample reference phrases.
SKILL.md	Documents intended “skills”/workflows for benchmarking automation.
PR_AI172_FOUNDATION.md	Adds a PR summary doc and basic import-based “testing” notes.
.gitmodules	Registers whisper.cpp as a submodule.

Copilot · 2026-03-10T18:55:19Z

+    _, insertions, deletions, substitutions = edit_distance(list(ref_chars), list(hyp_chars))
+    cer = (substitutions + deletions + insertions) / len(ref_chars)


This CER implementation includes spaces as characters because normalize_text() returns a space-delimited string. CER is typically computed over characters excluding whitespace (otherwise spacing differences can dominate the score). Consider removing spaces before list(...) (e.g., stripping all whitespace after normalization) so CER better reflects character recognition.
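
The suggestion can be illustrated with a small sketch. This is not the PR's `edit_distance` helper (which returns separate insertion, deletion, and substitution counts); `levenshtein`, `char_error_rate`, and the `strip_whitespace` flag are hypothetical names used here only to show how removing whitespace changes the score.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Total edit distance (insertions + deletions + substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str,
                    strip_whitespace: bool = True) -> float:
    """CER over characters, optionally ignoring all whitespace."""
    if strip_whitespace:
        # str.split() with no arguments splits on any whitespace run,
        # so joining the pieces removes spaces, tabs, and newlines.
        reference = "".join(reference.split())
        hypothesis = "".join(hypothesis.split())
    if not reference:
        raise ValueError("reference must contain at least one character")
    return levenshtein(reference, hypothesis) / len(reference)


# A pure spacing difference: zero error once whitespace is ignored,
# but one deletion out of eleven characters if spaces are counted.
print(char_error_rate("hello world", "helloworld"))
print(char_error_rate("hello world", "helloworld", strip_whitespace=False))
```

Whether spaces should count is a convention choice; the sketch keeps both behaviours behind a flag so the two scores can be compared on the same data.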

- Use relative import in deepgram.py (from .base)
- Use list + join for audio buffer (avoid O(n²) concat)
- Convert Deepgram SDK response to dict before parsing
- Strip whitespace in CER calculation for accurate character comparison
- Pass model and language params to bench.py in eval runner
- Update docstring to match actual implementation behavior
- Fix grammar in RESULTS.md header
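
The "list + join" item refers to a general Python idiom, sketched below under assumptions: the provider's actual buffering code is not shown in this thread, so `collect_audio` and its `chunks` parameter are hypothetical names. Because `bytes` objects are immutable, repeated `buffer += chunk` copies the accumulated data on every iteration, whereas collecting chunks in a list and joining once copies each byte a single time.

```python
from typing import Iterable


def collect_audio(chunks: Iterable[bytes]) -> bytes:
    """Accumulate audio chunks without repeated bytes concatenation."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)  # appending to a list is amortized O(1)
    return b"".join(parts)   # single pass that copies each byte once


# Usage with placeholder data standing in for streamed audio frames.
audio = collect_audio([b"\x00\x01", b"\x02\x03", b"\x04"])
print(len(audio))
```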

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

staru09 · 2026-03-11T04:31:58Z

Remove these files (llms.txt, skills.md, etc.) because we don't want them for now.

Per mentor feedback (Aru Sharma):
- Create benchmarking dataset first (Hindi, French, Portuguese, English)
- Upload to HuggingFace for community reuse
- Output metrics report showing per-language performance

Added:
- dataset/create_hf_dataset.py: HF-compatible dataset with ground truth transcriptions, metadata.csv, dataset card, and upload to HF Hub
- generate_report.py: Runs evaluation and outputs per-language WER/CER/MOS report (JSON + human-readable table)
- requirements.txt: Added deepgram-sdk, huggingface_hub, datasets

staru09: 'Remove files (llms.txt, skills.md) etc cause we don't want these as of now'

aksh08022006 · 2026-03-11T13:44:59Z

Remove these files (llms.txt, skills.md, etc.) because we don't want them for now.

Yeah, sure. Removed llms.txt and skills.md.

aksh08022006 · 2026-03-11T13:50:33Z

@openMF/ai-community-maintainer, please re-review and let me know if any more changes are required.

staru09 · 2026-03-11T14:14:38Z

We don't want to keep the whisper.cpp submodule.
Can we do it without a submodule?

DavidH-1 · 2026-03-11T14:41:34Z

CLA check ok

aksh08022006 · 2026-03-11T17:48:27Z

We don't want to keep the whisper.cpp submodule. Can we do it without a submodule?

Should I (A) make whisper.cpp optional and add a small tools/install_whisper_cpp.sh (recommended, quick), or (B) remove it entirely and use Docker/prebuilt binaries instead?
Which one do you suggest? If we don't want it, I can make a follow-up commit for B.

- All core modules tested and working locally
- Metrics (WER/CER) calculation verified
- Multilingual dataset structure confirmed (6 languages, 20 samples)
- Report generator functional
- Test script and results for future reference

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.

…coverable, remove submodule, fix Deepgram options, make generate_report dry-run friendly, encode README utf-8, align languages

staru09 · 2026-03-17T06:47:33Z

I see a bunch of cache files in this; please remove them.
Clean up any extra files, then I think I'll be able to review better.
Also, is this related to whisper.cpp or Deepgram?

DavidH-1 · 2026-03-17T07:20:11Z

Why did you do a commit for each file delete? It would have been cleaner to do a single commit incorporating them all, and it would have saved all the email notifications 😬

aksh08022006 · 2026-03-17T07:20:31Z

I see a bunch of cache files in this; please remove them. Clean up any extra files, then I think I'll be able to review better. Also, is this related to whisper.cpp or Deepgram?

AI-172 is for Deepgram. There was a branch mix-up between the two tasks: AI-167 was for the Whisper implementation and AI-172 is for Deepgram.
I have removed the cache files.
Please let me know if anything else is required.

aksh08022006 · 2026-03-17T07:27:06Z

Why did you do a commit for each file delete? It would have been cleaner to do a single commit incorporating them all, and it would have saved all the email notifications 😬

Yeah, sorry, I was trying to clean it up file by file. Let me create one commit that covers the whole cleanup.

aksh08022006 · 2026-03-17T07:49:25Z

@staru09 you can review now; you can ignore whisper.cpp.

staru09 · 2026-03-27T13:22:16Z

Why is it called smoke_test?
The rest looks good.


aksh08022006 · 2026-03-28T07:34:15Z

Why is it called smoke_test? The rest looks good.

A smoke test usually means a quick surface-level test before complete benchmarking.
But if you want, I can rename it to something else.

staru09 · 2026-04-02T15:14:38Z

Resolve the merge conflict, then I can merge this.

aksh08022006 · 2026-04-02T18:04:50Z

Resolve the merge conflict, then I can merge this.

Done, the merge conflict is resolved. You can merge now.

aksh08022006 added 9 commits March 5, 2026 12:37

Refine RESULTS.md with AI model details

bd3ceee

Added a note about using AI model Claude Opus 4.6 for refinement.

Refine run_multilingual_eval.py with new insights

4b457f4

Refine script with insights from Claude Opus 4.5 and related benchmarks.

Iterate over models for individual benchmarking

dac919b

Modified the benchmarking process to iterate over multiple models, running benchmarks for each model individually.

Use sys.executable for Python command

1ddcec0

Merge branch 'openMF:dev' into dev

4a9b3e2

aksh08022006 requested review from a team and Copilot March 10, 2026 18:52

Copilot AI reviewed Mar 10, 2026

aksh08022006 requested a review from Copilot March 10, 2026 19:10

Copilot AI reviewed Mar 10, 2026

aksh08022006 added 2 commits March 11, 2026 18:56

Remove llms.txt, SKILL.md, PR_AI172_FOUNDATION.md per reviewer feedback

f87ed30

staru09: 'Remove files (llms.txt, skills.md) etc cause we don't want these as of now'

aksh08022006 added 3 commits March 14, 2026 19:30

Merge branch 'openMF:dev' into feature/ai172-deepgram-multilingual

f431315

docs: add PR description template with testing evidence

e781cf9

aksh08022006 requested a review from Copilot March 14, 2026 14:42

Copilot AI reviewed Mar 14, 2026

ci/docs: address Copilot review suggestions — make smoke test non-dis…

bd7d489

…coverable, remove submodule, fix Deepgram options, make generate_report dry-run friendly, encode README utf-8, align languages

Copilot AI mentioned this pull request Mar 16, 2026

AI-56 Switch Local LLM to Pre-Quantized GPTQ Model Feature/gptq mistral optimization #104

Open

aksh08022006 added 10 commits March 17, 2026 12:35

Delete benchmarking_experiments/__pycache__/generate_report.cpython-3…

bbe0da0

…13.pyc

Delete benchmarking_experiments/dataset/__pycache__/ai172_languages.c…

875fc32

…python-313.pyc

Delete benchmarking_experiments/metrics/__pycache__/__init__.cpython-…

8a7b37f

…313.pyc

Delete benchmarking_experiments/dataset/__pycache__/create_hf_dataset…

874150b

….cpython-313.pyc

Delete benchmarking_experiments/metrics/__pycache__/accuracy.cpython-…

dbf25d9

…313.pyc

Delete UPDATE_PR_DESCRIPTION.md

6b7b165

Delete PR_DESCRIPTION_TEMPLATE.md

11c9686

Delete evaluation/SETUP.md

edbae2f

Delete PR99_TESTING_EVIDENCE.md

c1208be

Delete .gitmodules

527ceb5

staru09 mentioned this pull request Mar 27, 2026

add whisper.cpp integration for AI-167 mobile STT evaluation #90

Open

Resolve merge conflicts: keep class-based deepgram provider and merge…

fbb0563

… all unique requirements

Merge branch 'dev' into feature/ai172-deepgram-multilingual

c0731b8
