AI-172: Foundation for Deepgram multilingual evaluation#99

Open
aksh08022006 wants to merge 28 commits into openMF:dev from aksh08022006:feature/ai172-deepgram-multilingual

@aksh08022006 aksh08022006 commented Mar 10, 2026

Jira: https://mifosforge.jira.com/browse/AI-172
Adds Deepgram multilingual STT evaluation framework:

  • providers/deepgram.py — Deepgram API wrapper for async transcription
  • metrics/accuracy.py — WER/CER calculation
  • dataset/ai172_languages.py — Language targets (Hindi, French, Portuguese, English)
  • dataset/create_hf_dataset.py — HuggingFace dataset builder with ground truth transcriptions + upload script
  • generate_report.py — Per-language metrics report (WER, CER, MOS)

Languages: English, Hindi, French, Portuguese

Testing Evidence

All core functionality has been tested locally and is fully operational:

Test Results Summary

| Component | Status | Details |
| --- | --- | --- |
| Metrics (WER/CER) | ✅ Working | Perfect match WER=0.0, error case WER=0.3333 |
| Dataset Module | ✅ Working | 6 languages configured, 5 financial domains |
| HuggingFace Builder | ✅ Working | 20 multilingual samples ready for benchmarking |
| Report Generator | ✅ Working | JSON report structure verified |
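
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate such as the 0.0 and 0.3333 values above can arise. It is not the code in metrics/accuracy.py: the names `levenshtein` and `wer` and the lowercase-and-split normalization are assumptions made for illustration, and the example phrases used in the note below are invented.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences, using a two-row dynamic program."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    # Hypothetical normalization: lowercase and split on whitespace only.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)
```

With a three-word reference, an exact match gives 0.0 and a single substituted word gives 1/3, which rounds to 0.3333.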

…king

Add whisper.cpp as git submodule with setup documentation and minimal
evaluation runner for AI-167 (multilingual support evaluation).

Changes:
- evaluation/whisper.cpp: C/C++ Whisper for on-device mobile inference
- evaluation/SETUP.md: Build and setup instructions
- evaluation/run_multilingual_eval.py: Thin wrapper to run benchmarks

This enables evaluation of whisper-tiny, whisper-base, and whisper-small
across 5 languages (en, hi, es, fr, de) with focus on mobile performance
metrics: WER, CER, latency, model size, memory usage.

Addresses: AI-167
Add docstrings and section headers explaining:
- Why whisper.cpp was chosen over Python/HuggingFace (mobile focus)
- Key design decisions (minimal code, validate state, use existing tools)
- Collaborative refinement with Claude Opus 4.5

This makes the contribution more human and transparent about the thought
process, not just presenting final code.
…dback

Per Pronay Sarker's guidance, add comprehensive results documentation that includes:
- Jira ticket reference (AI-167)
- Repository link for mentor review and reproduction
- Reproduction steps
- Results template (WER/CER, latency, model size)
- Findings section for analysis
- Next steps for follow-up work

This provides mentors with a single source of truth for:
1. How to access the work
2. How to reproduce it
3. Where findings will be documented
Added a note about using AI model Claude Opus 4.6 for refinement.
Refine script with insights from Claude Opus 4.5 and related benchmarks.
Modified the benchmarking process to iterate over multiple models, running benchmarks for each model individually.
- Add llms.txt for Claude/AI tool compatibility
- Add SKILL.md defining /benchmark-speech and /evaluate-language workflows
- Add Deepgram STT provider wrapper with async streaming
- Add WER/CER metrics calculation for accuracy measurement
- Add language targets (English, Spanish, Hindi, Swahili, French, Portuguese)

Total: 426 lines across 5 focused files
@aksh08022006 aksh08022006 requested review from a team and Copilot March 10, 2026 18:52

Copilot AI left a comment

Pull request overview

Adds foundational pieces for multilingual speech evaluation, including a Deepgram STT provider wrapper, accuracy metrics, and dataset language definitions, plus supporting documentation and a whisper.cpp evaluation submodule.

Changes:

  • Added Deepgram async transcription provider scaffold under benchmarking_experiments/.
  • Implemented basic WER/CER computation utilities and a small language/domain reference dataset.
  • Added evaluation docs/scripts (whisper.cpp submodule + runner) and AI-facing project documentation (skills + llms.txt).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| llms.txt | Adds high-level project + evaluation documentation for AI tooling. |
| evaluation/whisper.cpp | Adds whisper.cpp as a git submodule pointer for evaluations. |
| evaluation/run_multilingual_eval.py | Introduces a runner script intended to orchestrate whisper.cpp benchmarks. |
| evaluation/SETUP.md | Documents how to initialize/build whisper.cpp and run the evaluation script. |
| evaluation/RESULTS.md | Adds a results template for AI-167 whisper.cpp multilingual evaluation. |
| benchmarking_experiments/providers/deepgram.py | Adds an async Deepgram STT provider wrapper for AI-172 experiments. |
| benchmarking_experiments/metrics/accuracy.py | Adds text normalization + edit-distance-based WER/CER helpers. |
| benchmarking_experiments/dataset/ai172_languages.py | Defines target languages/domains and sample reference phrases. |
| SKILL.md | Documents intended “skills”/workflows for benchmarking automation. |
| PR_AI172_FOUNDATION.md | Adds a PR summary doc and basic import-based “testing” notes. |
| .gitmodules | Registers whisper.cpp as a submodule. |

Comment thread benchmarking_experiments/providers/deepgram.py Outdated
Comment thread benchmarking_experiments/providers/deepgram.py Outdated
Comment thread benchmarking_experiments/providers/deepgram.py
Comment thread benchmarking_experiments/providers/deepgram.py
Comment thread evaluation/run_multilingual_eval.py
Comment thread evaluation/run_multilingual_eval.py Outdated
Comment thread evaluation/run_multilingual_eval.py Outdated
Comment thread benchmarking_experiments/metrics/accuracy.py Outdated
Comment on lines +85 to +86
_, insertions, deletions, substitutions = edit_distance(list(ref_chars), list(hyp_chars))
cer = (substitutions + deletions + insertions) / len(ref_chars)

Copilot AI Mar 10, 2026

This CER implementation includes spaces as characters because normalize_text() returns a space-delimited string. CER is typically computed over characters excluding whitespace (otherwise spacing differences can dominate the score). Consider removing spaces before list(...) (e.g., stripping all whitespace after normalization) so CER better reflects character recognition.
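
The effect described in this comment can be illustrated with a small sketch. It assumes a plain Levenshtein distance and a hypothetical `strip_whitespace` flag; the `edit_distance` in the PR returns a different shape (a tuple with insertion, deletion and substitution counts), so this is not a drop-in replacement for it.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences, using a two-row dynamic program."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis, strip_whitespace=True):
    """Character error rate, optionally ignoring all whitespace."""
    ref, hyp = reference.lower(), hypothesis.lower()
    if strip_whitespace:
        # str.split() with no arguments drops every run of whitespace,
        # so joining the pieces leaves only non-space characters.
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference must contain at least one character")
    return levenshtein(ref, hyp) / len(ref)
```

For the reference "loan balance" and the hypothesis "loanbalance", the whitespace-aware variant counts one deleted character out of twelve, while the stripped variant reports no character errors.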

Comment thread evaluation/RESULTS.md Outdated
- Use relative import in deepgram.py (from .base)
- Use list + join for audio buffer (avoid O(n²) concat)
- Convert Deepgram SDK response to dict before parsing
- Strip whitespace in CER calculation for accurate character comparison
- Pass model and language params to bench.py in eval runner
- Update docstring to match actual implementation behavior
- Fix grammar in RESULTS.md header
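
As an illustration of the "list + join" item above: appending to an immutable `bytes` object with `+=` can copy the whole buffer on every chunk, whereas collecting chunks in a list and joining once copies each byte a single time. The sketch below uses a stand-in async generator; `read_all` and `fake_stream` are hypothetical names and nothing here calls the Deepgram SDK.

```python
import asyncio


async def read_all(stream):
    """Collect byte chunks from an async iterator and join them once."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)   # amortized O(1) per chunk
    return b"".join(parts)    # single pass over all collected bytes


async def fake_stream():
    """Stand-in for an audio source; yields a few arbitrary byte chunks."""
    for chunk in (b"RIFF", b"\x00\x01", b"\x02\x03"):
        yield chunk
```

Running `asyncio.run(read_all(fake_stream()))` returns the concatenated bytes in arrival order.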
@aksh08022006 aksh08022006 requested a review from Copilot March 10, 2026 19:10

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@staru09

staru09 commented Mar 11, 2026

Member

Remove files (llms.txt, skills.md) etc cause we don't want these as of now.

Per mentor feedback (Aru Sharma):
- Create benchmarking dataset first (Hindi, French, Portuguese, English)
- Upload to HuggingFace for community reuse
- Output metrics report showing per-language performance

Added:
- dataset/create_hf_dataset.py: HF-compatible dataset with ground truth
  transcriptions, metadata.csv, dataset card, and upload to HF Hub
- generate_report.py: Runs evaluation and outputs per-language
  WER/CER/MOS report (JSON + human-readable table)
- requirements.txt: Added deepgram-sdk, huggingface_hub, datasets
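
A rough sketch of what a per-language report of this kind could look like, assuming each evaluated sample is a dict with hypothetical `language`, `wer` and `cer` keys. The actual structure used by generate_report.py is not shown in this thread, and MOS is left out here because the thread does not say how it is obtained.

```python
import json
from collections import defaultdict
from statistics import mean


def build_report(samples):
    """Average WER and CER per language; returns a JSON-serializable dict."""
    by_language = defaultdict(list)
    for sample in samples:
        by_language[sample["language"]].append(sample)
    report = {}
    for language, rows in sorted(by_language.items()):
        report[language] = {
            "samples": len(rows),
            "wer": round(mean(row["wer"] for row in rows), 4),
            "cer": round(mean(row["cer"] for row in rows), 4),
        }
    return report


def format_table(report):
    """Render the report as a fixed-width, human-readable table."""
    lines = [f"{'Language':<12}{'Samples':>8}{'WER':>8}{'CER':>8}"]
    for language, metrics in report.items():
        lines.append(
            f"{language:<12}{metrics['samples']:>8}"
            f"{metrics['wer']:>8.4f}{metrics['cer']:>8.4f}"
        )
    return "\n".join(lines)


def to_json(report):
    """Serialize the report, keeping non-ASCII language names readable."""
    return json.dumps(report, indent=2, ensure_ascii=False)
```

Keeping the aggregation separate from the two renderers means the JSON file and the printed table are always produced from the same numbers.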
staru09: 'Remove files (llms.txt, skills.md) etc cause we don't want these as of now'
@aksh08022006

Author

Remove files (llms.txt, skills.md) etc cause we don't want these as of now.

Yeah, sure. Removed llms.txt and skills.md.

@aksh08022006

Author

@openMF/ai-community-maintainer, please re-review and let me know if any more changes are required.

@staru09

staru09 commented Mar 11, 2026

Member

we don't want to keep whisper.cpp submodule
can we do it without a submodule ??

@DavidH-1

Collaborator

CLA check ok

@aksh08022006

Author

we don't want to keep whisper.cpp submodule can we do it without a submodule ??

Should I (A) make whisper.cpp optional and add a small tools/install_whisper_cpp.sh (recommended, quick), or (B) remove it entirely and use Docker/prebuilt binaries instead?
Which one do you suggest? If we don't want it, I can make a follow-up commit for B.
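
One way option (A) could work on the Python side, sketched under assumptions: the runner looks for an already-installed whisper.cpp binary instead of relying on a submodule checkout. The environment variable `WHISPER_CPP_BIN` and the default executable name `whisper-cli` are hypothetical choices for this example and are not taken from the PR.

```python
import os
import shutil
from pathlib import Path


def find_whisper_binary(env_var="WHISPER_CPP_BIN", names=("whisper-cli",)):
    """Locate a whisper.cpp executable without requiring a git submodule.

    Lookup order: an explicit path in the environment variable, then PATH.
    Returns a Path, or None when nothing is found on PATH.
    """
    override = os.environ.get(env_var)
    if override:
        path = Path(override)
        if path.is_file() and os.access(path, os.X_OK):
            return path
        # An explicit override that is wrong should fail loudly, not fall back.
        raise FileNotFoundError(
            f"{env_var} is set but does not point to an executable: {override}"
        )
    for name in names:
        found = shutil.which(name)
        if found:
            return Path(found)
    return None
```

A caller can treat a `None` result as "whisper.cpp not installed" and skip those benchmarks, which keeps the Deepgram evaluation usable on its own.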

- All core modules tested and working locally
- Metrics (WER/CER) calculation verified
- Multilingual dataset structure confirmed (6 languages, 20 samples)
- Report generator functional
- Test script and results for future reference
@aksh08022006 aksh08022006 requested a review from Copilot March 14, 2026 14:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.


Comment thread test_pr.py
Comment thread .gitmodules Outdated
Comment thread benchmarking_experiments/providers/deepgram.py
Comment thread benchmarking_experiments/providers/deepgram.py Outdated
Comment thread benchmarking_experiments/generate_report.py
Comment thread benchmarking_experiments/generate_report.py
Comment thread benchmarking_experiments/generate_report.py
Comment thread benchmarking_experiments/dataset/create_hf_dataset.py Outdated
Comment thread benchmarking_experiments/dataset/ai172_languages.py
…coverable, remove submodule, fix Deepgram options, make generate_report dry-run friendly, encode README utf-8, align languages
@staru09

staru09 commented Mar 17, 2026

Member

I see a bunch of cache files in this, remove them
clean up any extra files then I think I'll be able to review better
and is this related to whisper.cpp or deepgram?

@DavidH-1

Collaborator

Why did you do a commit for each file delete? It would have been cleaner to do a single commit incorporating them all, and it would have saved all the email notifications 😬

@aksh08022006

Author

I see a bunch of cache files in this, remove them clean up any extra files then I think I'll be able to review better and is this related to whisper.cpp or deepgram?

AI-172 is for Deepgram. There was a branch mix-up between the two tasks: AI-167 was for the Whisper implementation and AI-172 is for Deepgram.
I have removed the cache files.
Please let me know if anything else is required.

@aksh08022006

Author

Why did you do a commit for each file delete? It would have been cleaner to do a single commit incorporating them all, and it would have saved all the email notifications 😬

Yeah, sorry, I was trying to clean it up file by file. Let me create one commit that serves the complete purpose.

@aksh08022006

Author

@staru09 you can review now; you can ignore whisper.cpp.

@staru09

staru09 commented Mar 27, 2026

Member

why it's called smoke_test?
rest looks good.

@aksh08022006

Author

why it's called smoke_test? rest looks good.

A smoke test usually means surface-level testing before complete benchmarking.
But if you want, I can change it to something else.

@staru09

staru09 commented Apr 2, 2026

Member

resolve merge conflict then I can merge this

@aksh08022006

Author

resolve merge conflict then I can merge this

Done, resolved the merge conflict. You can now merge.

4 participants