AI-172: Foundation for Deepgram multilingual evaluation #99
aksh08022006 wants to merge 28 commits into
…king Add whisper.cpp as git submodule with setup documentation and minimal evaluation runner for AI-167 (multilingual support evaluation).
Changes:
- evaluation/whisper.cpp: C/C++ Whisper for on-device mobile inference
- evaluation/SETUP.md: Build and setup instructions
- evaluation/run_multilingual_eval.py: Thin wrapper to run benchmarks
This enables evaluation of whisper-tiny, whisper-base, and whisper-small across 5 languages (en, hi, es, fr, de) with focus on mobile performance metrics: WER, CER, latency, model size, memory usage.
Addresses: AI-167
Add docstrings and section headers explaining:
- Why whisper.cpp was chosen over Python/HuggingFace (mobile focus)
- Key design decisions (minimal code, validate state, use existing tools)
- Collaborative refinement with Claude Opus 4.5
This makes the contribution more human and transparent about the thought process, not just presenting final code.
…dback Per Pronay Sarker's guidance, add comprehensive results documentation that includes:
- Jira ticket reference (AI-167)
- Repository link for mentor review and reproduction
- Reproduction steps
- Results template (WER/CER, latency, model size)
- Findings section for analysis
- Next steps for follow-up work
This provides mentors with a single source of truth for:
1. How to access the work
2. How to reproduce it
3. Where findings will be documented
Added a note about using AI model Claude Opus 4.6 for refinement.
Refine script with insights from Claude Opus 4.5 and related benchmarks.
Modified the benchmarking process to iterate over multiple models, running benchmarks for each model individually.
- Add llms.txt for Claude/AI tool compatibility
- Add SKILL.md defining /benchmark-speech and /evaluate-language workflows
- Add Deepgram STT provider wrapper with async streaming
- Add WER/CER metrics calculation for accuracy measurement
- Add language targets (English, Spanish, Hindi, Swahili, French, Portuguese)
Total: 426 lines across 5 focused files
Pull request overview
Adds foundational pieces for multilingual speech evaluation, including a Deepgram STT provider wrapper, accuracy metrics, and dataset language definitions, plus supporting documentation and a whisper.cpp evaluation submodule.
Changes:
- Added Deepgram async transcription provider scaffold under benchmarking_experiments/.
- Implemented basic WER/CER computation utilities and a small language/domain reference dataset.
- Added evaluation docs/scripts (whisper.cpp submodule + runner) and AI-facing project documentation (skills + llms.txt).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| llms.txt | Adds high-level project + evaluation documentation for AI tooling. |
| evaluation/whisper.cpp | Adds whisper.cpp as a git submodule pointer for evaluations. |
| evaluation/run_multilingual_eval.py | Introduces a runner script intended to orchestrate whisper.cpp benchmarks. |
| evaluation/SETUP.md | Documents how to initialize/build whisper.cpp and run the evaluation script. |
| evaluation/RESULTS.md | Adds a results template for AI-167 whisper.cpp multilingual evaluation. |
| benchmarking_experiments/providers/deepgram.py | Adds an async Deepgram STT provider wrapper for AI-172 experiments. |
| benchmarking_experiments/metrics/accuracy.py | Adds text normalization + edit-distance-based WER/CER helpers. |
| benchmarking_experiments/dataset/ai172_languages.py | Defines target languages/domains and sample reference phrases. |
| SKILL.md | Documents intended “skills”/workflows for benchmarking automation. |
| PR_AI172_FOUNDATION.md | Adds a PR summary doc and basic import-based “testing” notes. |
| .gitmodules | Registers whisper.cpp as a submodule. |
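For readers unfamiliar with the metric, the "text normalization + edit-distance-based WER" described for metrics/accuracy.py can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names `normalize_text`, `edit_ops`, and `wer`, and the specific normalization steps (NFKC, lowercasing, dropping Unicode punctuation), are assumptions made for the example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, punctuation removed, spaces collapsed."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace Unicode punctuation (categories P*) with spaces; combining marks
    # such as Devanagari vowel signs are left untouched.
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    return " ".join(text.split())


def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) of one minimal alignment."""
    # dp[i][j] holds (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference token
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis token
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match costs nothing
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitute, delete, insert)  # lowest cost wins
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / number of reference words."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    subs, dels, ins = edit_ops(ref, hyp)
    return (subs + dels + ins) / len(ref)


if __name__ == "__main__":
    print(wer("The cat sat on the mat.", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.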
_, insertions, deletions, substitutions = edit_distance(list(ref_chars), list(hyp_chars))
cer = (substitutions + deletions + insertions) / len(ref_chars)
This CER implementation includes spaces as characters because normalize_text() returns a space-delimited string. CER is typically computed over characters excluding whitespace (otherwise spacing differences can dominate the score). Consider removing spaces before list(...) (e.g., stripping all whitespace after normalization) so CER better reflects character recognition.
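The effect the comment describes can be seen in a small sketch. The helper names `levenshtein` and `cer` and the `strip_whitespace` flag are hypothetical and not taken from the PR; inputs are assumed to be already normalized.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Plain edit distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str, strip_whitespace: bool = True) -> float:
    """Character error rate; optionally ignores all whitespace on both sides."""
    ref, hyp = reference, hypothesis
    if strip_whitespace:
        # str.split() with no argument splits on any whitespace run,
        # so joining the pieces removes spaces, tabs, and newlines.
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference has no characters to compare")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A pure spacing difference counts as an error only when spaces are kept.
    print(cer("ice cream", "icecream", strip_whitespace=False))
    print(cer("ice cream", "icecream"))
```

With whitespace kept, the missing space in "icecream" is one error over nine reference characters; with whitespace stripped, the two strings are identical and the CER is zero.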
- Use relative import in deepgram.py (from .base)
- Use list + join for audio buffer (avoid O(n²) concat)
- Convert Deepgram SDK response to dict before parsing
- Strip whitespace in CER calculation for accurate character comparison
- Pass model and language params to bench.py in eval runner
- Update docstring to match actual implementation behavior
- Fix grammar in RESULTS.md header
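The "list + join" item above refers to a common pattern for accumulating streamed bytes. A minimal sketch, assuming the audio arrives as an async iterator of `bytes` chunks; `collect_audio` and `fake_stream` are hypothetical names used only for illustration and do not come from the PR or the Deepgram SDK.

```python
import asyncio
from typing import AsyncIterator, List


async def collect_audio(stream: AsyncIterator[bytes]) -> bytes:
    """Gather chunks in a list and join once at the end.

    Repeated `buffer += chunk` on immutable bytes copies the whole buffer
    each time, which is quadratic in the total size; a single join is linear.
    """
    parts: List[bytes] = []
    async for chunk in stream:
        parts.append(chunk)
    return b"".join(parts)


async def fake_stream(n_chunks: int, chunk_size: int) -> AsyncIterator[bytes]:
    """Stand-in for a real audio source: yields n_chunks blocks of filler bytes."""
    for i in range(n_chunks):
        yield bytes([i % 256]) * chunk_size


if __name__ == "__main__":
    audio = asyncio.run(collect_audio(fake_stream(4, 3)))
    print(len(audio))
```

A `bytearray` with `extend()` would be a reasonable alternative when the buffer needs to be inspected while it is still growing.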
Remove files (llms.txt, skills.md, etc.) because we don't want these as of now.
Per mentor feedback (Aru Sharma):
- Create benchmarking dataset first (Hindi, French, Portuguese, English)
- Upload to HuggingFace for community reuse
- Output metrics report showing per-language performance
Added:
- dataset/create_hf_dataset.py: HF-compatible dataset with ground truth transcriptions, metadata.csv, dataset card, and upload to HF Hub
- generate_report.py: Runs evaluation and outputs per-language WER/CER/MOS report (JSON + human-readable table)
- requirements.txt: Added deepgram-sdk, huggingface_hub, datasets
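One way a per-language report like the one described above could be aggregated is sketched here. The record fields (`language`, `word_errors`, `ref_words`), the function names, and the sample values are all hypothetical placeholders, not the PR's schema or measured results; only WER is shown.

```python
import json
from collections import defaultdict

# Placeholder per-sample records; the numbers are illustrative only.
SAMPLES = [
    {"language": "en", "word_errors": 1, "ref_words": 10},
    {"language": "en", "word_errors": 0, "ref_words": 10},
    {"language": "hi", "word_errors": 3, "ref_words": 12},
]


def per_language_report(samples):
    """Pool errors and reference words per language, then compute corpus-level WER."""
    totals = defaultdict(lambda: {"errors": 0, "ref_words": 0, "samples": 0})
    for sample in samples:
        bucket = totals[sample["language"]]
        bucket["errors"] += sample["word_errors"]
        bucket["ref_words"] += sample["ref_words"]
        bucket["samples"] += 1
    report = {}
    for lang, bucket in sorted(totals.items()):
        wer = bucket["errors"] / bucket["ref_words"] if bucket["ref_words"] else None
        report[lang] = {"samples": bucket["samples"], "wer": wer}
    return report


def format_table(report):
    """Render the report as a fixed-width, human-readable table."""
    lines = [f"{'language':<10}{'samples':>8}{'WER':>8}"]
    for lang, row in report.items():
        wer = "n/a" if row["wer"] is None else f"{row['wer']:.3f}"
        lines.append(f"{lang:<10}{row['samples']:>8}{wer:>8}")
    return "\n".join(lines)


if __name__ == "__main__":
    report = per_language_report(SAMPLES)
    print(json.dumps(report, indent=2))
    print(format_table(report))
```

Pooling errors before dividing weights long utterances more heavily than averaging per-sample WER would; which of the two the report should use is a design choice this sketch does not settle.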
staru09: 'Remove files (llms.txt, skills.md, etc.) because we don't want these as of now'
Yeah, sure. Removed llms.txt and skills.md.
@openMF/ai-community-maintainer, please re-review and let me know if any more changes are required.
We don't want to keep the whisper.cpp submodule.
CLA check ok |
Should I (A) make whisper.cpp optional and add a small tools/install_whisper_cpp.sh (recommended, quick), or (B) remove it entirely and use Docker/prebuilt binaries instead?
- All core modules tested and working locally
- Metrics (WER/CER) calculation verified
- Multilingual dataset structure confirmed (6 languages, 20 samples)
- Report generator functional
- Test script and results for future reference
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
…coverable, remove submodule, fix Deepgram options, make generate_report dry-run friendly, encode README utf-8, align languages
I see a bunch of cache files in this; remove them.
Why did you make a commit for each file deletion? It would have been cleaner to do a single commit incorporating them all, and it would have saved all the email notifications 😬
AI-172 is for Deepgram. There was a branch mix-up between the two tasks: AI-167 was for the whisper implementation and AI-172 is for Deepgram.
Yeah, sorry, I was trying to clean it up file by file. Let me create one commit to serve the complete purpose.
@staru09 you can review now; you can ignore whisper.cpp.
Why is it called smoke_test?
… all unique requirements
A smoke test usually means surface-level testing before complete benchmarking.
Resolve the merge conflict, then I can merge this.
Done, resolved the merge conflict. You can merge now.
Jira: https://mifosforge.jira.com/browse/AI-172
Adds Deepgram multilingual STT evaluation framework:
- providers/deepgram.py: Deepgram API wrapper for async transcription
- metrics/accuracy.py: WER/CER calculation
- dataset/ai172_languages.py: Language targets (Hindi, French, Portuguese, English)
- dataset/create_hf_dataset.py: HuggingFace dataset builder with ground truth transcriptions + upload script
- generate_report.py: Per-language metrics report (WER, CER, MOS)
Languages: English, Hindi, French, Portuguese
Testing Evidence
All core functionality has been tested locally and is fully operational:
Test Results Summary