Combines NVIDIA NeMo ASR with pyannote.audio speaker diarization for multi-speaker audio transcription. Optimized for long audio files with intelligent chunking and GPU memory management.
- Long Audio Support: Handles audio files of any length via intelligent chunking
- Memory Optimized: GPU cache clearing after each chunk prevents OOM errors
- Smart Chunking: Splits audio at silence boundaries for clean transcriptions
- Diarization Chunking: Long audio diarization with 60s overlap for speaker continuity
- Speaker Merging: Automatic speaker remapping across chunks using temporal patterns
- Parallel Processing: Configurable worker threads for batch processing
| Feature | Original | This Version |
|---|---|---|
| Max Chunk Duration | 24 min | 10 min (configurable) |
| Batch Size | 8 | 4 (memory safe) |
| Diarization | Single pass | Chunked with merging |
| GPU Memory | No management | Auto-cleared per chunk |
| Split Algorithm | Could hang | Guaranteed termination |
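The silence-boundary splitting and the termination guarantee from the table can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `plan_chunks`, the `silences` input (time points of detected silence, in seconds), and `search_window` are hypothetical names. The sketch guarantees termination by making every iteration advance, which is a different technique from the maximum iteration limit mentioned under troubleshooting.

```python
from typing import List, Tuple


def plan_chunks(
    duration: float,
    silences: List[float],
    max_chunk: float = 600.0,
    search_window: float = 30.0,
) -> List[Tuple[float, float]]:
    """Split [0, duration] into chunks of at most max_chunk seconds.

    Each cut is placed at the latest silence point within the last
    search_window seconds of the chunk; without one, the chunk is cut
    hard at max_chunk. All names here are hypothetical.
    """
    if max_chunk <= 0 or not 0 <= search_window < max_chunk:
        raise ValueError("need max_chunk > 0 and 0 <= search_window < max_chunk")

    chunks = []
    start = 0.0
    while start < duration:
        hard_end = start + max_chunk
        if hard_end >= duration:
            # The remainder fits in one chunk.
            chunks.append((start, duration))
            break
        # Silence points close to the end of the allowed chunk length.
        candidates = [s for s in silences if hard_end - search_window <= s <= hard_end]
        end = max(candidates) if candidates else hard_end
        # end > start always holds because search_window < max_chunk,
        # so each iteration advances and the loop terminates.
        chunks.append((start, end))
        start = end
    return chunks
```

With a 25-minute file and silence detected at 590 s and 1185 s, `plan_chunks(1500, [590, 1185])` cuts at both silence points instead of at the hard 600 s limits.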
- Python 3.8+
- NVIDIA GPU with CUDA 11.8+ (recommended)
- HuggingFace account and API token
- Create a HuggingFace account at https://huggingface.co/join
- Get your API token at https://huggingface.co/settings/tokens (create a new token with read access)
- Request access to these models (click the links and accept the terms):
- Wait for approval (usually instant)
git clone https://github.com/lab-rasool/speech_transcription.git
cd speech_transcription

Or download the ZIP from GitHub and extract it.
Option A: Using Conda (recommended)
conda create -n speech_transcription python=3.11
conda activate speech_transcription

Option B: Using venv
python -m venv venv
source venv/bin/activate # Linux/Mac

Install PyTorch with GPU support before other dependencies:
# For CUDA 12.4
pip install torch --index-url https://download.pytorch.org/whl/cu124
# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

Check your CUDA version with nvidia-smi and choose accordingly. See pytorch.org for other options.
pip install -r requirements.txt

python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Expected output: PyTorch: 2.x.x+cu124, CUDA: True
cp .env.example .env

Edit .env and add your HuggingFace token (get one at https://huggingface.co/settings/tokens):
HUGGINGFACE_TOKEN=hf_your_token_here # Required for pyannote
ASR_BATCH_SIZE=4 # Reduce if OOM errors
CHUNK_DURATION_SECONDS=600 # 10 minutes (reduce for less GPU memory)
NUM_WORKERS=1 # Increase for parallel file processing

Adjust speaker detection in your .env file based on your use case:
For interviews (2 people):
DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=2

For meetings (multiple people):
DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=10

For unknown number of speakers:
DIARIZATION_MIN_SPEAKERS=1
DIARIZATION_MAX_SPEAKERS=10

| Use Case | MIN | MAX | Notes |
|---|---|---|---|
| 1-on-1 Interview | 2 | 2 | Most accurate for known 2-person conversations |
| Small Meeting | 2 | 6 | Typical team meetings |
| Large Meeting | 2 | 10 | Conferences, group discussions |
| Unknown | 1 | 10 | Let pyannote decide automatically |
Tip: Setting DIARIZATION_MAX_SPEAKERS slightly higher than the expected number of speakers is better than setting it too low. However, accuracy decreases and processing time increases as the speaker count grows, because more voices are harder to tell apart. DIARIZATION_MAX_SPEAKERS can be set above 10.
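To make the configuration keys above concrete, here is a minimal sketch of how they could be read and validated. It is not the repository's loader: `Settings` and `load_settings` are hypothetical names, the speaker defaults of 1 and 10 are an assumption taken from the "Unknown" row, and the sketch assumes the values from .env are already present in the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Settings:
    asr_batch_size: int
    chunk_duration_seconds: int
    num_workers: int
    min_speakers: int
    max_speakers: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented keys, falling back to the documented defaults."""

    def get_int(name: str, default: int) -> int:
        # Drop a trailing inline comment such as "4 # Reduce if OOM errors".
        raw = env.get(name, "").split("#", 1)[0].strip()
        return int(raw) if raw else default

    settings = Settings(
        asr_batch_size=get_int("ASR_BATCH_SIZE", 4),
        chunk_duration_seconds=get_int("CHUNK_DURATION_SECONDS", 600),
        num_workers=get_int("NUM_WORKERS", 1),
        min_speakers=get_int("DIARIZATION_MIN_SPEAKERS", 1),  # assumed default
        max_speakers=get_int("DIARIZATION_MAX_SPEAKERS", 10),  # assumed default
    )
    if not 1 <= settings.min_speakers <= settings.max_speakers:
        raise ValueError("need 1 <= DIARIZATION_MIN_SPEAKERS <= DIARIZATION_MAX_SPEAKERS")
    return settings
```

Validating the speaker bounds early surfaces a swapped MIN/MAX pair before any audio is processed.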
- Place your audio files in the data/audio/ folder inside the repo
- Run transcription:

python main.py

Supported formats: .wav, .mp3, .flac, .m4a, .ogg, .webm
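As an illustration of how input files could be collected, the following sketch filters a folder by the supported extensions. The function name `find_audio_files` is hypothetical, and matching extensions case-insensitively is an assumption; main.py may discover files differently.

```python
from pathlib import Path
from typing import List

# Extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def find_audio_files(folder: str = "data/audio") -> List[Path]:
    """Return supported audio files directly inside folder, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```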
Results are saved to the results/ directory:
- {filename}_complete.json - Full metadata, transcription, diarization, speaker stats
- {filename}_transcript.txt - Human-readable transcript with speaker labels
- {filename}_statistics.json - Processing metrics and speaker statistics
Audio File
|
v
[Duration Check]
|
+---> Short (<10 min): Direct transcription
|
+---> Long (>10 min): Intelligent Chunking
|
v
[Chunk at silence boundaries]
|
v
[Transcribe each chunk]
|
v
[Stitch transcriptions]
|
v
[Speaker Diarization]
|
+---> Short (<20 min): Direct diarization
|
+---> Long (>20 min): Chunked with 60s overlap
|
v
[Merge & remap speakers]
|
v
[Align text with speakers]
|
v
[Save results]
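The "Merge & remap speakers" step can be pictured with the sketch below. It shows one plausible approach, matching labels by shared speaking time inside the overlap between two chunks, and is not necessarily how this project implements it. `remap_speakers`, the `Segment` tuple layout, and the `SPEAKER_NN` label format are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start, end, speaker) in absolute seconds


def remap_speakers(prev: List[Segment], curr: List[Segment]) -> Dict[str, str]:
    """Map speaker labels of curr onto labels of prev by shared speaking time."""
    shared = defaultdict(float)
    for p_start, p_end, p_spk in prev:
        for c_start, c_end, c_spk in curr:
            overlap = min(p_end, c_end) - max(p_start, c_start)
            if overlap > 0:
                shared[(c_spk, p_spk)] += overlap

    mapping: Dict[str, str] = {}
    used = set()
    # Greedy matching: strongest pairings first, each label used once.
    for (c_spk, p_spk), _ in sorted(shared.items(), key=lambda kv: kv[1], reverse=True):
        if c_spk not in mapping and p_spk not in used:
            mapping[c_spk] = p_spk
            used.add(p_spk)

    # Speakers with no match in the overlap get a fresh, unused label.
    taken = {spk for _, _, spk in prev}
    for _, _, c_spk in curr:
        if c_spk not in mapping:
            n = 0
            while f"SPEAKER_{n:02d}" in taken:
                n += 1
            mapping[c_spk] = f"SPEAKER_{n:02d}"
            taken.add(mapping[c_spk])
    return mapping
```

Greedy matching is simple and usually adequate for a handful of speakers; an optimal assignment would need something like the Hungarian algorithm.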
HuggingFace 403 Error: Request access to both pyannote models on HuggingFace
CUDA OOM:
- Reduce ASR_BATCH_SIZE to 2
- Reduce CHUNK_DURATION_SECONDS to 300 (5 min)
- Set NUM_WORKERS=1
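For context on why smaller chunks help, the per-chunk cache clearing described in the feature list can be sketched as below. `process_chunks` and `free_gpu_memory` are hypothetical names and `transcribe` stands in for the real ASR call; only `gc.collect()` and `torch.cuda.empty_cache()` are actual library functions.

```python
import gc
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def free_gpu_memory() -> None:
    """Release unreferenced Python objects, then PyTorch's cached GPU blocks."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed: nothing else to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def process_chunks(chunks: Iterable[T], transcribe: Callable[[T], R]) -> List[R]:
    """Run transcribe on each chunk, clearing GPU memory after every chunk."""
    results = []
    for chunk in chunks:
        try:
            results.append(transcribe(chunk))
        finally:
            # Runs even if transcribe raises, so a failed chunk does not
            # leave cached memory behind for the next one.
            free_gpu_memory()
    return results
```

Clearing the cache does not reduce the peak memory of a single chunk, which is why lowering the batch size or chunk duration is still the first remedy.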
Infinite loop in chunking: Fixed in this version with a maximum iteration limit.
Based on maaz328/speech-diarization-transcription with modifications for long audio support and memory optimization.
MIT