Speech Transcription with Speaker Diarization

Combines NVIDIA NeMo ASR with pyannote.audio speaker diarization for multi-speaker audio transcription. Optimized for long audio files with intelligent chunking and GPU memory management.

Features

Long Audio Support: Handles audio files of any length via intelligent chunking
Memory Optimized: GPU cache clearing after each chunk prevents OOM errors
Smart Chunking: Splits audio at silence boundaries for clean transcriptions
Diarization Chunking: Long audio diarization with 60s overlap for speaker continuity
Speaker Merging: Automatic speaker remapping across chunks using temporal patterns
Parallel Processing: Configurable worker threads for batch processing
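
The speaker-merging idea can be pictured with a minimal sketch. It assumes each chunk's diarization is a list of (start, end, label) tuples on the global timeline, and that labels in a later chunk are matched to earlier labels by how much speaking time they share inside the overlap window. The function names and the greedy matching rule are illustrative assumptions, not the repository's actual implementation.

```python
from collections import defaultdict


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def build_speaker_map(prev_segments, next_segments, win_start, win_end):
    """Map labels used in next_segments onto labels used in prev_segments.

    Only speech inside the overlap window [win_start, win_end] is compared.
    """
    shared = defaultdict(float)
    for p_start, p_end, p_label in prev_segments:
        p_s, p_e = max(p_start, win_start), min(p_end, win_end)
        if p_e <= p_s:
            continue  # segment lies outside the overlap window
        for n_start, n_end, n_label in next_segments:
            n_s, n_e = max(n_start, win_start), min(n_end, win_end)
            if n_e <= n_s:
                continue
            shared[(n_label, p_label)] += overlap_seconds(p_s, p_e, n_s, n_e)

    mapping, used = {}, set()
    # Greedy assignment (an assumption): strongest co-occurring pairs first.
    for (n_label, p_label), secs in sorted(shared.items(), key=lambda kv: -kv[1]):
        if secs > 0 and n_label not in mapping and p_label not in used:
            mapping[n_label] = p_label
            used.add(p_label)
    return mapping


def remap(segments, mapping, chunk_index):
    """Relabel segments; unmatched speakers get a chunk-specific name."""
    return [
        (start, end, mapping.get(label, f"{label}_chunk{chunk_index}"))
        for start, end, label in segments
    ]


# Hypothetical data: two chunks that share the window 540 s to 600 s.
prev_chunk = [(540.0, 570.0, "SPEAKER_00"), (570.0, 600.0, "SPEAKER_01")]
next_chunk = [
    (540.0, 569.0, "SPEAKER_01"),
    (569.0, 600.0, "SPEAKER_00"),
    (600.0, 650.0, "SPEAKER_02"),
]
speaker_map = build_speaker_map(prev_chunk, next_chunk, 540.0, 600.0)
merged = remap(next_chunk, speaker_map, chunk_index=1)
```

In this example the second chunk's labels are swapped relative to the first, and the map restores consistency; the speaker who only appears after the overlap keeps a new, chunk-specific label.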

Key Improvements (vs Original)

Feature	Original	This Version
Max Chunk Duration	24 min	10 min (configurable)
Batch Size	8	4 (memory safe)
Diarization	Single pass	Chunked with merging
GPU Memory	No management	Auto-cleared per chunk
Split Algorithm	Could hang	Guaranteed termination
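
To illustrate how a silence-aware split can be made to terminate, here is a minimal sketch. It assumes silence positions have already been detected and are given as timestamps in seconds; plan_chunks, min_chunk, and the fallback to a hard cut are hypothetical choices for illustration, not the repository's code.

```python
import math


def plan_chunks(duration, silences, max_chunk=600.0, min_chunk=30.0):
    """Return (start, end) chunk boundaries, preferring cuts at silences.

    Each pass advances by at least min_chunk seconds, and the loop is also
    capped by an explicit iteration limit, so it cannot run forever.
    """
    if duration <= 0:
        return []
    if not 0 < min_chunk <= max_chunk:
        raise ValueError("require 0 < min_chunk <= max_chunk")

    silences = sorted(silences)
    chunks = []
    start = 0.0
    max_iterations = math.ceil(duration / min_chunk) + 1
    for _ in range(max_iterations):
        if duration - start <= max_chunk:
            chunks.append((start, duration))  # the remainder fits in one chunk
            break
        limit = start + max_chunk
        # Latest silence that keeps the chunk between min_chunk and max_chunk.
        candidates = [t for t in silences if start + min_chunk <= t <= limit]
        end = candidates[-1] if candidates else limit  # hard cut if no silence
        chunks.append((start, end))
        start = end
    return chunks


# Hypothetical 25-minute file with a few detected silences.
with_silence = plan_chunks(1500.0, [290.0, 580.0, 610.0, 1150.0, 1400.0])
without_silence = plan_chunks(1500.0, [])
```

When no usable silence exists, the sketch falls back to a cut at exactly max_chunk, which trades a possibly split word for a guaranteed upper bound on chunk length.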

Prerequisites

Python 3.8+
NVIDIA GPU with CUDA 11.8+ (recommended)
HuggingFace account and API token

Getting HuggingFace Access

Create a HuggingFace account at https://huggingface.co/join
Get your API token at https://huggingface.co/settings/tokens (create a new token with read access)
Request access to these models (click the links and accept the terms):
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://huggingface.co/pyannote/segmentation-3.0
Wait for approval (usually instant)

Installation

Step 1: Clone the Repository

git clone https://github.com/lab-rasool/speech_transcription.git
cd speech_transcription

Or download the ZIP from GitHub and extract it.

Step 2: Create Environment

Option A: Using Conda (recommended)

conda create -n speech_transcription python=3.11
conda activate speech_transcription

Option B: Using venv

python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

Step 3: Install PyTorch with CUDA

Install PyTorch with GPU support before other dependencies:

# For CUDA 12.4
pip install torch --index-url https://download.pytorch.org/whl/cu124

# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

Check your CUDA version with nvidia-smi and choose accordingly. See pytorch.org for other options.

Step 4: Install Dependencies

pip install -r requirements.txt

Step 5: Verify Installation

python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Expected output: PyTorch: 2.x.x+cu124, CUDA: True (the suffix is +cu118 for the CUDA 11.8 build)

Step 6: Configure Environment

cp .env.example .env

Edit .env and add your HuggingFace token (get one at https://huggingface.co/settings/tokens):

HUGGINGFACE_TOKEN=hf_your_token_here  # Required for pyannote
ASR_BATCH_SIZE=4                       # Reduce if OOM errors
CHUNK_DURATION_SECONDS=600             # 10 minutes (reduce for less GPU memory)
NUM_WORKERS=1                          # Increase for parallel file processing
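
For readers who want to see how these settings might be consumed, the following is a minimal sketch. It assumes the variables are already present in the process environment (for example, after a dotenv-style loader has read .env). The Settings class and load_settings function are hypothetical names; the defaults mirror the values shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    huggingface_token: str
    asr_batch_size: int = 4
    chunk_duration_seconds: int = 600
    num_workers: int = 1


def _value(env, key, default=""):
    """Return the value for key with any trailing '# comment' removed."""
    return env.get(key, default).split("#", 1)[0].strip()


def _positive_int(env, key, default):
    """Parse an integer setting and reject values below 1."""
    number = int(_value(env, key, str(default)))
    if number < 1:
        raise ValueError(f"{key} must be >= 1, got {number}")
    return number


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    token = _value(env, "HUGGINGFACE_TOKEN")
    if not token:
        raise ValueError("HUGGINGFACE_TOKEN is required for the pyannote models")
    return Settings(
        huggingface_token=token,
        asr_batch_size=_positive_int(env, "ASR_BATCH_SIZE", 4),
        chunk_duration_seconds=_positive_int(env, "CHUNK_DURATION_SECONDS", 600),
        num_workers=_positive_int(env, "NUM_WORKERS", 1),
    )


# Example with an explicit mapping instead of the real environment.
example = load_settings({"HUGGINGFACE_TOKEN": "hf_example", "ASR_BATCH_SIZE": "2"})
```

Failing early on a missing token or a non-positive value surfaces configuration mistakes before any model is loaded.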

Setting Number of Speakers

Adjust speaker detection in your .env file based on your use case:

For interviews (2 people):

DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=2

For meetings (multiple people):

DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=10

For unknown number of speakers:

DIARIZATION_MIN_SPEAKERS=1
DIARIZATION_MAX_SPEAKERS=10

Use Case	DIARIZATION_MIN_SPEAKERS	DIARIZATION_MAX_SPEAKERS	Notes
1-on-1 Interview	2	2	Most accurate for known 2-person conversations
Small Meeting	2	6	Typical team meetings
Large Meeting	2	10	Conferences, group discussions
Unknown	1	10	Let pyannote decide automatically

Tip: Setting DIARIZATION_MAX_SPEAKERS slightly higher than the expected count is safer than setting it too low. However, accuracy drops and processing time grows as the number of speakers increases, because more voices are harder to tell apart. DIARIZATION_MAX_SPEAKERS can be set above 10.
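
The presets above can be expressed as a small sketch that validates the two bounds before they reach the diarizer. The PRESETS dictionary and speaker_bounds function are hypothetical names, and the fallback of 1 and 10 is an assumption taken from the "unknown" case. pyannote.audio pipelines accept min_speakers and max_speakers keyword arguments when called; whether this repository passes them in exactly this way is also an assumption.

```python
# (min, max) pairs copied from the table above.
PRESETS = {
    "interview": (2, 2),
    "small_meeting": (2, 6),
    "large_meeting": (2, 10),
    "unknown": (1, 10),
}


def speaker_bounds(env):
    """Read and validate the speaker limits from a mapping of settings."""
    low = int(env.get("DIARIZATION_MIN_SPEAKERS", "1"))
    high = int(env.get("DIARIZATION_MAX_SPEAKERS", "10"))
    if low < 1:
        raise ValueError("DIARIZATION_MIN_SPEAKERS must be at least 1")
    if high < low:
        raise ValueError("DIARIZATION_MAX_SPEAKERS must be >= DIARIZATION_MIN_SPEAKERS")
    # Keyword names match those accepted by a pyannote pipeline call.
    return {"min_speakers": low, "max_speakers": high}


interview = speaker_bounds(
    {"DIARIZATION_MIN_SPEAKERS": "2", "DIARIZATION_MAX_SPEAKERS": "2"}
)
fallback = speaker_bounds({})
```

The returned dictionary could then be unpacked into the diarization call, for example pipeline(audio_path, **interview).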

Usage

Place your audio files in the data/audio/ folder inside the repo
Run transcription:

python main.py

Supported formats: .wav, .mp3, .flac, .m4a, .ogg, .webm
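
As a rough illustration of how input discovery could work, the sketch below lists files in a folder whose extension is one of the supported formats. find_audio_files is a hypothetical helper, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed under "Supported formats".
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def find_audio_files(folder):
    """Return supported audio files directly inside folder, sorted by name."""
    folder = Path(folder)
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling find_audio_files("data/audio") would return the files that main.py is expected to pick up, which can help confirm that a file was not skipped because of its extension.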

Output

Results are saved to the results/ directory:

{filename}_complete.json - Full metadata, transcription, diarization, speaker stats
{filename}_transcript.txt - Human-readable transcript with speaker labels
{filename}_statistics.json - Processing metrics and speaker statistics
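
The naming scheme above can be sketched as a helper that predicts where the outputs for a given input will land. It assumes {filename} means the audio file name without its extension; expected_outputs is a hypothetical function, not part of the repository.

```python
from pathlib import Path


def expected_outputs(audio_path, results_dir="results"):
    """Map an input audio path to the three result files described above."""
    stem = Path(audio_path).stem  # assumption: extension is dropped
    base = Path(results_dir)
    return {
        "complete": base / f"{stem}_complete.json",
        "transcript": base / f"{stem}_transcript.txt",
        "statistics": base / f"{stem}_statistics.json",
    }


outputs = expected_outputs("data/audio/interview.mp3")
```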

Architecture

Audio File
    |
    v
[Duration Check]
    |
    +---> Short (<10 min): Direct transcription
    |
    +---> Long (>10 min): Intelligent Chunking
              |
              v
         [Chunk at silence boundaries]
              |
              v
         [Transcribe each chunk]
              |
              v
         [Stitch transcriptions]
    |
    v
[Speaker Diarization]
    |
    +---> Short (<20 min): Direct diarization
    |
    +---> Long (>20 min): Chunked with 60s overlap
              |
              v
         [Merge & remap speakers]
    |
    v
[Align text with speakers]
    |
    v
[Save results]
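
The duration-based routing in the diagram can be sketched as follows. The 10-minute and 20-minute thresholds and the 60-second overlap come from the diagram; the function names, the choice to treat a file exactly at a threshold as "short", and the use of 20-minute diarization windows are assumptions made for illustration.

```python
ASR_THRESHOLD_SECONDS = 600.0           # 10 minutes
DIARIZATION_THRESHOLD_SECONDS = 1200.0  # 20 minutes
DIARIZATION_OVERLAP_SECONDS = 60.0


def plan_pipeline(duration):
    """Choose the direct or chunked path for each stage."""
    return {
        "asr": "direct" if duration <= ASR_THRESHOLD_SECONDS else "chunked",
        "diarization": (
            "direct" if duration <= DIARIZATION_THRESHOLD_SECONDS else "chunked"
        ),
    }


def diarization_windows(duration, window=DIARIZATION_THRESHOLD_SECONDS,
                        overlap=DIARIZATION_OVERLAP_SECONDS):
    """Return overlapping (start, end) windows that cover the whole file."""
    if not 0 <= overlap < window:
        raise ValueError("require 0 <= overlap < window")
    if duration <= window:
        return [(0.0, float(duration))]
    step = window - overlap  # strictly positive, so the loop terminates
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        windows.append((start, end))
        if end >= duration:
            break
        start += step
    return windows


# Hypothetical 50-minute recording.
plan = plan_pipeline(3000.0)
windows = diarization_windows(3000.0)
```

Each consecutive pair of windows shares 60 seconds of audio, which is the region a merge step can use to line up speaker labels between chunks.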

Troubleshooting

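HuggingFace 403 Error: Your token does not have access to the gated pyannote models. Accept the terms for both pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 (see Getting HuggingFace Access), then check that HUGGINGFACE_TOKEN in .env is correct.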

CUDA OOM:

Reduce ASR_BATCH_SIZE to 2
Reduce CHUNK_DURATION_SECONDS to 300 (5 min)
Set NUM_WORKERS=1
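
The manual steps above could also be automated. The sketch below retries a chunk with a smaller batch size after an out-of-memory failure and clears cached GPU memory between attempts. run_with_backoff and the transcribe callable are hypothetical; the only PyTorch calls used are torch.cuda.is_available() and torch.cuda.empty_cache(), and the import is guarded so the sketch also runs without PyTorch installed.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to run on machines without PyTorch
    torch = None


def free_gpu_memory():
    """Drop unreachable Python objects, then release cached CUDA memory."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def is_oom(error):
    """PyTorch reports CUDA OOM as a RuntimeError mentioning 'out of memory'."""
    if isinstance(error, MemoryError):
        return True
    return isinstance(error, RuntimeError) and "out of memory" in str(error).lower()


def run_with_backoff(transcribe, chunk, batch_size=4, min_batch_size=1):
    """Call transcribe(chunk, batch_size=...), halving batch_size on OOM."""
    while True:
        try:
            return transcribe(chunk, batch_size=batch_size)
        except (MemoryError, RuntimeError) as error:
            if not is_oom(error) or batch_size <= min_batch_size:
                raise  # unrelated failure, or nothing left to reduce
            free_gpu_memory()
            batch_size = max(min_batch_size, batch_size // 2)
```

Because the batch size is halved on every retry and the loop re-raises once it reaches min_batch_size, the number of attempts is bounded.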

Infinite loop in chunking: Fixed in this version. The split algorithm now enforces a maximum iteration limit, so it always terminates.

Credits

Based on maaz328/speech-diarization-transcription with modifications for long audio support and memory optimization.

License

MIT
