lab-rasool/speech_transcription

Speech Transcription with Speaker Diarization

Combines NVIDIA NeMo ASR with pyannote.audio speaker diarization for multi-speaker audio transcription. Optimized for long audio files with intelligent chunking and GPU memory management.

Features

  • Long Audio Support: Handles audio files of any length via intelligent chunking
  • Memory Optimized: GPU cache clearing after each chunk prevents OOM errors
  • Smart Chunking: Splits audio at silence boundaries for clean transcriptions
  • Diarization Chunking: Long audio diarization with 60s overlap for speaker continuity
  • Speaker Merging: Automatic speaker remapping across chunks using temporal patterns
  • Parallel Processing: Configurable worker threads for batch processing
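The diarization chunking described above can be pictured as a sliding window over the recording. The sketch below is an illustration, not the repository's code: the function name `overlapping_windows` and the 1200-second window length are assumptions (the window length is borrowed from the 20-minute threshold in the Architecture section), while the 60-second overlap comes from the feature list.

```python
from typing import List, Tuple


def overlapping_windows(
    total_s: float,
    window_s: float = 1200.0,  # assumed window length, not taken from the repo
    overlap_s: float = 60.0,   # overlap stated in the feature list
) -> List[Tuple[float, float]]:
    """Return (start, end) windows that cover [0, total_s] with a fixed overlap."""
    if window_s <= overlap_s:
        raise ValueError("window_s must be larger than overlap_s")
    if total_s <= window_s:
        return [(0.0, total_s)]

    step = window_s - overlap_s  # how far each new window advances
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


if __name__ == "__main__":
    for start, end in overlapping_windows(3000.0):
        print(f"{start:7.1f}s -> {end:7.1f}s")
```

Each window shares its last 60 seconds with the start of the next one, which gives the merging step a region where the same voices appear in both chunks.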

Key Improvements (vs Original)

| Feature            | Original      | This Version           |
|--------------------|---------------|------------------------|
| Max Chunk Duration | 24 min        | 10 min (configurable)  |
| Batch Size         | 8             | 4 (memory safe)        |
| Diarization        | Single pass   | Chunked with merging   |
| GPU Memory         | No management | Auto-cleared per chunk |
| Split Algorithm    | Could hang    | Guaranteed termination |
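To make the "Chunked with merging" row more concrete, here is one way speaker labels from two adjacent diarization chunks could be reconciled. It is a sketch under assumptions, not the project's actual merging logic: it assumes segments are `(start_s, end_s, label)` tuples on the global timeline, and `remap_speakers` is a hypothetical helper that greedily pairs labels by how much speaking time they share in the overlap region.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, speaker_label), global time


def remap_speakers(prev: List[Segment], curr: List[Segment]) -> Dict[str, str]:
    """Map labels used in `curr` to labels used in `prev` by shared speaking time."""
    shared = defaultdict(float)  # (curr_label, prev_label) -> overlapping seconds
    for c_start, c_end, c_label in curr:
        for p_start, p_end, p_label in prev:
            overlap = min(c_end, p_end) - max(c_start, p_start)
            if overlap > 0:
                shared[(c_label, p_label)] += overlap

    mapping: Dict[str, str] = {}
    used = set()
    # Greedy assignment: strongest pairs first, each previous label used once.
    for (c_label, p_label), _ in sorted(shared.items(), key=lambda kv: -kv[1]):
        if c_label not in mapping and p_label not in used:
            mapping[c_label] = p_label
            used.add(p_label)
    return mapping


if __name__ == "__main__":
    prev = [(1140.0, 1170.0, "A"), (1170.0, 1200.0, "B")]
    curr = [(1140.0, 1168.0, "S0"), (1168.0, 1200.0, "S1")]
    print(remap_speakers(prev, curr))
```

Labels in `curr` that never overlap a previous speaker are left out of the mapping, so a caller can treat them as new speakers.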

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA 11.8+ (recommended)
  • HuggingFace account and API token

Getting HuggingFace Access

  1. Create a HuggingFace account at https://huggingface.co/join
  2. Get your API token at https://huggingface.co/settings/tokens (create a new token with read access)
  3. Request access to both gated pyannote models used by this project (open each model page on HuggingFace and accept the terms)
  4. Wait for approval (usually instant)

Installation

Step 1: Clone the Repository

git clone https://github.com/lab-rasool/speech_transcription.git
cd speech_transcription

Or download the ZIP from GitHub and extract it.

Step 2: Create Environment

Option A: Using Conda (recommended)

conda create -n speech_transcription python=3.11
conda activate speech_transcription

Option B: Using venv

python -m venv venv
source venv/bin/activate  # Linux/Mac

Step 3: Install PyTorch with CUDA

Install PyTorch with GPU support before other dependencies:

# For CUDA 12.4
pip install torch --index-url https://download.pytorch.org/whl/cu124

# For CUDA 11.8
pip install torch --index-url https://download.pytorch.org/whl/cu118

Check your CUDA version with nvidia-smi and choose accordingly. See pytorch.org for other options.

Step 4: Install Dependencies

pip install -r requirements.txt

Step 5: Verify Installation

python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

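Expected output: PyTorch: 2.x.x+cu124, CUDA: True (the suffix reads +cu118 for the CUDA 11.8 build). If CUDA shows False, PyTorch cannot see the GPU.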

Step 6: Configure Environment

cp .env.example .env

Edit .env and add your HuggingFace token (get one at https://huggingface.co/settings/tokens):

HUGGINGFACE_TOKEN=hf_your_token_here  # Required for pyannote
ASR_BATCH_SIZE=4                       # Reduce if OOM errors
CHUNK_DURATION_SECONDS=600             # 10 minutes (reduce for less GPU memory)
NUM_WORKERS=1                          # Increase for parallel file processing
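For readers who want to see how these settings might be consumed, the sketch below reads them from a mapping of environment variables. It is not the repository's loader: `Settings` and `load_settings` are hypothetical names, and the fallback values simply mirror the example above. It also assumes the `.env` file has already been loaded into the environment by some other means.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    huggingface_token: str
    asr_batch_size: int
    chunk_duration_seconds: int
    num_workers: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables (defaults are assumptions)."""
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "")
    if not token:
        raise ValueError("HUGGINGFACE_TOKEN is required for pyannote")
    return Settings(
        huggingface_token=token,
        asr_batch_size=int(env.get("ASR_BATCH_SIZE", "4")),
        chunk_duration_seconds=int(env.get("CHUNK_DURATION_SECONDS", "600")),
        num_workers=int(env.get("NUM_WORKERS", "1")),
    )


if __name__ == "__main__":
    demo = {"HUGGINGFACE_TOKEN": "hf_your_token_here", "ASR_BATCH_SIZE": "2"}
    print(load_settings(demo))
```

Passing a plain dictionary, as in the demo, makes the loader easy to exercise without touching the real environment.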

Setting Number of Speakers

Adjust speaker detection in your .env file based on your use case:

For interviews (2 people):

DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=2

For meetings (multiple people):

DIARIZATION_MIN_SPEAKERS=2
DIARIZATION_MAX_SPEAKERS=10

For unknown number of speakers:

DIARIZATION_MIN_SPEAKERS=1
DIARIZATION_MAX_SPEAKERS=10
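
| Use Case         | MIN | MAX | Notes                                          |
|------------------|-----|-----|------------------------------------------------|
| 1-on-1 Interview | 2   | 2   | Most accurate for known 2-person conversations |
| Small Meeting    | 2   | 6   | Typical team meetings                          |
| Large Meeting    | 2   | 10  | Conferences, group discussions                 |
| Unknown          | 1   | 10  | Let pyannote decide automatically              |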

Tip: Setting DIARIZATION_MAX_SPEAKERS slightly higher than the expected count is safer than setting it too low. Accuracy drops and processing time grows as the speaker count increases, because more voices are harder to tell apart. Values above 10 are allowed.
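A quick sanity check on the two bounds can catch a swapped or invalid pair before a long run starts. The helper below is a sketch: `speaker_bounds` is a hypothetical name, and the defaults of 1 and 10 are assumptions taken from the "Unknown" row of the table. pyannote.audio pipelines accept `min_speakers` and `max_speakers` keyword arguments, so the returned pair could be forwarded there, though how this project passes them is not shown here.

```python
from typing import Mapping, Tuple


def speaker_bounds(env: Mapping[str, str]) -> Tuple[int, int]:
    """Read and validate the diarization speaker bounds (defaults are assumed)."""
    min_speakers = int(env.get("DIARIZATION_MIN_SPEAKERS", "1"))
    max_speakers = int(env.get("DIARIZATION_MAX_SPEAKERS", "10"))
    if min_speakers < 1:
        raise ValueError("DIARIZATION_MIN_SPEAKERS must be at least 1")
    if max_speakers < min_speakers:
        raise ValueError("DIARIZATION_MAX_SPEAKERS must be >= DIARIZATION_MIN_SPEAKERS")
    return min_speakers, max_speakers


if __name__ == "__main__":
    interview = {"DIARIZATION_MIN_SPEAKERS": "2", "DIARIZATION_MAX_SPEAKERS": "2"}
    print(speaker_bounds(interview))
```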

Usage

  1. Place your audio files in the data/audio/ folder inside the repo
  2. Run transcription:
python main.py

Supported formats: .wav, .mp3, .flac, .m4a, .ogg, .webm
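To check which files in the input folder would be picked up, something like the following can help. It is a sketch only: `find_audio_files` is a hypothetical helper, and the case-insensitive extension match is an assumption about how the project filters files.

```python
from pathlib import Path
from typing import List

# Extensions listed under "Supported formats"
SUPPORTED = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}


def find_audio_files(folder="data/audio") -> List[Path]:
    """Return supported audio files in `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


if __name__ == "__main__":
    for path in find_audio_files():
        print(path)
```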

Output

Results are saved to the results/ directory:

  • {filename}_complete.json - Full metadata, transcription, diarization, speaker stats
  • {filename}_transcript.txt - Human-readable transcript with speaker labels
  • {filename}_statistics.json - Processing metrics and speaker statistics

Architecture

Audio File
    |
    v
[Duration Check]
    |
    +---> Short (<10 min): Direct transcription
    |
    +---> Long (>10 min): Intelligent Chunking
              |
              v
         [Chunk at silence boundaries]
              |
              v
         [Transcribe each chunk]
              |
              v
         [Stitch transcriptions]
    |
    v
[Speaker Diarization]
    |
    +---> Short (<20 min): Direct diarization
    |
    +---> Long (>20 min): Chunked with 60s overlap
              |
              v
         [Merge & remap speakers]
    |
    v
[Align text with speakers]
    |
    v
[Save results]
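The "Align text with speakers" step in the diagram can be illustrated with a small overlap-based matcher. This is a sketch under assumptions rather than the project's alignment code: it assumes the ASR stage yields word-level timestamps as `(start_s, end_s, text)` tuples and the diarization stage yields `(start_s, end_s, speaker)` turns. The name `align` and the "UNKNOWN" fallback label are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (start_s, end_s, text)
Turn = Tuple[float, float, str]  # (start_s, end_s, speaker)


def align(words: List[Word], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Assign each word to the turn it overlaps most, then group by speaker."""
    lines: List[Tuple[str, str]] = []
    for w_start, w_end, text in words:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        if lines and lines[-1][0] == best:
            # Same speaker as the previous word: extend the current line.
            lines[-1] = (best, lines[-1][1] + " " + text)
        else:
            lines.append((best, text))
    return lines


if __name__ == "__main__":
    words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there"), (1.1, 1.5, "hi")]
    turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
    for speaker, text in align(words, turns):
        print(f"{speaker}: {text}")
```

Words that overlap no turn at all keep the fallback label, which makes gaps in the diarization visible in the transcript instead of silently attaching them to a neighbor.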

Troubleshooting

HuggingFace 403 Error: Request access to both pyannote models on HuggingFace

CUDA OOM:

  • Reduce ASR_BATCH_SIZE to 2
  • Reduce CHUNK_DURATION_SECONDS to 300 (5 min)
  • Set NUM_WORKERS=1
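The per-chunk cache clearing mentioned under Features follows a common PyTorch pattern, sketched below. This is not the repository's code: `free_gpu_memory` and `transcribe_chunks` are hypothetical names, and `transcribe_fn` stands in for whatever function runs the ASR model on one chunk. The guarded import only exists so the sketch also runs on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch installed
    torch = None


def free_gpu_memory() -> None:
    """Drop unreferenced Python objects, then release cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def transcribe_chunks(chunks, transcribe_fn):
    """Run transcribe_fn on each chunk, freeing GPU memory after every chunk."""
    texts = []
    for chunk in chunks:
        try:
            texts.append(transcribe_fn(chunk))
        finally:
            # Runs even if transcription fails, so memory is not left cached.
            free_gpu_memory()
    return texts


if __name__ == "__main__":
    # A stand-in "model" that upper-cases its input.
    print(transcribe_chunks(["chunk one", "chunk two"], str.upper))
```

`torch.cuda.empty_cache()` only returns memory that PyTorch has cached but is no longer using, so tensors still referenced by the caller must be released first for it to have an effect.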

Infinite loop in chunking: Fixed in this version with a maximum iteration limit
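One way to get silence-aware splitting that cannot loop forever is to force every cut to advance by a minimum amount and to cap the number of iterations. The sketch below illustrates that idea and is not the project's algorithm: `plan_chunks`, `min_chunk_s`, and the precomputed list of silence timestamps are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def plan_chunks(
    total_s: float,
    silences: Sequence[float],   # candidate cut points in seconds (assumed input)
    max_chunk_s: float = 600.0,  # matches CHUNK_DURATION_SECONDS=600
    min_chunk_s: float = 1.0,    # assumed minimum progress per cut
) -> List[Tuple[float, float]]:
    """Split [0, total_s] into chunks, preferring cuts at silence points."""
    if min_chunk_s <= 0 or max_chunk_s < min_chunk_s:
        raise ValueError("need 0 < min_chunk_s <= max_chunk_s")

    chunks: List[Tuple[float, float]] = []
    start = 0.0
    # Every cut advances by at least min_chunk_s, so this bound is never hit
    # in normal operation; it is a safety net against logic errors.
    max_iters = int(total_s / min_chunk_s) + 1
    for _ in range(max_iters):
        if total_s - start <= max_chunk_s:
            chunks.append((start, total_s))
            return chunks
        limit = start + max_chunk_s
        candidates = [s for s in silences if start + min_chunk_s <= s <= limit]
        # Latest usable silence, or a hard cut when no silence is available.
        cut = max(candidates) if candidates else limit
        chunks.append((start, cut))
        start = cut
    raise RuntimeError("chunk planning exceeded its iteration limit")


if __name__ == "__main__":
    print(plan_chunks(1500.0, [550.0, 590.0, 1100.0]))
```

Falling back to a hard cut at the chunk limit trades a possibly mid-word split for the guarantee that the loop always makes progress.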

Credits

Based on maaz328/speech-diarization-transcription with modifications for long audio support and memory optimization.

License

MIT
