# AURA Technical Stream: Prosody-Based Emotion Classifier

A prosody-based emotion classifier built on the HuBERT (Hidden-Unit BERT) model from Hugging Face Transformers. It classifies emotions in speech audio from datasets such as RAVDESS and CREMAD, maps each dataset's labels onto a common 7-label set (neutral, happy, angry, sad, disgust, fear, excited), and provides VAD (Valence-Arousal-Dominance) mappings for each label.
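To make the two label maps concrete, here is an illustrative sketch of their shape. The actual values live in `configs/label_maps/`, and every entry below (including the VAD numbers) is hypothetical:

```python
# Illustrative sketch of the two label maps described above. The real values
# live in configs/label_maps/ and may differ from these hypothetical entries.

# Dataset-specific label -> common 7-label set
dataset_to_common7 = {
    "RAVDESS": {"neutral": "neutral", "calm": "neutral", "happy": "happy",
                "sad": "sad", "angry": "angry", "fearful": "fear",
                "disgust": "disgust", "surprised": "excited"},
    "CREMAD": {"NEU": "neutral", "HAP": "happy", "SAD": "sad",
               "ANG": "angry", "FEA": "fear", "DIS": "disgust"},
}

# Common label -> (valence, arousal, dominance), each in [0, 1] (hypothetical values)
common7_to_vad = {
    "neutral": (0.5, 0.3, 0.5),
    "happy":   (0.9, 0.7, 0.6),
    "angry":   (0.2, 0.9, 0.8),
    "sad":     (0.2, 0.3, 0.3),
    "disgust": (0.2, 0.6, 0.6),
    "fear":    (0.2, 0.8, 0.2),
    "excited": (0.8, 0.9, 0.6),
}

def to_common_label(dataset: str, raw_label: str) -> str:
    """Map a dataset-specific emotion label to the common 7-label set."""
    return dataset_to_common7[dataset][raw_label]
```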
## Prerequisites

- Python 3.8 or higher
- pip package manager
- Git
- Sufficient disk space for datasets (RAVDESS ~200MB, CREMAD ~1GB+)
## Step 1: Get the Code

If you have access to the Git repository:

```bash
git clone <repository-url>
cd Prosody-Emotion-Classifier
```

Otherwise, navigate to the project directory.
## Step 2: Set Up a Virtual Environment

It is recommended to use a virtual environment to avoid dependency conflicts.

Windows:

```bash
python -m venv venv
venv\Scripts\activate
```

Linux/Mac:

```bash
python3 -m venv venv
source venv/bin/activate
```

## Step 3: Install Dependencies

Install all required packages from the requirements file:

```bash
pip install -r requirements.txt
```

This will install:
- PyTorch and TorchAudio
- Transformers (Hugging Face)
- Librosa (audio processing)
- SoundFile (audio I/O)
- Pandas, NumPy, SciPy
- Scikit-learn
- And other dependencies
Note: If you have a CUDA-capable GPU and want to use it for training, you may need to install PyTorch with CUDA support separately.
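To confirm whether PyTorch can actually see your GPU, a quick check like the following works; it prints `cpu` when no CUDA device is available, in which case training falls back to the CPU:

```python
import torch

# Report whether a CUDA-capable GPU is visible to PyTorch; training falls
# back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```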
## Step 4: Download the Datasets

You need to download the emotion speech datasets. This project supports the RAVDESS and CREMAD datasets.

### RAVDESS

- Download the dataset from the link provided under `data/RAVDESS` and extract it into the `data` directory.
Expected structure:

```
data/
└── RAVDESS/
    ├── Actor_01/
    │   ├── 03-01-01-01-01-01-01.wav
    │   ├── 03-01-01-01-01-02-01.wav
    │   └── ...
    ├── Actor_02/
    └── ...
```
### CREMAD

- Download the dataset from the link provided under `data/CREMAD` and extract it into the `data` directory.

Expected structure:

```
data/
└── CREMAD/
    ├── 1001_DFA_ANG_XX.wav
    ├── 1001_DFA_DIS_XX.wav
    ├── 1001_DFA_FEA_XX.wav
    └── ...
```
Note: You can use either dataset or both. The project works with one alone, but using both provides greater diversity in the training data.
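Both datasets encode the emotion label in the filename itself. As an illustration, here is a hedged sketch of parsers for the two naming schemes, assuming the standard RAVDESS seven-field names (where the third field is the emotion code) and the usual CREMA-D `ActorID_Sentence_Emotion_Level` layout; verify these against your downloaded data:

```python
# Parse emotion labels out of RAVDESS- and CREMA-D-style filenames.
# Assumes the standard naming schemes; verify against your downloaded data.

RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_emotion(filename: str) -> str:
    """RAVDESS names look like 03-01-05-01-01-01-12.wav;
    the third hyphen-separated field is the emotion code."""
    fields = filename.split(".")[0].split("-")
    return RAVDESS_EMOTIONS[fields[2]]

def cremad_emotion(filename: str) -> str:
    """CREMA-D names look like 1001_DFA_ANG_XX.wav; the third
    underscore-separated field is a 3-letter emotion code
    (ANG, DIS, FEA, HAP, NEU, SAD)."""
    return filename.split(".")[0].split("_")[2]
```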
## Step 5: Generate Metadata

After placing the audio files in the correct directories, generate a metadata CSV file for each dataset.

For RAVDESS:

```bash
python src/make_metadata_ravdess.py
```

For CREMAD:

```bash
python src/make_metadata_cremad.py
```

This creates a `metadata.csv` file in each dataset directory (`data/RAVDESS/metadata.csv` and `data/CREMAD/metadata.csv`).

The metadata files contain the columns `utt_id`, `wav_path`, `speaker_id`, and `emotion_label`.
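Once generated, the metadata is easy to inspect with pandas. The snippet below uses a tiny in-memory stand-in with the same columns; in practice you would read the real file with `pd.read_csv("data/RAVDESS/metadata.csv")`:

```python
import pandas as pd

# Tiny in-memory stand-in for a generated metadata.csv (same columns as
# listed above); in practice: pd.read_csv("data/RAVDESS/metadata.csv").
meta = pd.DataFrame([
    {"utt_id": "ravdess_0001",
     "wav_path": "data/RAVDESS/Actor_01/03-01-01-01-01-01-01.wav",
     "speaker_id": "Actor_01", "emotion_label": "neutral"},
    {"utt_id": "ravdess_0002",
     "wav_path": "data/RAVDESS/Actor_01/03-01-03-01-01-01-01.wav",
     "speaker_id": "Actor_01", "emotion_label": "happy"},
])

# Per-emotion counts are a quick sanity check after metadata generation.
counts = meta["emotion_label"].value_counts().to_dict()
print(counts)
```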
## Step 6: Generate Splits

Generate split files that divide the data into training (70%), validation (15%), and test (15%) sets based on speaker IDs.

For RAVDESS:

```bash
python src/make_splits.py RAVDESS
```

For CREMAD:

```bash
python src/make_splits.py CREMAD
```

This creates JSON files in `configs/splits/` (`RAVDESS_splits.json` and `CREMAD_splits.json`) that map each utterance ID to its split (train/val/test).

Note: The splits are speaker-independent: all utterances from a given speaker land in the same split, which prevents data leakage between training and evaluation.
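A speaker-independent split can be sketched as follows. This mirrors the strategy described above but is not necessarily the exact logic of `src/make_splits.py`:

```python
import random

def make_speaker_splits(meta_rows, seed=42, train=0.70, val=0.15):
    """Assign every utterance of a speaker to the same split
    (70/15/15 by speaker). A sketch of the speaker-independent
    strategy, not the exact logic of src/make_splits.py."""
    speakers = sorted({r["speaker_id"] for r in meta_rows})
    random.Random(seed).shuffle(speakers)
    n_train = round(len(speakers) * train)
    n_val = round(len(speakers) * val)
    split_of = {}
    for i, spk in enumerate(speakers):
        if i < n_train:
            split_of[spk] = "train"
        elif i < n_train + n_val:
            split_of[spk] = "val"
        else:
            split_of[spk] = "test"
    # Map utterance IDs to the split of their speaker.
    return {r["utt_id"]: split_of[r["speaker_id"]] for r in meta_rows}

# Toy example: 40 utterances spread over 10 speakers.
rows = [{"utt_id": f"u{i}", "speaker_id": f"spk{i % 10}"} for i in range(40)]
splits = make_speaker_splits(rows)
```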
## Step 7: Check Configuration Files

Ensure the following configuration files exist and are properly set up:

- `configs/hubert_base.json` - Model training configuration
- `configs/label_maps/dataset_to_common_7.json` - Maps dataset-specific labels to the common 7-label set
- `configs/label_maps/common7_to_vad.json` - Maps common labels to VAD values

These files should already be present in the repository. You can review and modify `configs/hubert_base.json` to adjust training hyperparameters (batch size, learning rate, epochs, etc.).
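For orientation, a training configuration of this kind might look like the following. The keys and values here are purely illustrative; check the actual `configs/hubert_base.json` for the real schema:

```json
{
  "model_name": "facebook/hubert-base-ls960",
  "num_labels": 7,
  "batch_size": 8,
  "learning_rate": 3e-5,
  "num_epochs": 10,
  "seed": 42
}
```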
## Step 8: Verify Your Setup

Before training, verify your setup:

1. Check the data directories:
   - `data/RAVDESS/` should contain actor folders with `.wav` files
   - `data/CREMAD/` should contain `.wav` files
   - Both should have `metadata.csv` files
2. Check the split files:
   - `configs/splits/RAVDESS_splits.json` should exist (if using RAVDESS)
   - `configs/splits/CREMAD_splits.json` should exist (if using CREMAD)
3. Check the Python packages:

   ```bash
   python -c "import torch; import transformers; import librosa; print('All packages installed successfully!')"
   ```
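The file checks above can also be scripted. A minimal sketch, assuming it is run from the project root:

```python
from pathlib import Path

# Sanity check for the setup steps above: verify that each dataset
# directory you plan to use has a metadata.csv and a matching split file.
def check_setup(datasets=("RAVDESS", "CREMAD")):
    problems = []
    for name in datasets:
        if not Path(f"data/{name}/metadata.csv").is_file():
            problems.append(f"missing data/{name}/metadata.csv")
        if not Path(f"configs/splits/{name}_splits.json").is_file():
            problems.append(f"missing configs/splits/{name}_splits.json")
    return problems

if __name__ == "__main__":
    issues = check_setup()
    print("Setup OK" if not issues else "\n".join(issues))
```

Drop either dataset name from the `datasets` tuple if you are only using one of them.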
## Training

Once setup is complete, you can train the model on your chosen dataset(s):

```bash
python src/train_hubert_cls.py --dataset RAVDESS --config configs/hubert_base.json
python src/train_hubert_cls.py --dataset CREMAD --config configs/hubert_base.json
```

The trained model will be saved in:

- `models/RAVDESS_hubert_cls/` (for RAVDESS)
- `models/CREMAD_hubert_cls/` (for CREMAD)

Training writes a checkpoint after each epoch, and the best model (by F1 score on the validation set) is kept.

Note: Training can take several hours, depending on hardware.
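The "keep the best checkpoint" behavior boils down to tracking the running-best validation F1 across epochs. A simplified sketch (the real logic lives in `src/train_hubert_cls.py`):

```python
# Sketch of best-checkpoint selection by validation F1; illustrative only.
def track_best(f1_per_epoch):
    """Return (best_epoch, best_f1), i.e. which epoch's checkpoint survives."""
    best_epoch, best_f1 = -1, float("-inf")
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > best_f1:
            best_epoch, best_f1 = epoch, f1
            # Here the training script would overwrite the saved model in
            # models/<DATASET>_hubert_cls/ with this epoch's checkpoint.
    return best_epoch, best_f1
```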
## Inference

After training (or using a pre-trained model), you can run inference on an audio file:

```bash
python src/infer_speech.py <path_to_audio_file.wav>
```

This will output:

- Predicted emotion label
- Probability distribution over all emotion classes
- VAD (Valence-Arousal-Dominance) values
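One plausible way to derive the VAD values from the class probabilities is a probability-weighted average of per-emotion VAD anchors; whether `src/infer_speech.py` computes them exactly this way is an assumption:

```python
# Expected VAD under the predicted class distribution: sum over emotions of
# P(emotion) * that emotion's (valence, arousal, dominance) anchor.
# Whether infer_speech.py does exactly this is an assumption.
def probs_to_vad(probs, vad_table):
    vad = [0.0, 0.0, 0.0]
    for label, p in probs.items():
        for i, v in enumerate(vad_table[label]):
            vad[i] += p * v
    return tuple(vad)

# Toy example with two classes and hypothetical anchors.
vad_table = {"happy": (0.9, 0.7, 0.6), "sad": (0.2, 0.3, 0.3)}
result = probs_to_vad({"happy": 0.75, "sad": 0.25}, vad_table)
```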
Example:
```bash
python src/infer_speech.py data/RAVDESS/Actor_01/03-01-01-01-01-01-01.wav
```

## Project Structure

```
Prosody-Emotion-Classifier/
├── configs/                  # Configuration files
│   ├── hubert_base.json      # Training configuration
│   ├── label_maps/           # Label mapping files
│   └── splits/               # Train/val/test split files
├── data/                     # Dataset directories (created by user)
│   ├── RAVDESS/              # RAVDESS dataset files
│   └── CREMAD/               # CREMAD dataset files
├── models/                   # Trained models (created during training)
├── src/                      # Source code
│   ├── dataset_audio.py      # Dataset utilities
│   ├── infer_speech.py       # Inference script
│   ├── make_metadata_*.py    # Metadata generation scripts
│   ├── make_splits.py        # Split generation script
│   └── train_hubert_cls.py   # Training script
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Troubleshooting

**Missing `metadata.csv` files**

Solution: Run the metadata generation scripts (Step 5) after downloading and placing the datasets.

**Missing split files**

Solution: Run the split generation script (Step 6) after creating the metadata files.

**Out-of-memory errors during training**

Solution: Reduce the batch size in `configs/hubert_base.json` or use a smaller model. You can also train on the CPU (slower, but it works).

**Audio files not found**

Solution: Verify that:

- The datasets are extracted correctly
- The files follow the directory structure shown above
- The file paths in `metadata.csv` are correct (they should be relative paths)