Toolkit for Audiovisual Speaker Diarization in Noisy Environments, Speech Feature Extraction, and Well-Being Prediction
Repository for the master's thesis of Tobias Zeulner: Leveraging Speech Features for Automated Analysis of Well-Being in Teamwork Contexts
About this Project • Key Features • How To Use • License
Current methods for assessing employee well-being rely primarily on irregular and time-consuming surveys, which limits proactive measures and support. This thesis addresses the problem by developing predictive algorithms that automatically assess well-being from audio data collected in teamwork contexts. A dataset of 56 participants who worked in teams over a four-day period was curated. Well-being was measured using Seligman's PERMA framework, which consists of five pillars: positive emotion, engagement, relationships, meaning, and accomplishment. An audiovisual speaker diarization system was developed to enable the calculation of speech features at the individual level in a noisy environment. After extracting and selecting the most relevant features, regression and classification algorithms were trained to predict well-being.
The best-performing model for each PERMA pillar is a two-class classifier. It achieves the following balanced accuracies: P: 78%, E: 50%, R: 74%, M: 61%, and A: 50%.
The entire pipeline (see image below) and final models are provided in this GitHub repository.
The four main building blocks of this toolbox are shown in the figure below.
- Input Video:
  - mp4 or avi file
  - Stored in `src/audio/videos`
  - Filename provided in `configs/config.yaml`
  - Ideally 25 fps (otherwise processing takes longer)
- Output of Audiovisual Speaker Diarization:
  - 1 folder with the same name as the video (`src/audio/videos/VIDEONAME`), containing all current and future results
  - 3 important files in this folder:
    - RTTM file ("who spoke when"); see the example after this list
    - Log file (for troubleshooting)
    - "faces_id" folder, which contains all recognized speakers and their corresponding IDs from the RTTM file
- Output of Communication Pattern & Emotion Feature Calculation:
  - 1 csv file named "VIDEONAME_audio_analysis_results.csv", containing one row per speaker with the corresponding feature values over time as columns
- Output of Feature Visualization:
- 3 line charts for visualization of the feature values contained in the csv file
- 3 features are plotted per chart (i.e., 9 time series in total)
- Output of Well-Being Prediction:
- 1 csv file for the PERMA classification results (low/high well-being)
- 1 csv file for the PERMA regression results (continuous well-being scores, on either a 0-1 or a 1-7 scale)
- 1 plot to visualize the regression results (also saved as “perma_spider_charts.png”)
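Each line in the RTTM file describes one speech segment in the standard Rich Transcription Time Marked format. A sketch of what an entry looks like (the file name, timestamps, and speaker ID below are illustrative, not taken from an actual run):

```
SPEAKER VIDEONAME 1 12.35 4.20 <NA> <NA> 2 <NA> <NA>
```

Here the segment starts at 12.35 s, lasts 4.20 s, and is attributed to speaker ID 2, the same ID that appears in the "faces_id" folder.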
The parts can be run separately if, for example, the prediction of well-being is not required but other downstream tasks such as the prediction of team performance are.
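For such downstream analyses, the feature CSV from step 2 can be loaded directly. A minimal sketch, assuming the file ends up in the video's results folder (the exact column names depend on the extracted features and are not assumed here):

```python
import pandas as pd

video_name = "VIDEONAME"  # same base name as specified in configs/config.yaml
csv_path = f"src/audio/videos/{video_name}/{video_name}_audio_analysis_results.csv"

df = pd.read_csv(csv_path)
print(df.shape)          # one row per detected speaker
print(list(df.columns))  # feature-over-time columns
```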
If you wish to exclude an individual from the analysis (e.g., a random person in the background or someone who did not give informed consent), you can do so as follows:
- Perform only step 1 of the pipeline.
- Delete the person's image in the `src/audio/videos/VIDEONAME/faces_id` folder.
- Perform the remaining steps of the pipeline (2, 3, 4). From then on, the corresponding person will be excluded from the analysis.
If you want to replace a person's ID with their real name, proceed as follows:
- Perform only step 1 of the pipeline.
- Rename the corresponding file in the `src/audio/videos/VIDEONAME/faces_id` folder by adding two underscores after the ID, followed by the name (e.g., change "2.jpg" to "2__john.jpg"); see the example after this list.
- Execute the remaining steps of the pipeline (2, 3, 4). From then on, the analysis will use the real name instead of the ID.
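Both of the operations above are plain file-system operations. On macOS/Linux they might look like this (the speaker IDs and the name are illustrative):

```
# exclude the speaker with ID 3 from the analysis
rm src/audio/videos/VIDEONAME/faces_id/3.jpg

# keep speaker 2, but label them "john" in all downstream outputs
mv src/audio/videos/VIDEONAME/faces_id/2.jpg src/audio/videos/VIDEONAME/faces_id/2__john.jpg
```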
- I recommend using the same Python version as me (3.8.10) to avoid conflicts. After cloning this GitHub repo, I also recommend setting up a new virtual environment using the venv module (in the same directory as the `main.py` file).
How to set up a virtual environment in Python 3 using the venv module (Windows):

```
python -m venv venv
.\venv\Scripts\activate
```
How to set up a virtual environment in Python 3 using the venv module (macOS/Linux):

```
python3 -m venv venv
source venv/bin/activate
```
- Then, install ffmpeg (which is needed to process the video recordings). How to install ffmpeg on Windows/Linux/macOS:
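The usual package-manager routes work; a few common options (pick the one matching your system, or download a build from ffmpeg.org):

```
# macOS (Homebrew)
brew install ffmpeg

# Debian/Ubuntu
sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg
```

If your recording is not at 25 fps, ffmpeg can also re-encode it beforehand, e.g. `ffmpeg -i input.mp4 -r 25 output.mp4`.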
- Install the required packages:

```
pip install -r requirements.txt
```

or

```
pip3 install -r requirements.txt
```

depending on your Python installation.
- To process a video using this tool, follow the steps below (if you are using it for the first time, you can leave the initial value in the configuration file (001) and go directly to the next step):
  - Video Placement: Place the video you wish to process in the `src/audio/videos` directory. Ensure that the video file is in a format compatible with the project (mp4 or avi).
  - Configuration File: Open the `configs/config.yaml` file. This file contains various parameters that control the processing of the video (you can inspect the available parameters with the snippet after this list).
  - Video Specification: In the configuration file, specify the filename of the video you placed in the `src/audio/videos` directory. Do not include the file extension. For instance, if your video file is called "my_video.mp4", enter "my_video".
  - Parameter Adjustment: Review the other parameters in the configuration file. These control various aspects of the video processing, and you may adjust them as necessary to suit your specific needs.
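A quick, generic way to see which parameters the configuration file exposes is to load it with PyYAML (this sketch does not assume any specific key names):

```python
# Print every parameter currently set in configs/config.yaml
import yaml

with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

for key, value in config.items():
    print(f"{key}: {value}")
```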
- Run the main file:

```
python main.py
```

or

```
python3 main.py
```

depending on your Python installation.
Notes:
- Should you face errors indicating a problem with multiprocessing, set "MULTIPROCESSING" to "False" in the config file
- When running the script for the first time, all required machine learning models will be downloaded automatically
- Running the script on a GPU can accelerate the runtime by a factor of 4x-8x (the script automatically detects whether a CUDA device is available). Due to PyTorch constraints, only NVIDIA GPUs are supported (no M1/M2 GPUs); see the check below.
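If you want to verify up front whether PyTorch sees your GPU, this standard check works (no project-specific assumptions):

```python
import torch

# True only if an NVIDIA GPU with a working CUDA setup is visible to PyTorch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```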
Have fun! 😎
If you encounter any issues, please reach out to me or open a new issue.
Distributed under the MIT License. See LICENSE for more information.
Email: [email protected] · LinkedIn: Tobias Zeulner