GitHub - moderatelyfunctional/Obama-Speech-Dataset

Obama Speech Dataset

Running The Script

The dataset here contains Obama's speeches from The Obama White House YouTube channel.

To run the script, first install FFmpeg on your operating system. Then run:

pip3 install -r requirements.txt
python process.py
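Since the script depends on FFmpeg being installed, it can help to confirm the binary is reachable before starting a long run. The following is a minimal sketch, not part of the repository; the helper name `find_tool` is hypothetical, and it assumes the executable is named `ffmpeg` on your PATH.

```python
import shutil
from typing import Optional


def find_tool(name: str = "ffmpeg") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("FFmpeg was not found on PATH; install it before running process.py")
    else:
        print(f"Found FFmpeg at {path}")
```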

Data Overview

In total, the dataset contains around 28GB of speech data spread across 30K wav files, each 1-7 seconds long.

Script Overview

The script iterates through obama_speeches.csv, fetches each YouTube video, and uses FFmpeg to convert it to audio. It then fetches the corresponding timestamps. Both the audio and the timestamps are stored in the input_data directory.
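The conversion step can be pictured as a single FFmpeg invocation per video. The sketch below only builds the command; it is an illustration rather than the code in process.py. The function name `build_ffmpeg_command`, the file paths, and the mono 22050 Hz settings are all assumptions, and the real script may use different options.

```python
from typing import List


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 22050) -> List[str]:
    """Build an FFmpeg command that extracts mono wav audio from a video file.

    The sample rate and channel count are assumed values for illustration.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample the audio
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical paths; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(build_ffmpeg_command("input_data/VIDEO_ID.mp4", "input_data/VIDEO_ID.wav")))
```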

Next, it creates a directory in output_data named after the VIDEO_ID and splits the audio into short wav files. Each wav file is recorded as a line in data.txt in the form output_data/video_id/\d+.wav|text of the wav.
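The data.txt line format described above can be read back with a small parser. This is a sketch under the assumption that every line follows the documented form exactly; the names `Clip` and `parse_line` are hypothetical and do not come from the repository.

```python
import re
from typing import NamedTuple, Optional

# Mirrors the documented form: output_data/<video_id>/<digits>.wav|<text>
LINE_RE = re.compile(
    r"^output_data/(?P<video_id>[^/]+)/(?P<index>\d+)\.wav\|(?P<text>.*)$"
)


class Clip(NamedTuple):
    path: str      # relative path to the wav file
    video_id: str  # directory name under output_data
    index: int     # numeric file name of the clip
    text: str      # transcript of the clip


def parse_line(line: str) -> Optional[Clip]:
    """Parse one data.txt line; return None if it does not match the format."""
    line = line.rstrip("\n")
    match = LINE_RE.match(line)
    if match is None:
        return None
    path = line.split("|", 1)[0]
    return Clip(path, match["video_id"], int(match["index"]), match["text"])
```

For example, `parse_line("output_data/abc123/7.wav|hello world")` yields a `Clip` whose `video_id` is `abc123` and whose `index` is `7`. Lines that do not match return `None` so that callers can decide whether to skip or report them.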

Post Processing

Run trim.py to create train/val files from data.txt.
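For readers who want a sense of what a train/val split over data.txt might involve, here is a minimal sketch. It is not the logic of trim.py, whose details are not described here; the function name `split_train_val`, the 10% validation fraction, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_train_val(
    lines: Sequence[str], val_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle lines reproducibly and split them into (train, val) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    with open("data.txt", encoding="utf-8") as f:
        entries = [line.rstrip("\n") for line in f if line.strip()]
    train, val = split_train_val(entries)
    print(f"{len(train)} training lines, {len(val)} validation lines")
```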


