
Data Curation

Data curation is the process of building a training data set: a set of sounds and the associated labeling for those sounds, either for a whole wav file or for one or more segments of a file. The end goal of this process is a set of sound files that is comprehensive (i.e., captures all or most states of the environment that have an associated acoustic signature), together with a set of accurate labels assigned to those files (or segments). In addition to creating an initial training set, a model is typically grown over time by incorporating new labeled sounds to provide broader coverage of the environment being monitored, and thus a better model. See Labeling for a discussion of the format used to define labels over files and segments.
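For orientation, each row of a metadata.csv file ties a wav file, or a bracketed segment of one, to one or more label=value pairs; judging by the examples later on this page, the segment offsets appear to be in milliseconds. Two hypothetical rows for a label named state:

myaudio.wav[0-1000],state=idle,
myaudio.wav[1000-4000],state=running,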

Labeling with Audacity

Audacity is an open source UI tool for working with audio files. Among its many capabilities are playback, visualization, and labeling of audio segments. Our focus here is on using Audacity to define labels and convert them into the metadata.csv file format required by the CLI tools.

After opening an audio file (and viewing its spectrogram), we can apply label values as follows:

  1. Open Audacity on a selected wav file (e.g. myaudio.wav).
  2. Click in the spectrogram and drag to select a segment of the sound.
  3. Use the Edit->Label->Add option or its shortcut to create and label the selected segment.
  4. Use the Export->Labels option to save the labels as a text file to disk (e.g. myaudio.txt); a sample is shown after this list.
  5. Use the audacity2metadata CLI to convert the Audacity labels file to a metadata.csv file:
    • audacity2metadata -label state -wav myaudio.wav -audacity myaudio.txt > metadata.csv
    • Note: this requires Python 3. If your default python command is not Python 3, set the PYTHON env var to the name of your Python 3 executable (e.g. export PYTHON=python3.8).
  6. Use the sound-info CLI to test the metadata file:
    • sound-info -sounds metadata.csv
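As a sample for step 4, Audacity exports labels as tab-separated text, with start and end times in seconds followed by the label text. A hypothetical myaudio.txt with two labeled segments might look like:

0.000000	1.500000	idle
1.500000	4.000000	running

The audacity2metadata command in step 5 then turns these rows into metadata.csv entries carrying the state label.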

Training the First Model and Identifying Labeling Errors

After creating your first training set (sound files and a metadata.csv file) and tuning your model definition (mymodel.js) using the evaluate tool, you will want to train your first model. Once built, this model can be used to help identify potential errors in labeling. Errors might include missing labels, incorrect labels, or problems with the size of the segments to which the labels apply. These can be identified by using the initial model to classify the training data and looking for inconsistencies between what the model produces and the defined labeling. Using the CLI tools, this process might look as follows.

First, use the train tool to create a model and store it in mymodel.cfr.

train -sounds metadata.csv -clipLen 1000 -model mymodel.js -label source -output mymodel.cfr

This trained model can then be used to classify the training data, using the classify tool,

classify  -sounds metadata.csv -clipLen 1000 -file mymodel.cfr -compare | grep '!='

This produces one line for each clip that was classified by the model differently than the ground truth labeling. For example,

mywav.wav[1000-2000]: class=voice(!=ambient) confidence=0.4850
ambient.wav[2000-3000]: class=click(!=ambient) confidence=0.3828
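To get a quick count of the disagreements rather than the full list, the same output can be piped through grep -c:

classify -sounds metadata.csv -clipLen 1000 -file mymodel.cfr -compare | grep -c '!='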

For a perfect model with 100% accurate labeling, this output would be empty: the model would match the ground truth exactly. In practice, however, models are generally not 100% accurate on large data sets even when the training data is perfectly labeled, so some reported mismatches may always be present. Given the potential mis-labelings above, one can now go back and review them using Audacity. If errors are found, import the Audacity labels file, edit the labels, and re-save the file. If any Audacity label files are modified, the audacity2metadata tool must be rerun (see above) to regenerate the metadata.csv file, as sketched below.
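A minimal sketch of that regeneration step, assuming each edited labels file sits alongside a wav file of the same base name, and that the labels define the state label as in the earlier example:

for i in *.txt; do
   audacity2metadata -label state -wav "${i%.txt}.wav" -audacity "$i"
done > metadata.csv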

Automatically Labeling New Unlabeled Data

The label CLI tool can label new, unlabeled data. It uses a trained model to first label segments of a sound, and then groups adjacent segments having the same labels to produce output in metadata.csv-style format. For example, to label new sounds with the model trained above, using the same clip length that was used to train the model:

label -file mymodel.cfr -clipLen 1000 mynewsound1.wav mynewsound2.wav

might produce the following:

mynewsound1.wav[0-1000],source=engine,
mynewsound1.wav[1000-4000],source=compressor,
mynewsound1.wav[4000-5000],source=machine,
mynewsound2.wav[0-4000],source=compressor, 
mynewsound2.wav[4000-5000],source=machine, 
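Since label writes these rows to standard output, they can be captured for the conversion step that follows:

label -file mymodel.cfr -clipLen 1000 mynewsound1.wav mynewsound2.wav > mynewsounds.csv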

Note that all segments are integer multiples of the clip length used. At this point the model-applied labeling really needs to be reviewed to verify its correctness. Simple mis-labelings can occur, but segment boundaries may also need to be adjusted to more accurately match the correct labels. To review these labels in Audacity, they must be converted to Audacity-formatted label files, one for each sound file. The metadata2audacity CLI tool makes this conversion. For example, assuming the above output of label was saved into the file mynewsounds.csv (as in the redirect above),

metadata2audacity mynewsounds.csv
Created mynewsound1.txt
Created mynewsound2.txt

Then for each sound, open it in Audacity and import and review the associated labels file (e.g., mynewsound1.txt). If any changes are made to the Audacity label files, then use the audacity2metadata tool to regenerate a new and improved csv label file (e.g. mynewsounds-reviewed.csv). For example,

for i in mynewsound*.txt; do
   wav="${i%.txt}.wav"   # derive the wav file name from the labels file name
   audacity2metadata -wav "$wav" -audacity "$i" -label source >> mynewsounds-reviewed.csv
done