ModelTuning
The Ensemble algorithm's default configuration is intended to produce a good working model without requiring a data scientist. In some cases, however, model parameter tuning (aka hyper-parameter tuning) may be required. This can be guided by insight into the data set, for example, knowing that a specific range of frequencies is most relevant.
In addition to data set insights, it may be useful to use a brute-force mechanism to try various combinations of model parameters. This is sometimes known as a grid search. The evaluate CLI provides a mechanism to do this. One simply writes a JavaScript file that defines the algorithm/featuregram extraction combinations. Referring back to the [feature extraction pipeline](Classifier-Definition), there are a number of pieces that can be varied, as follows:
- Feature Gram Extraction Parameters
  - subwindow size (defined in msec)
  - window hop size (defined in msec)
  - Features (e.g., FFT, MFCC, MFFB, etc)
  - Feature Processor (e.g., DeltaFeatureProcessor)
- Algorithms (e.g., GMM, LpNN, CNN, DCASE, etc)
Finally, most models benefit from a balanced training set (i.e., equal numbers of samples/duration for each label value). The train and evaluate CLI tools allow for balancing training data.
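The balancing options of the train and evaluate tools are not shown on this page, so the following is only a minimal sketch of the idea in plain JavaScript. It assumes balancing by sample count, done by down-sampling every label to the size of the smallest one; the function name and the data layout are hypothetical and not part of the toolkit.

```javascript
// Hypothetical helper (not a toolkit API): balance a labeled data set
// by down-sampling. clipsByLabel maps a label value to an array of clips.
function balanceByDownSampling(clipsByLabel) {
    var labels = Object.keys(clipsByLabel);
    // Find the size of the smallest label group.
    var minCount = Infinity;
    labels.forEach(function (label) {
        minCount = Math.min(minCount, clipsByLabel[label].length);
    });
    // Keep only the first minCount clips of every label.
    var balanced = {};
    labels.forEach(function (label) {
        balanced[label] = clipsByLabel[label].slice(0, minCount);
    });
    return balanced;
}

var balancedSet = balanceByDownSampling({
    normal: ["n1.wav", "n2.wav", "n3.wav", "n4.wav"],
    abnormal: ["a1.wav", "a2.wav"]
});
console.log(balancedSet.normal.length, balancedSet.abnormal.length); // 2 2
```

Balancing by duration rather than by count would follow the same pattern, summing clip lengths instead of counting clips.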
To capture essential characteristics of sounds, a feature extractor is applied to the raw waveform. The result of feature extraction is then used as the input of a classifier to predict the label (category) of the sound clip. A goal of feature extraction is to provide a feature vector that is much smaller than the raw waveform of the sound, which allows us to use classifiers with a relatively small input size to capture essential sound information while avoiding overfitting. Note that a classifier with a very large input space is prone to overfitting because it has too many degrees of freedom, and it also has high complexity, which is undesirable.
Typical sound feature extractors include fast Fourier transform (FFT), mel-frequency filter bank (MFFB), LogMel, and mel-frequency cepstral coefficients (MFCC). FFT is the first processing step of MFFB and MFCC and generates a frequency-domain spectrum of the sound. Because the perception of most sounds is roughly uniform on a logarithmic frequency scale (i.e., a human may notice an equal degree of difference between two sounds of 500 Hz and 1000 Hz as between two other sounds of 1000 Hz and 2000 Hz), MFFB takes the result of the FFT and converts the linear frequency scale to a logarithmic scale. In addition, MFFB applies triangular filters to further reduce the feature vector length; the resulting feature length is known as the number of frequency bands and can be specified using the numBands parameter. Typical values of numBands range from 20 to 200. This logarithmic sensitivity often applies to amplitude as well as frequency. To capture this, the LogMel feature extractor converts the power of each frequency band (obtained from MFFB) to a logarithmic scale, expressed in decibels. Finally, MFCC adds a discrete cosine transform (DCT) step, applied to the LogMel output, that concentrates most of the information in the first few components of the feature vector.
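To make the decibel step concrete, here is a minimal sketch of the standard power-to-decibel conversion in plain JavaScript. It is not the toolkit's LogMel implementation; the function name, the reference power of 1.0 and the small floor value are assumptions chosen for illustration.

```javascript
// Convert a power value to decibels relative to a reference power.
// The floor avoids taking the logarithm of zero. Both the reference and
// the floor are illustrative assumptions, not toolkit defaults.
function powerToDecibels(power, referencePower, floor) {
    referencePower = referencePower || 1.0;
    floor = floor || 1e-10;
    return 10 * Math.log10(Math.max(power, floor) / referencePower);
}

// Band powers spanning six orders of magnitude...
var bandPowers = [0.001, 1, 1000];
// ...are compressed to a span of roughly 60 dB.
var bandDecibels = bandPowers.map(function (p) {
    return powerToDecibels(p);
});
console.log(bandDecibels);
```

The compression of a very wide range of powers into a narrow range of decibels is what makes the log scale less dominated by the loudest bands.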
In general, the degree of compression increases in the order FFT -> MFFB -> LogMel -> MFCC. A highly compressed feature such as MFCC can be useful when there is only a small number of training samples (to avoid overfitting the classifier) and when the sounds are "natural" (i.e., those you hear in daily life). A less compressed feature such as FFT or MFFB may be useful if you have a large number of training samples and when the sounds are "unnatural" (e.g., the specific application requires the classifier to be sensitive to linear amplitude instead of log amplitude). In general, shallow models such as GMM and LpNN, which require less data, may best be configured to use MFCC, while deep models such as CNN or DCASE, which require more data, may be best configured with some form of MFFB or FFT.
Sub-windows are extracted from the sound, and features are computed on each sub-window to create the feature gram/spectrogram. Both the size and the shift of the sub-windows can be defined.
The sizes of the sub-windows can vary from as low as 20 msec to 500 msec (in general). The larger the size, the more low frequencies can be captured. So if you have important signatures at lower frequencies, you may want longer sub-windows.
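The link between sub-window length and low-frequency detail follows from a general property of the Fourier transform: the spacing between resolvable frequencies is roughly the inverse of the window duration. The sketch below illustrates that relationship only; the function name is hypothetical, and it ignores any windowing or padding the toolkit may apply.

```javascript
// Approximate frequency resolution (Hz) of a sub-window of the given
// length in msec: the inverse of the window duration in seconds.
function frequencyResolutionHz(windowSizeMsec) {
    return 1000 / windowSizeMsec;
}

[20, 40, 250, 500].forEach(function (msec) {
    console.log(msec + " msec -> " + frequencyResolutionHz(msec) + " Hz");
});
// 20 msec -> 50 Hz
// 40 msec -> 25 Hz
// 250 msec -> 4 Hz
// 500 msec -> 2 Hz
```

Under this approximation, a 20 msec sub-window cannot separate two tones that are closer than about 50 Hz, while a 500 msec sub-window can separate tones about 2 Hz apart.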
The hop size, or window shift, is the distance between the starting points of sequential windows in the spectrogram. Setting the hop size equal to the window size results in perfectly aligned windows with no overlap. Common values are half the sub-window size (50% overlap) or the full window size (non-overlapping). Values must be larger than 0 and are generally less than or equal to the window size.
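The effect of the hop size on the width of the spectrogram can be sketched as follows. This is an illustration in plain JavaScript, assuming that only complete sub-windows are kept (the toolkit may handle the end of a clip differently); the function name is hypothetical.

```javascript
// Number of complete sub-windows that fit in a clip, and the fraction of
// each sub-window that overlaps the next one.
function subWindowLayout(clipMsec, windowMsec, hopMsec) {
    if (hopMsec <= 0) {
        throw new Error("hop size must be larger than 0");
    }
    var count = clipMsec < windowMsec
        ? 0
        : Math.floor((clipMsec - windowMsec) / hopMsec) + 1;
    var overlap = Math.max(0, 1 - hopMsec / windowMsec);
    return { count: count, overlap: overlap };
}

// A 1 second clip with 40 msec sub-windows:
console.log(subWindowLayout(1000, 40, 40)); // { count: 25, overlap: 0 }
console.log(subWindowLayout(1000, 40, 20)); // { count: 49, overlap: 0.5 }
```

Halving the hop size roughly doubles the number of sub-windows, and therefore the amount of feature data the model must process for the same clip.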
Most feature extractors allow the definition of a range of frequencies to represent in the extracted features. Often a range can be defined to focus the feature on the most relevant frequencies. For example, removing the bottom 200 Hz may be useful when there is irrelevant background noise (e.g., ventilation systems) below 200 Hz.
Generally, each feature extractor allows the specification of the feature length (height of the spectrogram). A larger feature length provides finer granularity of the feature (usually some form of frequency space). A minimum feature length of 32 is generally recommended, with some model definitions using as many as 128. Longer features increase memory requirements and extend training time. Additionally, large feature vectors give models a high degree of freedom, which may cause training to overfit if the training data set is small.
Some features allow the computed feature vector to be "normalized" to a fixed range. This may be useful when volume changes (power of the frequency spectrogram) are less important, and it can make the model more resilient to such changes.
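Why normalization helps with volume changes can be seen in a small sketch. It assumes min-max scaling to the range [0, 1], which is only one possible way to normalize and is not necessarily what the toolkit's feature extractors do; the function name is hypothetical.

```javascript
// Min-max normalization of a feature vector to the range [0, 1].
// This is an assumed normalization scheme, used here for illustration.
function normalizeToUnitRange(feature) {
    var min = Math.min.apply(null, feature);
    var max = Math.max.apply(null, feature);
    var range = max - min;
    return feature.map(function (v) {
        // A constant vector has no range; map it to all zeros.
        return range === 0 ? 0 : (v - min) / range;
    });
}

var quiet = [1, 2, 4];
// The same sound at a higher volume: every value scaled by 10.
var loud = quiet.map(function (v) { return v * 10; });
console.log(normalizeToUnitRange(quiet));
console.log(normalizeToUnitRange(loud));
```

Both calls print the same normalized vector, so a model trained on normalized features sees the quiet and the loud version of the sound as identical.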
Some features optionally allow taking the logarithm of the feature vector values. This can be important if the range of powers is very wide, and it can help de-emphasize the high-power frequencies.
The sampling rate generally does NOT need to be tuned but should be set to match the sampling rate of your sounds. The default is 44,100 Hz.
Feature processing, an optional step, allows manipulation of the computed spectrogram prior to training on or classifying it. The DeltaFeatureProcessor is the primary processor available and computes deltas through time (aka velocity and acceleration). This can be useful for capturing how the sounds change over neighboring sub-windows. It is generally not tuned, only enabled or disabled, although control is available over which deltas are computed and their relative weights. A common configuration is DeltaFeatureProcessor(2, [1,1,1]), which uses a time window of width 5 and applies equal weights to each of a) the original spectrogram, b) the velocity (1st time derivative) and c) the acceleration (2nd time derivative). A wider differencing window may be recommended when using overlapping sub-windows.
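As an illustration of what a delta through time is, the sketch below applies the regression-style delta formula commonly used in speech processing to a single feature band. Whether DeltaFeatureProcessor uses this exact formula is an assumption; the function name, the edge handling (clamping to the first/last value) and the reading of the first argument as a half-width of 2 (window width 5) are all illustrative.

```javascript
// Regression-style delta of a 1-D time series using halfWidth points on
// each side (halfWidth >= 1). Indices beyond the ends are clamped.
// This is an assumed formula, not taken from the toolkit source.
function computeDeltas(series, halfWidth) {
    var denominator = 0;
    for (var n = 1; n <= halfWidth; n++) {
        denominator += 2 * n * n;
    }
    return series.map(function (_, t) {
        var numerator = 0;
        for (var n = 1; n <= halfWidth; n++) {
            var ahead = series[Math.min(series.length - 1, t + n)];
            var behind = series[Math.max(0, t - n)];
            numerator += n * (ahead - behind);
        }
        return numerator / denominator;
    });
}

// One feature band observed over 7 sub-windows, rising steadily.
var band = [0, 1, 2, 3, 4, 5, 6];
var velocity = computeDeltas(band, 2);         // 1st time derivative
var acceleration = computeDeltas(velocity, 2); // 2nd time derivative
console.log(velocity[3], acceleration[3]); // 1 0
```

In the middle of the steadily rising band the velocity is 1 per sub-window and the acceleration is 0, which is the kind of change-over-time information the original spectrogram alone does not expose.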
A number of models, both shallow and deep, are available. In general, if you have a small amount of data (less than 2-3 minutes per label value), then a shallow model is likely best. Shallow models also tend to work best with a highly compressed feature such as MFCC. If you have a large amount of data (10 or more minutes per label value), then a deep model such as CNN or DCASE may give the best results. With deep models, a less compressed feature vector such as MFFB or FFT may be best. The parameterization of these models is briefly discussed below.
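The rule of thumb above can be written down as a small helper for picking a starting point. This is only a sketch: the function name is hypothetical, the lower threshold uses the upper end of the "2-3 minutes" guideline, and the advice for the range in between is an assumption, since the guideline does not cover it.

```javascript
// Hypothetical helper encoding the rule of thumb for a starting point.
function suggestStartingPoint(minutesPerLabel) {
    if (minutesPerLabel < 3) {
        return { models: "shallow (GMM, LpNN)", feature: "MFCC" };
    }
    if (minutesPerLabel >= 10) {
        return { models: "deep (CNN, DCASE)", feature: "MFFB or FFT" };
    }
    // Assumption: between the two thresholds, compare both families.
    return { models: "compare shallow and deep", feature: "compare MFCC, MFFB and FFT" };
}

console.log(suggestStartingPoint(2));
console.log(suggestStartingPoint(30));
```

A suggestion like this is only a place to begin; comparing candidates on your own data remains the more reliable way to choose.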
Briefly, available parameters include the following and can generally be set with the GMMClassifierBuilder:
- The number of Gaussians per label to use. Default is 8.
- Whether or not to use only the diagonal of the covariance matrix of the features.
Briefly, available parameters include the following and can generally be set with the LpDistanceMergeNNClassifierBuilder:
- p distance parameter. Default is 0.5
- Maximum number of features to memorize in the model before merging of features takes place. Default is 1000.
- Outlier detection can be enabled/disabled. Default is enabled.
- Standard deviation multiplier used to control outlier detection. Default is 3.
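To show what the p distance parameter changes, here is a minimal sketch of the general Lp (Minkowski-style) distance between two feature vectors. It is written in plain JavaScript for illustration and is not the toolkit's implementation; the function name is hypothetical, and whether the toolkit applies the final 1/p root is an assumption.

```javascript
// Lp distance between two equal-length feature vectors:
// (sum over i of |a[i] - b[i]|^p)^(1/p)
function lpDistance(a, b, p) {
    if (a.length !== b.length) {
        throw new Error("feature vectors must have equal length");
    }
    var sum = 0;
    for (var i = 0; i < a.length; i++) {
        sum += Math.pow(Math.abs(a[i] - b[i]), p);
    }
    return Math.pow(sum, 1 / p);
}

console.log(lpDistance([0, 0], [3, 4], 2));   // 5 (Euclidean)
console.log(lpDistance([0, 0], [3, 4], 1));   // 7 (Manhattan)
console.log(lpDistance([0, 0], [3, 4], 0.5)); // larger than both
```

With p below 1, such as the default of 0.5, a difference spread across many dimensions yields a larger distance than the same total difference concentrated in one dimension.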
Briefly, available parameters include the following and can generally be set with the CNNClassifierBuilder or DCASEClassifierBuilder:
- Batch size. Defaults to 32.
- Early stopping enabled/disabled. Default is true.
- Number of training epochs. Default is 75.
- Number of folds, such that N-1 are used for training and 1 is used for validation. Default is 5 (20% used for validation).
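The fold arithmetic can be sketched as follows. The round-robin assignment of items to folds is an assumption made for illustration (the toolkit may shuffle or stratify), and the function name is hypothetical.

```javascript
// Split item indices into numFolds folds; the fold numbered
// validationFold is held out and the remaining folds are used for training.
function splitFolds(itemCount, numFolds, validationFold) {
    var training = [];
    var validation = [];
    for (var i = 0; i < itemCount; i++) {
        if (i % numFolds === validationFold) {
            validation.push(i);
        } else {
            training.push(i);
        }
    }
    return { training: training, validation: validation };
}

// With the default of 5 folds, 1 fold in 5 (20%) is used for validation.
var split = splitFolds(100, 5, 0);
console.log(split.training.length, split.validation.length); // 80 20
```

Increasing the number of folds leaves more data for training but makes the validation estimate depend on fewer samples.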
The evaluate CLI with its -models option can be used to perform k-fold cross validation to identify the best performing classifier/featuregram extractor combination among a given set. For example,
evaluate -models grid.js -sounds mysounds -label status
where grid.js must set a variable named classifiers with a map of IClassifier implementations. The keys of the map are user-defined and help identify the classifier in the ranked list produced. A simple file might look as follows:
var classifiers = {
    "gmm" : new GMMClassifierBuilder().build(),
    "lpnn" : new LpDistanceMergeNNClassifierBuilder().build()
}
The above creates the default GMM and LpNN classifiers for comparison across a given data set. You will likely need to consult the Javadoc to see how these and other classes are instantiated.
NOTE: it is strongly recommended to use builders to create the classifiers instead of calling the classifier's constructor directly (e.g., new GMMClassifier()).
A more rigorous classifier list might be created as follows:
// Define a set of sub-window sizes (in msec) on which the features will be computed.
var windowSizes = [40, 80, 120, 240];
// Define the sub-window hop size as a fraction of the sub-window size.
var windowHopPercent = [0.5, 1];
// Define the set of feature extractors.
var featureExtractors = [
    new FFTFeatureExtractor(),
    new MFCCFeatureExtractor(),
    new MFFBFeatureExtractor()
];
// Define the optional feature processors.
var featureProcessors = [
    null,
    new DeltaFeatureProcessor(2, [1,1,1])
];
// The set of classifier algorithms to test.
var algorithms = [
    new GMMClassifierBuilder(),
    new LpDistanceMergeNNClassifierBuilder()
];
// The named variable that exports the models to the evaluate CLI.
var classifiers = {};
// Simply loop over all parameters to create all combinations of classifier/features.
for (var algIndex = 0; algIndex < algorithms.length; algIndex++) {
    var alg = algorithms[algIndex];
    for (var wsizeIndex = 0; wsizeIndex < windowSizes.length; wsizeIndex++) {
        var wsize = windowSizes[wsizeIndex];
        for (var hopIndex = 0; hopIndex < windowHopPercent.length; hopIndex++) {
            // Hop size in msec: the selected fraction of the sub-window size.
            var hop = windowHopPercent[hopIndex] * wsize;
            for (var feIndex = 0; feIndex < featureExtractors.length; feIndex++) {
                var fe = featureExtractors[feIndex];
                for (var fpIndex = 0; fpIndex < featureProcessors.length; fpIndex++) {
                    var fp = featureProcessors[fpIndex];
                    var fge = new FeatureGramExtractor(wsize, hop, fe, fp);
                    alg.setFeatureGramExtractor(fge);
                    var c = alg.build();
                    // The key is used to identify the classifier/feature combination.
                    var key = "algIndex=" + algIndex + ",wsizeIndex=" + wsizeIndex + ",hopIndex=" + hopIndex +
                        ",feIndex=" + feIndex + ",fpIndex=" + fpIndex;
                    // Add this classifier to the list to be evaluated.
                    classifiers[key] = c;
                }
            }
        }
    }
}
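Because every added parameter value multiplies the number of models to evaluate, it can help to estimate the size of a grid before running it. The sketch below does this in plain JavaScript for the grid above; the helper name is hypothetical, and the fold count of 5 is an assumed value for illustration, not a documented default of the evaluate CLI.

```javascript
// Number of candidate values for each tuned parameter in the grid above.
var gridSizes = {
    algorithms: 2,
    windowSizes: 4,
    windowHopPercent: 2,
    featureExtractors: 3,
    featureProcessors: 2
};

// The total number of combinations is the product of all the sizes.
function countCombinations(sizes) {
    return Object.keys(sizes).reduce(function (total, name) {
        return total * sizes[name];
    }, 1);
}

var combinations = countCombinations(gridSizes);
console.log(combinations); // 96

// With k-fold cross validation each combination is trained once per fold.
var assumedFolds = 5; // assumption, for illustration only
console.log(combinations * assumedFolds); // 480
```

If the estimate is too large for the time available, trimming the least promising values from one list reduces the total by a whole factor.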