
Memory Considerations

(Under Construction)

CPU Training study

If we look at the minimum heap size required to train the default GMM model on the CPU with 48 kHz sounds and a 64-element feature vector (MFFB), we can see that it requires about 70 MB plus about 100 MB per GB of training data.

This amount of memory can be affected by the feature length and any additional memory the model itself may require. For example, adding a DeltaFeatureProcessor with both velocity and acceleration will triple the size of the feature vector and will likely triple the slope, since the slope is primarily due to the size of the features computed for the training data.
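As a rough planning aid, those measurements can be folded into a simple estimate. The sketch below is purely illustrative (the class and method names are made up for this example); the 70 MB base and 100 MB/GB slope come from the study above, and the feature-size multiplier is an assumption you would adjust for your own feature configuration.

```java
/**
 * Back-of-the-envelope heap estimate for CPU training, based on the
 * measurements above: ~70 MB fixed cost plus ~100 MB per GB of training data.
 * The featureSizeMultiplier is an assumption for configurations that enlarge the
 * feature vector (e.g. ~3 when adding velocity and acceleration deltas).
 */
public class HeapEstimate {

    static long estimatedHeapBytes(double trainingDataGB, double featureSizeMultiplier) {
        long baseMB = 70;                                    // fixed overhead observed in the study
        double perGBSlopeMB = 100 * featureSizeMultiplier;   // slope scales with feature size
        return (long) ((baseMB + perGBSlopeMB * trainingDataGB) * 1024 * 1024);
    }

    public static void main(String[] args) {
        // 10 GB of 48 kHz sounds, default 64-element MFFB features.
        System.out.printf("Plain MFFB:  %,d bytes%n", estimatedHeapBytes(10, 1.0));
        // Same data with velocity + acceleration deltas (~3x larger features).
        System.out.printf("With deltas: %,d bytes%n", estimatedHeapBytes(10, 3.0));
    }
}
```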

GPU Training

GPUs can be used by the CNN and DCASE models, both based on the [Deep Learning For Java (DL4J)](https://deeplearning4j.org/) library. In addition to the heap size, DL4J recommends setting the following:

  • -Dorg.bytedeco.javacpp.maxbytes
  • -Dorg.bytedeco.javacpp.maxphysicalbytes

We have found that maxphysicalbytes needs to be set to the sum of the maximum Java heap size and the maxbytes setting. For example:

export JAVA_OPTIONS="-Xmx32g -Dorg.bytedeco.javacpp.maxbytes=16g -Dorg.bytedeco.javacpp.maxphysicalbytes=48g"
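As a sanity check on these settings, something like the following can verify that maxphysicalbytes is at least the heap plus maxbytes. This is only a sketch using standard Java; it parses the property strings itself rather than relying on any JavaCPP API, and the class name is made up for the example.

```java
/** Checks that -Dorg.bytedeco.javacpp.maxphysicalbytes >= -Xmx + maxbytes. */
public class JavaCppMemoryCheck {

    /** Parses sizes like "16g", "512m" or plain byte counts. */
    static long parseBytes(String s) {
        if (s == null || s.isEmpty()) return 0;
        char unit = Character.toLowerCase(s.charAt(s.length() - 1));
        long scale = unit == 'g' ? 1L << 30 : unit == 'm' ? 1L << 20 : unit == 'k' ? 1L << 10 : 1;
        String digits = Character.isDigit(unit) ? s : s.substring(0, s.length() - 1);
        return Long.parseLong(digits) * scale;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory();   // approximately the -Xmx value
        long maxBytes = parseBytes(System.getProperty("org.bytedeco.javacpp.maxbytes"));
        long maxPhysical = parseBytes(System.getProperty("org.bytedeco.javacpp.maxphysicalbytes"));
        if (maxPhysical < heap + maxBytes)
            System.err.println("maxphysicalbytes should be at least heap + maxbytes");
        else
            System.out.println("JavaCPP memory settings look consistent");
    }
}
```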

Memory details

Minimum memory requirements are defined by the following:

  • BS - the size of the batch of windows and their sub-windows on which features are computed.
  • FS - the size of all features computed from the training data.
  • Sm - sub-window multiplier. For rolling windows this is 1 because we duplicate the data when computing sub-windows (sadly). If using sliding windows with 50% overlap this becomes 2, as the number of sub-windows is doubled. 25% sliding windows would give 4.
  • sizeof(double) - which is 8.

In addition to Sm, BS depends on the following:

  • Nb - The number of training samples within a batch. The batch size is configurable with the feature.extraction.batched.threads AISP property.
  • Cl - the clip length of each sample
  • Sr - the sampling rate

So that

  • BS = (Sm+1) * Nb * Cl * Sr * sizeof(double)

FS is dependent on feature extraction as follows:

  • Sw - sub-window size in msec.
  • Fl - the length of each feature vector
  • T - the total length of the training data in seconds.
  • Nw - the number of sub-windows per second = Sm * 1000 / Sw.

So that

  • FS = Fl * Nw * T * sizeof(double)
  • = Fl * Sm * 1000/Sw * T * sizeof(double)

For the following typical values:

  • Nb = 16
  • Cl = 5
  • Sr = 44100
  • Sw = 40
  • Fl = 64
  • Sm = 2

then for 1 hour of training data (T = 3600 seconds),

  • BS = 3 * 16 * 5 * 44100 * 8 = 80.74 MB
  • FS = 64 * 2 * 1000/40 * 1*3600 * 8 = 87.89 MB

That is about 80 MB plus 90 MB per hour of training data. In fact, we have shown that 50 minutes of training data (~1 GB) can be trained with a 256 MB heap.
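The same arithmetic can be written out as a small Java program so the constants can be swapped for your own configuration. This is just the worked example above in code form, not part of the library.

```java
/** Reproduces the batch-size (BS) and feature-size (FS) estimates above. */
public class MemoryEstimate {

    static final int SIZEOF_DOUBLE = 8;

    // BS = (Sm + 1) * Nb * Cl * Sr * sizeof(double)
    static double batchBytes(int sm, int nb, double clipSec, int sampleRate) {
        return (sm + 1.0) * nb * clipSec * sampleRate * SIZEOF_DOUBLE;
    }

    // FS = Fl * (Sm * 1000 / Sw) * T * sizeof(double), with T in seconds
    static double featureBytes(int fl, int sm, double subWindowMsec, double totalSec) {
        return fl * (sm * 1000.0 / subWindowMsec) * totalSec * SIZEOF_DOUBLE;
    }

    public static void main(String[] args) {
        double mb = 1024.0 * 1024.0;
        // Typical values from the text: Sm=2, Nb=16, Cl=5, Sr=44100, Sw=40, Fl=64, T=1 hour
        System.out.printf("BS = %.2f MB%n", batchBytes(2, 16, 5, 44100) / mb);   // ~80.7 MB
        System.out.printf("FS = %.2f MB%n", featureBytes(64, 2, 40, 3600) / mb); // ~87.9 MB
    }
}
```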

The Long Answer

The Java training pipeline is designed to promote streaming of data - both the source data (i.e. sound, vibration, etc.) and the features and feature grams computed as input to the model training step. To follow the discussion below, one should understand the steps involved in training a model, from data ingestion through feature computation to a trained model. Roughly speaking, they are as follows:

  1. Provide streamed access to the raw training data. This is typically from the file system (i.e. .wav files) or the Mongo database (i.e. serialized SoundClips created from .wav files).
  2. Compute the feature/spectrogram(s) defined by the FeatureGramExtractor(s) for the classifier. This can be broken down into the following steps:
    1. Breaking the window up into the sub-windows from which individual features will be computed.
    2. Computation of the features for all sub-windows of all sound samples.
    3. Application of a feature gram processor (i.e., FeatureProcessor), if included.
  3. Training of the model on the computed feature grams. Depending on the model (e.g., CNN, DCASE), this may involve learning on batches of feature grams.

In general, the input to each of the steps above is streamed data (i.e. a Java Iterable), so that not all the data needs to be in memory at one time; instead it is brought into memory as it is accessed. In addition to the streaming data, intermediate results such as raw data, sub-windows, individual features, and feature grams are cached in case they are needed in subsequent iterations or by other models (an Ensemble may contain models that use the same feature extraction, for example).

Caching is done using a memory-based cache and, optionally, a Mongo-based disk cache for feature grams. The memory cache holds intermediates from all of the steps above. It will generally grow to fill the Java heap, and when memory is needed the GC will evict items from the cache. The cache can help avoid rereading a .wav file or recomputing sub-windows or features when multiple iterations are made over the data (e.g., an Ensemble, or the epochs of a neural network). That said, there is no prioritization of items in the cache, so sub-windows may be evicted before their source windows.
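The behavior described above - a cache that fills the heap and is trimmed by the GC under memory pressure - is the kind of thing a SoftReference-backed map provides. The sketch below is not the actual AISP cache implementation, just a minimal illustration of a GC-evictable cache; it also illustrates why there is no eviction priority, since the GC decides which soft references to clear.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal GC-evictable cache: values are held through SoftReferences, so the
 * JVM may clear them when heap space is needed, as described above.
 */
public class SoftCache<K, V> {

    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never cached or has been evicted by the GC. */
    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (ref != null && value == null)
            map.remove(key, ref);   // drop entries whose referent was collected
        return value;
    }
}
```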

So, if you can afford a heap size (set with the -Xms and -Xmx options) large enough to hold all items in the cache without eviction, that is best. As a back-of-the-envelope computation, this is something like twice the size of the training data in memory (the factor of 2 coming from computing sub-windows), as follows:

1 hour * 2 * 60 min/hour * 60 sec/min * 44100 samples/second * sizeof(double)

or about 2.4 GB per hour of training data. If you are using sliding, overlapping sub-windows, this number will go up due to more data duplication. So training on 50 hours (~100 GB) of data on a 16 GB machine will be evicting items from the cache. This may not be a problem once all your feature grams are computed, as they are generally held by hard references during training and so cannot be evicted.

However, this hard referencing of features places additional requirements on the minimum heap size. As a ballpark for feature extraction, consider 50 msec sub-windows, 50% sliding windows, and a 64-element feature vector, which gives the following:

1 hour * 60 min/hour * 60 sec/min * (1000/50 * 2) sub-windows/sec * 64 * sizeof(double)

or about 70 MB per hour of training data. So features take up about 3% of the size of the training data. This might double or triple if you include a DeltaFeatureProcessor that adds velocity and acceleration feature grams, so perhaps 10% as a rough upper bound.
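Both back-of-the-envelope numbers can be reproduced with a few lines of code. This is purely illustrative and uses the same assumptions as above: 44.1 kHz samples stored as doubles, a duplication factor of 2 for sub-windows, 50 msec sub-windows with 50% sliding, and 64-element feature vectors.

```java
/** Rough per-hour memory estimates for cached raw data and computed features. */
public class PerHourEstimate {

    static final int SIZEOF_DOUBLE = 8;

    public static void main(String[] args) {
        double seconds = 1.0 * 3600;   // one hour of training data

        // Raw data plus duplicated sub-windows (factor of 2), 44.1 kHz doubles: ~2.4 GB/hour
        double rawBytes = seconds * 2 * 44100 * SIZEOF_DOUBLE;

        // 50 msec sub-windows with 50% sliding (1000/50 * 2 = 40 sub-windows/sec),
        // 64-element feature vectors: ~70 MB/hour
        double featureBytes = seconds * (1000.0 / 50 * 2) * 64 * SIZEOF_DOUBLE;

        System.out.printf("Raw + sub-windows: %.2f GB/hour%n", rawBytes / (1L << 30));
        System.out.printf("Features:          %.2f MB/hour%n", featureBytes / (1L << 20));
    }
}
```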

So again, if you can afford a heap that can hold all of your data and features, do that. But if you need to train with a limited heap, there are some points that come into play, specifically around the batching of computations and thread counts.

First, features are computed from batches of data windows. This forces the full batch of data windows to be in memory at once and thus sets a lower bound on the amount of required memory. Models such as CNN and DCASE that train on batches of features also require data to be held in memory, but this is a lesser effect due to the smaller size of the features. The batch size for feature computation is defined by the feature.iterable.batch_size property, which defaults to 8 * the number of cores. For 1-second windows this works out to:

1 second/window * (8 * number-of-cores) windows * 2 * 44100 samples/sec * sizeof(double)
= about 5.5 MB/core/second of window length
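For a given machine, this per-batch cost can be computed directly. The sketch below is only illustrative; it assumes 1-second windows, the factor of 2 used above for sub-window duplication, 44.1 kHz samples stored as doubles, and the default batch size of 8 windows per core.

```java
/** Estimates memory needed to hold one batch of windows during feature extraction. */
public class FeatureBatchEstimate {

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int batchSize = 8 * cores;       // default feature.iterable.batch_size
        double windowSec = 1.0;          // clip/window length in seconds
        // ~5.5 MB per core per second of window length, as above
        double bytes = windowSec * batchSize * 2 * 44100 * 8;

        System.out.printf("%d cores -> batch of %d windows -> %.1f MB%n",
                cores, batchSize, bytes / (1 << 20));
    }
}
```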

Secondly, each thread created has a memory cost in its initial stack size (settable with the -Xss JVM option). We have found that if too many threads are created when implementing the parallelism for batched feature extraction, we can get OOM exceptions.

With this fairly high-level view of the process we can now understand the impact of some of the JVM and AISPProperties configurations that are available.
