Hello. My name is John Cardente. I am a member of the Dell Storage CTO team. Today, I will be discussing storage requirements for AI workloads with particular focus on performance for model training. Please use the chat feature to submit any questions or comments.
The AI boom is undeniable. Many companies are racing to build out AI data centers to deliver advanced product features and streamline operations. Modern deep learning AI models require millions of matrix operations. Parallelism is needed to make these calculations computationally feasible within reasonable amounts of time. GPUs are designed to perform parallel matrix operations extremely quickly and cost effectively. They are the technology making advanced AI computationally and economically feasible. The race to build out massive AI data centers is causing GPU demand to outstrip supply, leading to scarcity and high cost. As a result, companies must maximize the use of the GPUs they have. Maximizing GPU utilization is becoming the main AI data center design goal.
Designing data centers to maximize GPU utilization requires a balanced end-to-end architecture to avoid performance bottlenecks. While considerable attention is often given to high performance GPU servers and the substantial east-west networks needed for GPU to GPU communication, storage plays an equally critical role as data is the fuel for AI.
Storage is involved across the entire AI lifecycle. It begins with data preparation, where raw data is transformed, often with distributed processing frameworks, into a form suitable for AI model training and batch inference. Storing and providing performant access to unstructured and structured data in open data formats is critical. Simultaneous file and object access avoids extraneous data copies when intermixing analytics and AI software stacks that prefer different storage protocols. After the data is suitably transformed, it is used to train or tune AI models. Substantial storage read bandwidth may be required to keep GPUs fed with training data. Similarly, significant storage write bandwidth may be needed to quickly save model checkpoints during training sessions lasting days or weeks. Better understanding these performance requirements is the focus of this talk. Once trained, AI models are used to generate inferences from new data. Deploying inference services requires quickly reading model artifacts from storage. Batch inferencing can demand large read bandwidths to feed data to GPUs.
Having a basic understanding of AI model training will be helpful for this discussion. A training data set is a collection of examples, each containing a set of model inputs, called features, and the associated expected ground truth inference output. The training data set may be stored as one file or a collection of files. Training data is read from storage and packaged into randomized batches containing a relatively small number of training examples. These batches of examples are run through the model to generate inference outputs, which are compared to the ground truth to calculate a loss score reflecting how well the model did. The loss score is then used to update the model's weights, also called parameters, so that the model's accuracy improves. This process is repeated many times until the model converges to a stable level of accuracy. Multiple passes over the training data may be needed to achieve this. Each pass is called an epoch. And for small training data sets, the data may be cached in server memory after the first epoch. During training, the AI model state is periodically saved or checkpointed to guard against failure.
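To make that loop concrete, here is a minimal PyTorch-style sketch of the process just described; the model, dataset, and hyperparameters are illustrative placeholders rather than anything specific from this talk.

```python
# Minimal sketch of the training loop described above (PyTorch-style).
# Model, dataset, and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, batch_size=256, ckpt_path="checkpoint.pt"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # randomized batches
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):              # multiple passes (epochs) over the training data
        for features, labels in loader:      # batches read from storage (or server memory cache)
            outputs = model(features)        # forward pass: generate inference outputs
            loss = loss_fn(outputs, labels)  # compare to ground truth to compute the loss score
            optimizer.zero_grad()
            loss.backward()                  # compute gradients
            optimizer.step()                 # update model weights to improve accuracy
        # periodically save (checkpoint) model state to guard against failure
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, ckpt_path)
```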
Limited information exists on storage read performance requirements for training. If the goal is to keep GPUs busy, a reasonable approach is to use AI GPU benchmarks to determine peak performance for a variety of models and then work backwards to determine the storage read performance needed to sustain that amount of work. The MLCommons MLPerf Training benchmark is ideal for this purpose. It's designed to saturate GPUs while training a variety of popular AI models representing different use cases. Results from multiple sources are publicly available for analysis.
This table shows results from the MLPerf Training version 3.0 submissions for NVIDIA H100 80-gigabyte GPUs, the top-performing GPU at the time. The first three columns show the models that were trained, their sizes in parameters, and details about their associated training data sets. The fourth column provides the size in bytes of each individual training example fed to the model during training. The fifth and sixth columns specify the number of GPUs and the throughput they were able to achieve in terms of training examples per second. Eight-GPU results are shown to represent the typical configuration of high-end AI servers. For GPT-3, 32-GPU results are shown because a single model instance required that many GPUs. The final column estimates the amount of storage read bandwidth needed to sustain the associated level of GPU throughput. This reflects storage read performance requirements while training data is being read during the first epoch, and during subsequent epochs if the data set is too large to fit in memory. The estimated storage read bandwidths vary widely across the different models. For example, GPT-3 only needs around 150 megabytes per second to keep 32 H100 GPUs busy. This is because GPT-3 is extremely compute bound and its tokenized text training examples are relatively small at 8 kilobytes each. Conversely, eight H100 GPUs training 3D U-Net can require over 40 gigabytes per second to stay highly utilized. In this case, 3D U-Net is much less compute bound and its training example images are very large, about 92 megabytes each, hence the higher read bandwidth needed to keep up with the GPUs. ResNet-50 falls somewhere in the middle, requiring approximately 6.1 gigabytes per second to keep eight H100 GPUs fully utilized. These examples show the trade-offs between model computational complexity and training example size that must be considered when estimating requirements. Remember, big models don't always need lots of storage read bandwidth.
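The read bandwidth estimates in the final column follow from a simple calculation: multiply the achieved training throughput, in examples per second, by the size of each training example. Here is a rough sketch of that arithmetic; the throughput figures are back-derived from the bandwidths and example sizes quoted above, so they are approximations for illustration only.

```python
# Back-of-the-envelope storage read bandwidth: examples per second times bytes per example.
# Throughput values are approximations back-derived from the bandwidths quoted above.
def read_bandwidth_gbps(examples_per_sec, example_bytes):
    return examples_per_sec * example_bytes / 1e9   # GB/s

print(read_bandwidth_gbps(18_750, 8_000))        # GPT-3: ~8 KB examples   -> ~0.15 GB/s
print(read_bandwidth_gbps(440, 92_000_000))      # 3D U-Net: ~92 MB examples -> ~40 GB/s
print(read_bandwidth_gbps(40_000, 150_000))      # ResNet-50: ~150 KB examples -> ~6 GB/s
```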
Storage system performance often depends on the way in which data is accessed. As an example, we'll look at the I/O access patterns generated while simulating the training of a ResNet-50 image classification model using the DLIO benchmark. DLIO is an AI storage benchmark that emulates GPUs but aims to preserve all other aspects of AI model training, including using real deep learning framework data loaders and training data file formats. DLIO is designed to generate stressful AI workloads to measure storage performance without requiring GPUs. In this case, the training data set is a collection of TFRecord files, each containing 1,024 150-kilobyte image tensors. The training data will be read using TensorFlow's data loader library.
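For illustration, here is a minimal tf.data sketch of the kind of TFRecord input pipeline being exercised; the file pattern and feature schema are assumptions made for the example, not details taken from the DLIO configuration.

```python
# Minimal sketch of a TFRecord input pipeline like the one emulated here.
# The file pattern and feature schema are illustrative assumptions.
import tensorflow as tf

def make_dataset(file_pattern="train/*.tfrecord", batch_size=256):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = tf.data.TFRecordDataset(files)                   # sequential reads within each file
    ds = ds.shuffle(buffer_size=10_000)                   # randomize training examples
    ds = ds.map(
        lambda rec: tf.io.parse_single_example(
            rec, {"image": tf.io.FixedLenFeature([], tf.string),
                  "label": tf.io.FixedLenFeature([], tf.int64)}),
        num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)  # overlap reads with training
    return ds
```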
I/O traces were collected while the training data was read from an NFS share. The graph at top left shows a steady stream of 64 to 256-kilobyte I/Os that continuously read the training data during model training. The chart at right shows that the distribution of I/O sizes didn't materially change as the number of training examples per batch was varied. The graph at bottom left shows that, in this case, the training files were read sequentially one after the other. Therefore, using the prior estimate for this workload, an AI storage system may need to deliver 6.1 gigabytes per second of read bandwidth as a sequential read stream of 64 to 256-kilobyte I/Os to achieve full GPU utilization while training ResNet-50. However, other models may exhibit a more random I/O access pattern using different I/O sizes. The implication is that AI storage systems need to perform well for a variety of access patterns.
Reading the same ResNet-50 training data over S3 results in a very different I/O profile. The chart at top left shows that much larger 20 to 50 megabyte I/Os are used to read the training data. The chart at top right shows that, once again, the I/O size distribution doesn't change with batch size. The chart at bottom left shows that the files are still accessed sequentially in turn. The performance implications of this significantly different I/O access pattern depend on the particular object storage and S3 software library used. While throughput is a concern for both NFS and S3, latency may be an additional consideration when using S3 as training cannot start or continue until these large I/Os complete.
Another consideration when choosing between NFS and S3 is the benefit of the OS page cache when training multiple models within a single server. In this case, the individual model instances may all read the same training data. When the training data is accessed over NFS, the OS page cache may satisfy duplicate I/O requests from the different models and avoid extra storage read accesses. Because S3 bypasses the OS page cache, there is no server-side caching, and duplicate read I/Os may be sent to storage, increasing storage read performance requirements. This example shows that the storage protocol used to read training data is another important consideration when choosing AI storage solutions.
Let's now talk about checkpoints. Large AI models may take days or weeks to train. During that time, model weights are constantly changing as training data is processed. The monetary value of these weights increases over time due to the computational resources consumed to produce them. Checkpoints protect against losing this valuable data by periodically saving the current model weights and other state to persistent storage. Checkpoints are typically saved as one or more files, each sequentially written by a single writer. The number of files depends on the degree of model parallelism and how a particular model may implement checkpoints. When using data parallelism, only a single model instance needs to be saved and not the memory content of all the GPUs in the system. Since training is typically paused during checkpointing, quickly saving checkpoints is a key storage performance requirement.
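As an illustration of that single-writer pattern under data parallelism, here is a minimal sketch, assuming a PyTorch-style distributed setup, in which only one replica writes the checkpoint file while the others wait; the path and framework details are assumptions.

```python
# Sketch: with data parallelism, only one replica (rank 0) writes the checkpoint,
# since every replica holds identical weights. Setup details are assumptions.
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="/mnt/checkpoints/ckpt.pt"):
    if not dist.is_initialized() or dist.get_rank() == 0:
        # single writer, sequentially writing one checkpoint file
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
    if dist.is_initialized():
        dist.barrier()   # training stays paused until the checkpoint completes
```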
This table estimates the aggregate write bandwidth required to complete checkpoints of different sizes within different time constraints. The first column at left specifies different model sizes in terms of the number of parameters. The second column estimates the total checkpoint size using a common rule of thumb of 14 bytes per parameter, the model weights plus optimizer state. The remaining columns provide estimated aggregate write bandwidths required to complete the checkpoints within different time constraints expressed as a percentage of an assumed two-hour checkpoint interval. For example, a 175-billion-parameter model produces 2.4 terabytes of checkpoint data. Saving that checkpoint within 360 seconds, or 5% of the two-hour checkpoint interval, requires 6.8 gigabytes per second of aggregate write bandwidth. If accomplished, this means 95% of the two hours will be available for model training. These estimates show that storage write bandwidth requirements can vary greatly based on model size and time limits. Large models with tight checkpoint time limits can require significant write bandwidth. However, the requirements drop significantly when models are smaller or more time is allowed to checkpoint. Understanding these factors is important while assessing the cost trade-offs between storage solutions and idle GPUs.
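The write bandwidth figures follow from dividing the estimated checkpoint size by the allowed save window. A quick sketch of the arithmetic:

```python
# Checkpoint write bandwidth estimate: ~14 bytes per parameter (weights plus
# optimizer state) written within a fraction of the checkpoint interval.
def checkpoint_write_bandwidth_gbps(params, interval_s=2 * 3600, fraction=0.05,
                                    bytes_per_param=14):
    ckpt_bytes = params * bytes_per_param              # ~2.45 TB for 175B parameters
    return ckpt_bytes / (interval_s * fraction) / 1e9  # GB/s

# 175 billion parameters saved within 5% of a two-hour interval (360 seconds)
# requires roughly 6.8 GB/s of aggregate write bandwidth.
print(checkpoint_write_bandwidth_gbps(175e9))
```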
Checkpoints enable the resumption of model training after an unplanned or planned interruption. This requires restoring checkpoint data to all GPUs involved in training. Each checkpoint file is read sequentially by one or more readers based on the number of model instances being restored. For example, the diagram at left shows one checkpoint file being loaded into three GPUs, each holding the same portion of the model across three instances. The other checkpoint files would be similarly read three times to restore the remaining GPUs. Checkpoint files are typically read in parallel, resulting in a larger number of concurrent sequential read streams to multiple files. Since model training is unable to resume until the checkpoint has been restored to all GPUs, it is desirable to complete the restore quickly.
This table estimates the aggregate storage read bandwidth required to restore checkpoints of various sizes within five minutes for different numbers of model instances. The five minute time limit is just an example. Actual checkpoint restore time limits depend on how often restores are needed and the expectations of AI engineers. As before, the first two columns show different model sizes and their associated checkpoint sizes. The remaining columns show the aggregate read bandwidth estimates. For example, restoring a 175 billion parameter model to 16 model instances within five minutes requires 2.18 gigabytes per second of read bandwidth to read the 2.4 terabyte checkpoint 16 times. Here again, requirements vary greatly depending on model size, training parallelism, and the amount of time allowed to restore checkpoints.
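For completeness, here is a rough sketch of the restore-side arithmetic, assuming every model instance reads the full checkpoint from storage within the time limit; the model size and instance count in the example are hypothetical, not values from the table.

```python
# Rough estimate of aggregate restore read bandwidth, assuming each model instance
# reads the entire checkpoint from storage within the time limit.
def restore_read_bandwidth_gbps(params, num_instances, time_limit_s=300,
                                bytes_per_param=14):
    ckpt_bytes = params * bytes_per_param
    return ckpt_bytes * num_instances / time_limit_s / 1e9   # GB/s

# Hypothetical example: a 70-billion-parameter model restored to 4 instances
# within five minutes needs roughly 13 GB/s of aggregate read bandwidth.
print(restore_read_bandwidth_gbps(70e9, 4))
```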
So far, this discussion has focused on singular workloads. In reality, a modern GPU cluster may be hosting a large number of AI workloads at different stages of the AI lifecycle. Distributed schedulers are used to assign jobs to servers in the cluster. Equal access to data, regardless of placement, is often assumed to simplify job deployment and to avoid extraneous data copies. This means AI storage must be able to meet the performance needs of multiple workloads with significantly different AI access patterns, all from a single namespace, and must be able to scale out as the GPU cluster grows with business needs.
To summarize, storage is involved in all stages of the AI lifecycle. Requirements vary widely across lifecycle stages, different AI model types, and AI infrastructure user expectations. Concurrent sequential and random read performance is important for feeding GPUs with model training data and restoring checkpoints. Single-threaded sequential write performance is important for quickly saving model checkpoints. Storage systems for modern AI GPU clusters must be able to handle a mixture of workloads from a single namespace with the ability to scale out capacity and performance as needed. So choosing an appropriate AI storage solution requires insight into expected workloads and service expectations to meet needs while also managing acquisition and total ownership costs.
Finally, this presentation talked about performance, which is often the focus of many AI storage conversations. However, AI also needs traditional enterprise storage capabilities, like data protection, high availability, encryption, DDoS protection, security, data lifecycle management, and more. These capabilities quickly become important once performance requirements are met. As said in the beginning, data is the fuel for AI. Protecting and managing that data is critical. AI storage systems must do more than provide high performance.
Thank you for attending this session. I hope it has helped you to better understand AI storage requirements. Please contact me at the below email with any feedback or questions, or leave those questions and comments in the chat. I hope you enjoy the rest of the conference. Thank you. Goodbye.