Hello. My name is John Cardente. I am a member of the Dell Storage CTO team. Today, I will be discussing storage requirements for AI workloads with particular focus on performance for model training. Please use the chat feature to submit any questions or comments.
The AI boom is undeniable. Many companies are racing to build out AI data centers to deliver advanced product features and streamline operations. Modern deep learning AI models require millions of matrix operations. Parallelism is needed to make these calculations computationally feasible within reasonable amounts of time. GPUs are designed to perform parallel matrix operations extremely quickly and cost effectively. They are the technology making advanced AI computationally and economically feasible. The race to build out massive AI data centers is causing GPU demand to outstrip supply, leading to scarcity and high cost. As a result, companies must maximize the use of the GPUs they have. Maximizing GPU utilization is becoming the main AI data center design goal.
Designing data centers to maximize GPU utilization requires a balanced end-to-end architecture to avoid performance bottlenecks. While considerable attention is often given to high performance GPU servers and the substantial east-west networks needed for GPU to GPU communication, storage plays an equally critical role as data is the fuel for AI.
Storage is involved across the entire AI lifecycle. It begins with data preparation, where raw data is transformed, often with distributed processing frameworks, into a form suitable for AI model training and batch inference. Storing and providing performant access to unstructured and structured data in open data formats is critical. Simultaneous file and object access avoids extraneous data copies when intermixing analytics and AI software stacks that prefer different storage protocols. After the data is suitably transformed, it is used to train or tune AI models. Substantial storage read bandwidth may be required to keep GPUs fed with training data. Similarly, significant storage write bandwidth may be needed to quickly save model checkpoints during training sessions lasting days or weeks. Better understanding these performance requirements is the focus of this talk. Once trained, AI models are used to generate inferences from new data. Deploying inference services requires quickly reading model artifacts from storage. Batch inferencing can demand large read bandwidths to feed data to GPUs.
Having a basic understanding of AI model training will be helpful for this discussion. A training data set is a collection of examples, each containing a set of model inputs, called features, and the associated expected ground truth inference output. The training data set may be stored as one file or a collection of files. Training data is read from storage and packaged into randomized batches containing a relatively small number of training examples. These batches of examples are run through the model to generate inference outputs, which are compared to the ground truth to calculate a loss score reflecting how well the model did. The loss score is then used to update the model's weights, also called parameters, so that the model's accuracy improves. This process is repeated many times until the model converges to a stable level of accuracy. Multiple passes over the training data may be needed to achieve this. Each pass is called an epoch. And for small training data sets, the data may be cached in server memory after the first epoch. During training, the AI model state is periodically saved or checkpointed to guard against failure.
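To make that loop concrete, here is a minimal PyTorch-style sketch of the process just described; the model, dataset, and hyperparameters are illustrative placeholders rather than anything specific from this talk.

```python
# Minimal sketch of the training loop described above (PyTorch-style).
# Model, dataset, and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=10, batch_size=256, ckpt_path="checkpoint.pt"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # randomized batches
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):              # multiple passes (epochs) over the training data
        for features, labels in loader:      # batches read from storage (or server memory cache)
            outputs = model(features)        # forward pass: generate inference outputs
            loss = loss_fn(outputs, labels)  # compare to ground truth to compute the loss score
            optimizer.zero_grad()
            loss.backward()                  # compute gradients
            optimizer.step()                 # update model weights to improve accuracy
        # periodically save (checkpoint) model state to guard against failure
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, ckpt_path)
```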
Limited information exists on storage read performance requirements for training. If the goal is to keep GPUs busy, a reasonable approach is to use AI GPU benchmarks to determine peak performance for a variety of models and then work backwards to determine the storage read performance needed to sustain that amount of work. The MLCommons MLPerf Training benchmark is ideal for this purpose. It's designed to saturate GPUs while training a variety of popular AI models representing different use cases. Results from multiple sources are publicly available for analysis.
This table shows results from the MLPerf Training version 3.0 submissions for NVIDIA H100 80-gigabyte GPUs, the top-performing GPU at the time. The first three columns show the models that were trained, their sizes in parameters, and details about their associated training data sets. The fourth column provides the size in bytes of each individual training example fed to the model during training. The fifth and sixth columns specify the number of GPUs and the throughput they were able to achieve in terms of training examples per second. Eight-GPU results are shown to represent the typical configuration of high-end AI servers. For GPT-3, 32-GPU results are shown because a single model instance required that many GPUs. The final column estimates the amount of storage read bandwidth needed to sustain the associated level of GPU throughput. This reflects storage read performance requirements while training data is being read during the first epoch, and during subsequent epochs if the data set is too large to fit in memory. The estimated storage read bandwidths vary widely across the different models. For example, GPT-3 only needs around 150 megabytes per second to keep 32 H100 GPUs busy. This is because GPT-3 is extremely compute bound and its tokenized text training examples are relatively small at 8 kilobytes each. Conversely, eight H100 GPUs training 3D U-Net can require over 40 gigabytes per second to stay highly utilized. In this case, 3D U-Net is much less compute bound and its training example images are very large, about 92 megabytes each, hence the higher read bandwidth needed to keep up with the GPUs. ResNet-50 falls somewhere in the middle, requiring approximately 6.1 gigabytes per second to keep eight H100 GPUs fully utilized. These examples show the trade-offs between model computational complexity and training example size that must be considered when estimating requirements. Remember, big models don't always need lots of storage read bandwidth.
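The read bandwidth estimates in the final column follow from a simple calculation: multiply the achieved training throughput, in examples per second, by the size of each training example. Here is a rough sketch of that arithmetic; the throughput figures are back-derived from the bandwidths and example sizes quoted above, so they are approximations for illustration only.

```python
# Back-of-the-envelope storage read bandwidth: examples per second times bytes per example.
# Throughput values are approximations back-derived from the bandwidths quoted above.
def read_bandwidth_gbps(examples_per_sec, example_bytes):
    return examples_per_sec * example_bytes / 1e9   # GB/s

print(read_bandwidth_gbps(18_750, 8_000))        # GPT-3: ~8 KB examples   -> ~0.15 GB/s
print(read_bandwidth_gbps(440, 92_000_000))      # 3D U-Net: ~92 MB examples -> ~40 GB/s
print(read_bandwidth_gbps(40_000, 150_000))      # ResNet-50: ~150 KB examples -> ~6 GB/s
```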
Storage system performance often depends on the way in which data is accessed. As an example, we'll look at the I/O access patterns generated while simulating the training of a ResNet-50 image classification model using the DLIO benchmark. DLIO is an AI storage benchmark that emulates GPUs but aims to preserve all other aspects of AI model training, including using real deep learning framework data loaders and training data file formats. DLIO is designed to generate stressful AI workloads to measure storage performance without requiring GPUs. In this case, the training data set is a collection of TFRecord files, each containing 1,024 150-kilobyte image tensors. The training data will be read using TensorFlow's data loader library.
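For illustration, here is a minimal tf.data sketch of the kind of TFRecord input pipeline being exercised; the file pattern and feature schema are assumptions made for the example, not details taken from the DLIO configuration.

```python
# Minimal sketch of a TFRecord input pipeline like the one emulated here.
# The file pattern and feature schema are illustrative assumptions.
import tensorflow as tf

def make_dataset(file_pattern="train/*.tfrecord", batch_size=256):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = tf.data.TFRecordDataset(files)                   # sequential reads within each file
    ds = ds.shuffle(buffer_size=10_000)                   # randomize training examples
    ds = ds.map(
        lambda rec: tf.io.parse_single_example(
            rec, {"image": tf.io.FixedLenFeature([], tf.string),
                  "label": tf.io.FixedLenFeature([], tf.int64)}),
        num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)  # overlap reads with training
    return ds
```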
I/O traces were collected while the training data was read from an NFS share. The graph at top left shows a steady stream of 64 to 256-kilobyte I/Os that continuously read the training data during model training. The chart at right shows that the distribution of I/O sizes didn't materially change as the number of training examples per batch was varied. The graph at bottom left shows that, in this case, the training files were read sequentially one after the other. Therefore, using the prior estimate for this workload, an AI storage system may need to deliver 6.1 gigabytes per second of read bandwidth as a sequential read stream of 64 to 256-kilobyte I/Os to achieve full GPU utilization while training ResNet-50. However, other models may exhibit a more random I/O access pattern using different I/O sizes. The implication is that AI storage systems need to perform well for a variety of access patterns.
Reading the same ResNet-50 training data over S3 results in a very different I/O profile. The chart at top left shows that much larger 20 to 50 megabyte I/Os are used to read the training data. The chart at top right shows that, once again, the I/O size distribution doesn't change with batch size. The chart at bottom left shows that the files are still accessed sequentially in turn. The performance implications of this significantly different I/O access pattern depend on the particular object storage and S3 software library used. While throughput is a concern for both NFS and S3, latency may be an additional consideration when using S3 as training cannot start or continue until these large I/Os complete.
Another consideration when choosing between NFS and S3 is the benefit of the OS page cache when training multiple models within a single server. In this case, the individual model instances may all read the same training data. When the training data is accessed over NFS, the OS page cache may satisfy duplicate I/O requests from the different models and avoid extra storage read accesses. Because S3 bypasses the OS page cache, there is no server-side caching, and duplicate read I/Os may be sent to storage, increasing storage read performance requirements. This example shows that the storage protocol used to read training data is another important consideration when choosing AI storage solutions.
Let's now talk about checkpoints. Large AI models may take days or weeks to train. During that time, model weights are constantly changing as training data is processed. The monetary value of these weights increases over time due to the computational resources consumed to produce them. Checkpoints protect against losing this valuable data by periodically saving the current model weights and other state to persistent storage. Checkpoints are typically saved as one or more files, each sequentially written by a single writer. The number of files depends on the degree of model parallelism and how a particular model may implement checkpoints. When using data parallelism, only a single model instance needs to be saved and not the memory content of all the GPUs in the system. Since training is typically paused during checkpointing, quickly saving checkpoints is a key storage performance requirement.
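As an illustration of that single-writer pattern under data parallelism, here is a minimal sketch, assuming a PyTorch-style distributed setup, in which only one replica writes the checkpoint file while the others wait; the path and framework details are assumptions.

```python
# Sketch: with data parallelism, only one replica (rank 0) writes the checkpoint,
# since every replica holds identical weights. Setup details are assumptions.
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="/mnt/checkpoints/ckpt.pt"):
    if not dist.is_initialized() or dist.get_rank() == 0:
        # single writer, sequentially writing one checkpoint file
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, path)
    if dist.is_initialized():
        dist.barrier()   # training stays paused until the checkpoint completes
```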
This table estimates the aggregate write bandwidth required to complete checkpoints of different sizes within different time constraints. The first column at left specifies different model sizes in terms of the number of parameters. The second column estimates the total checkpoint size using a common rule of thumb of 14 bytes per parameter, the model weights plus optimizer state. The remaining columns provide estimated aggregate write bandwidths required to complete the checkpoints within different time constraints expressed as a percentage of an assumed two-hour checkpoint interval. For example, a 175-billion-parameter model produces 2.4 terabytes of checkpoint data. Saving that checkpoint within 360 seconds, or 5% of the two-hour checkpoint interval, requires 6.8 gigabytes per second of aggregate write bandwidth. If accomplished, this means 95% of the two hours will be available for model training. These estimates show that storage write bandwidth requirements can vary greatly based on model size and time limits. Large models with tight checkpoint time limits can require significant write bandwidth. However, the requirements drop significantly when models are smaller or more time is allowed to checkpoint. Understanding these factors is important while assessing the cost trade-offs between storage solutions and idle GPUs.
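The write bandwidth figures follow from dividing the estimated checkpoint size by the allowed save window. A quick sketch of the arithmetic:

```python
# Checkpoint write bandwidth estimate: ~14 bytes per parameter (weights plus
# optimizer state) written within a fraction of the checkpoint interval.
def checkpoint_write_bandwidth_gbps(params, interval_s=2 * 3600, fraction=0.05,
                                    bytes_per_param=14):
    ckpt_bytes = params * bytes_per_param              # ~2.45 TB for 175B parameters
    return ckpt_bytes / (interval_s * fraction) / 1e9  # GB/s

# 175 billion parameters saved within 5% of a two-hour interval (360 seconds)
# requires roughly 6.8 GB/s of aggregate write bandwidth.
print(checkpoint_write_bandwidth_gbps(175e9))
```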
Checkpoints enable the resumption of model training after an unplanned or planned interruption. This requires restoring checkpoint data to all GPUs involved in training. Each checkpoint file is read sequentially by one or more readers based on the number of model instances being restored. For example, the diagram at left shows one checkpoint file being loaded into three GPUs, each holding the same portion of the model across three instances. The other checkpoint files would be similarly read three times to restore the remaining GPUs. Checkpoint files are typically read in parallel, resulting in a larger number of concurrent sequential read streams to multiple files. Since model training is unable to resume until the checkpoint has been restored to all GPUs, it is desirable to complete the restore quickly.
This table estimates the aggregate storage read bandwidth required to restore checkpoints of various sizes within five minutes for different numbers of model instances. The five minute time limit is just an example. Actual checkpoint restore time limits depend on how often restores are needed and the expectations of AI engineers. As before, the first two columns show different model sizes and their associated checkpoint sizes. The remaining columns show the aggregate read bandwidth estimates. For example, restoring a 175 billion parameter model to 16 model instances within five minutes requires 2.18 gigabytes per second of read bandwidth to read the 2.4 terabyte checkpoint 16 times. Here again, requirements vary greatly depending on model size, training parallelism, and the amount of time allowed to restore checkpoints.
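For completeness, here is a rough sketch of the restore-side arithmetic, assuming every model instance reads the full checkpoint from storage within the time limit; the model size and instance count in the example are hypothetical, not values from the table.

```python
# Rough estimate of aggregate restore read bandwidth, assuming each model instance
# reads the entire checkpoint from storage within the time limit.
def restore_read_bandwidth_gbps(params, num_instances, time_limit_s=300,
                                bytes_per_param=14):
    ckpt_bytes = params * bytes_per_param
    return ckpt_bytes * num_instances / time_limit_s / 1e9   # GB/s

# Hypothetical example: a 70-billion-parameter model restored to 4 instances
# within five minutes needs roughly 13 GB/s of aggregate read bandwidth.
print(restore_read_bandwidth_gbps(70e9, 4))
```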
So far, this discussion has focused on singular workloads. In reality, a modern GPU cluster may be hosting a large number of AI workloads at different stages of the AI lifecycle. Distributed schedulers are used to assign jobs to servers in the cluster. Equal access to data, regardless of placement, is often assumed to simplify job deployment and to avoid extraneous data copies. This means AI storage must be able to meet the performance needs of multiple workloads with significantly different AI access patterns, all from a single namespace, and must be able to scale out as the GPU cluster grows with business needs.
To summarize, storage is involved in all stages of the AI lifecycle. Requirements vary widely across lifecycle stages, different AI model types, and AI infrastructure user expectations. Concurrent sequential and random read performance is important for feeding GPUs with model training data and restoring checkpoints. Single-threaded sequential write performance is important for quickly saving model checkpoints. Storage systems for modern AI GPU clusters must be able to handle a mixture of workloads from a single namespace with the ability to scale out capacity and performance as needed. So choosing an appropriate AI storage solution requires insight into expected workloads and service expectations to meet needs while also managing acquisition and total ownership costs.
Finally, this presentation talked about performance, which is often the focus of many AI storage conversations. However, AI also needs traditional enterprise storage capabilities, like data protection, high availability, encryption, DDoS protection, security, data lifecycle management, and more. These capabilities quickly become important once performance requirements are met. As said in the beginning, data is the fuel for AI. Protecting and managing that data is critical. AI storage systems must do more than provide high performance.
Thank you for attending this session. I hope it has helped you to better understand AI storage requirements. Please contact me at the below email with any feedback or questions, or leave those questions and comments in the chat. I hope you enjoy the rest of the conference. Thank you. Goodbye.