Optimizing Elastic Deep Learning in GPU Clusters with AdaptDL for PyTorch
AdaptDL monitors training job performance in real time and elastically rescales resources (GPUs, compute instances) while jobs are running. For each training job, AdaptDL automatically tunes the batch size, learning rate, and gradient accumulation. In the cloud (e.g. AWS), AdaptDL can auto-scale the number of provisioned Spot Instances. We’ve seen shared-cluster training jobs at Petuum and our partners complete 2–3x faster on average, at 3x lower cost on AWS using Spot Instances!
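To make this concrete, here is a minimal sketch of how a PyTorch training script opts into AdaptDL's tuning, based on the adaptdl.torch module described in the project documentation. The toy model, dataset, and batch-size bounds are placeholders of ours; exact argument names should be checked against the current API.

```python
import torch
import torch.nn as nn
import adaptdl.torch as adl

# Join the elastic process group; AdaptDL re-initializes it whenever
# the job is rescaled onto a different number of replicas.
adl.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

model = nn.Linear(10, 2)  # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The wrapper keeps model and optimizer state consistent across rescaling
# and applies AdaptDL's learning-rate scaling as the batch size changes.
model = adl.AdaptiveDataParallel(model, optimizer)

dataset = torch.utils.data.TensorDataset(
    torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# batch_size is the initial global batch size; autoscale_batch_size lets
# AdaptDL raise it (falling back to gradient accumulation when a replica
# hits its limit) up to the given ceiling, within per-replica bounds.
dataloader = adl.AdaptiveDataLoader(dataset, batch_size=64, drop_last=True)
dataloader.autoscale_batch_size(1024, local_bsz_bounds=(16, 256))
```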
AdaptDL from Petuum is an open-source resource-adaptive deep learning (DL) training and scheduling framework. AdaptDL makes distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud. It can automatically determine the optimal number of resources for a job's needs and will add or remove resources dynamically to ensure the highest level of performance.
In our benchmark studies with AdaptDL on Amazon Web Services (AWS), we recorded cost reductions of up to 80% when AdaptDL was configured to automatically provision Spot Instances whenever they were available.
Using its scheduler to leverage elasticity, AdaptDL quickly scales jobs in and out as the cluster's pool of available resources changes, allowing for faster job completion and more efficient resource allocation.
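Elasticity means a running job can be checkpointed, stopped, and resumed on a different set of replicas at any time, so the training loop itself must be restart-safe. Continuing the sketch above with the remaining_epochs_until and Accumulator helpers from the AdaptDL docs (the loop details are illustrative):

```python
loss_fn = nn.CrossEntropyLoss()
stats = adl.Accumulator()  # dict-like accumulator whose state survives restarts

# remaining_epochs_until(30) resumes from the last completed epoch if the
# job was rescaled and restarted partway through training.
for epoch in adl.remaining_epochs_until(30):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        stats["loss_sum"] += loss.item()
        stats["batches"] += 1
    with stats.synchronized():  # aggregate the counters across all replicas
        print(f"epoch {epoch}: mean loss "
              f"{stats['loss_sum'] / stats['batches']:.4f}")
```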
AutoDist is a distributed deep learning training engine. It provides an easy-to-use interface for automatically distributing the training of a wide variety of deep learning models across many CPUs and GPUs at scale, with minimal code changes.
AutoDist allows developers to scale a model from a single GPU to many without requiring changes to their model-building scripts. We’ve approached this from a different perspective: graph optimization with a composable representation of strategies, enabling either manual crafting or automated selection of the best options for training your model, as sketched below.
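As a rough illustration of that strategy-based interface, the sketch below follows the usage pattern shown in AutoDist's examples. AutoDist targets TensorFlow graphs; the resource_spec.yml path is a placeholder, the PS strategy builder is just one of several composable options, and exact API details may differ across AutoDist versions.

```python
import tensorflow as tf
from autodist import AutoDist
from autodist.strategy import PS  # parameter-server strategy builder

# resource_spec.yml (placeholder path) lists the nodes, CPUs, and GPUs
# that AutoDist may distribute training across.
autodist = AutoDist(resource_spec_file="resource_spec.yml",
                    strategy_builder=PS())

with tf.Graph().as_default(), autodist.scope():
    # Build the model exactly as for a single device; AutoDist rewrites
    # the graph according to the chosen strategy.
    x = tf.random.normal([32, 10])  # toy inputs and targets
    y = tf.random.normal([32, 1])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.1).minimize(loss)

    # Sessions are created through AutoDist rather than TensorFlow directly.
    sess = autodist.create_distributed_session()
    for _ in range(10):
        print(sess.run([loss, train_op])[0])
```

Swapping PS() for another builder (e.g. an all-reduce strategy) changes how training is distributed without touching the model code.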