
Provisioning Heterogeneous Spot Instances For Predictable Distributed DNN Training In The Cloud

Posted on: 2024-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: R T Shang
Full Text: PDF
GTID: 2558307067993069
Subject: Computer Science and Technology
Abstract/Summary:
Recently, it has become increasingly popular to deploy Distributed Deep Neural Network (DDNN) training workloads on cloud spot instances to reduce monetary cost. However, spot instances can be revoked by the cloud provider at any time, unexpectedly interrupting the DDNN training process; as a result, training DDNN workloads on spot instances often suffers severe performance degradation. To handle the degradation caused by spot instance revocations, provisioning a heterogeneous cluster with the parameter server (PS) architecture and the asynchronous parallel (ASP) mechanism has become the dominant method for DDNN training on spot instances in the cloud. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance. Extensive motivating experiments show that such unpredictable training performance stems mainly from two causes. First, the more workers provisioned in the cluster, the greater the contention among workers for PS network bandwidth; at the same time, the limited PCIe bandwidth within each worker also caps the DDNN training speed. Second, cluster heterogeneity can affect the DDNN training convergence rate: severe heterogeneity increases gradient staleness under the ASP mechanism, slowing down convergence.

To address the challenges above, this thesis designs and implements SpotDNN, a heterogeneity-aware spot instance provisioning framework that provides predictable performance for DDNN training in the cloud while saving monetary cost for users. Specifically, an analytical performance model of DDNN training in heterogeneous clusters is designed by explicitly considering worker contention for bottleneck resources (i.e., PS network bandwidth and PCIe bandwidth) during communication. The model leverages the weighted average batch size and a convergence coefficient to quantify the DDNN training loss in heterogeneous clusters. Based on this performance model and lightweight workload profiling, a cost-efficient spot instance provisioning strategy is designed, which incorporates bounds calculation for the number of provisioned workers and a sliding window technique to guarantee the DDNN training performance service level objectives (SLOs). Extensive prototype experiments on Amazon EC2 show that SpotDNN provides predictable DDNN training performance for cloud users while reducing the monetary cost of cloud instances by up to 68.1% compared with existing solutions, with tolerable runtime overhead.
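To make the performance-model idea concrete, the sketch below illustrates one plausible way to combine bandwidth contention with heterogeneity: each worker's step rate is bounded by its compute speed and by the slower of its fair share of PS bandwidth and its own PCIe bandwidth, and the weighted average batch size weights each worker's batch by its share of total throughput. All formulas, numbers, and function names here are invented for illustration; the thesis's actual analytical model is not given in this abstract.

```python
# Illustrative sketch (NOT the thesis's actual model): per-worker throughput
# under PS/ASP with contended PS bandwidth and a local PCIe cap, plus the
# throughput-weighted average batch size across a heterogeneous cluster.

def worker_throughput(compute_speed, model_size_gb, ps_bandwidth_gbps,
                      pcie_bandwidth_gbps, num_workers):
    """Steps/sec for one worker, assuming a fair-share contention model:
    communication is limited by min(fair share of PS link, own PCIe link)."""
    net_share = ps_bandwidth_gbps / num_workers      # contended PS link
    comm_bw = min(net_share, pcie_bandwidth_gbps)    # bottleneck link
    comm_time = 2 * model_size_gb / comm_bw          # push + pull gradients
    compute_time = 1.0 / compute_speed               # one local step
    return 1.0 / (compute_time + comm_time)

def weighted_avg_batch_size(workers, model_size_gb, ps_bw, n):
    """Weight each worker's batch size by its share of total throughput,
    so fast workers dominate the effective batch size under ASP."""
    rates = [worker_throughput(w["speed"], model_size_gb, ps_bw,
                               w["pcie_bw"], n) for w in workers]
    total = sum(rates)
    return sum(r / total * w["batch"] for r, w in zip(rates, workers))

# Two heterogeneous workers: a fast one and a slow one (invented numbers).
workers = [{"speed": 4.0, "pcie_bw": 8.0, "batch": 64},
           {"speed": 2.0, "pcie_bw": 8.0, "batch": 32}]
print(weighted_avg_batch_size(workers, 0.5, 10.0, len(workers)))
```

Under these invented numbers the effective batch size lands between the two workers' batch sizes but closer to the faster worker's, which is the intuition the weighted average captures.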
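The provisioning strategy can likewise be sketched as a bounded search: derive an optimistic lower bound on the worker count from the SLO, then scan candidate cluster sizes and keep the cheapest configuration whose predicted training time still meets the SLO. The contention penalty, step rates, and prices below are all made up, and this simple scan only stands in for the thesis's bounds-plus-sliding-window strategy.

```python
# Illustrative sketch (hypothetical strategy, not the thesis's algorithm):
# pick the cheapest worker count whose predicted time meets the SLO.

def predicted_time(num_workers, step_rate_per_worker, steps_needed):
    """Toy predictor: aggregate throughput scales with workers but suffers
    contention (assumed 5% penalty per extra worker, floored at 30%)."""
    contention = max(0.3, 1.0 - 0.05 * (num_workers - 1))
    agg = num_workers * step_rate_per_worker * contention
    return steps_needed / agg

def provision(slo_seconds, price_per_worker, step_rate=2.0,
              steps=10000, max_workers=64):
    # Lower bound on workers: assume perfect scaling (optimistic).
    lo = max(1, int(steps / (step_rate * slo_seconds)))
    best = None
    for n in range(lo, max_workers + 1):
        t = predicted_time(n, step_rate, steps)
        if t <= slo_seconds:                     # SLO satisfied
            cost = n * price_per_worker * t      # instance-hours * price
            if best is None or cost < best[1]:
                best = (n, cost)
    return best

# E.g., a 1000-second SLO at an invented $0.1/worker-second spot price.
print(provision(1000, 0.1))
```

The search starts at the optimistic bound because any smaller cluster cannot meet the SLO even without contention, which is the role the bounds calculation plays in pruning the candidate space.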
Keywords/Search Tags:distributed DNN training, predictable performance, spot instance provisioning, heterogeneous clusters