
Provisioning Heterogeneous Spot Instances For Predictable Distributed DNN Training In The Cloud

Posted on: 2024-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: R T Shang
Full Text: PDF
GTID: 2558307067993069
Subject: Computer Science and Technology
Abstract/Summary:
Recently, it has become increasingly popular to deploy Distributed Deep Neural Network (DDNN) training workloads on cloud spot instances to reduce monetary cost. However, spot instances can be revoked by the cloud provider at any time, unexpectedly interrupting the DDNN training process; as a result, training DDNN workloads on spot instances often suffers severe performance degradation. To handle the degradation caused by spot instance revocations, provisioning a heterogeneous cluster with the parameter server (PS) architecture and the asynchronous parallel (ASP) mechanism has become the dominant method for DDNN training on spot instances in the cloud. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance. Extensive motivating experiments show that such unpredictable training performance stems mainly from two causes. First, the more workers provisioned in the cluster, the greater the contention among workers for PS network bandwidth; at the same time, the limited PCIe bandwidth within each worker also caps the DDNN training speed. Second, cluster heterogeneity can affect the DDNN training convergence rate: severe heterogeneity increases gradient staleness under the ASP mechanism, slowing down convergence.

To address the challenges above, this thesis designs and implements SpotDNN, a heterogeneity-aware spot instance provisioning framework that provides predictable performance for DDNN training in the cloud while saving monetary cost for users. Specifically, an analytical performance model of DDNN training in heterogeneous clusters is designed by explicitly considering worker contention for bottleneck resources (i.e., PS network bandwidth and PCIe bandwidth) during communication. The model leverages the weighted average batch size and a convergence coefficient to quantify the DDNN training loss in heterogeneous clusters. Based on this performance model and lightweight workload profiling, a cost-efficient spot instance provisioning strategy is designed, which incorporates bounds calculation for the number of provisioned workers and a sliding window technique to guarantee the DDNN training performance service level objectives (SLOs). Extensive prototype experiments on Amazon EC2 show that SpotDNN provides predictable DDNN training performance for cloud users while reducing the monetary cost of cloud instances by up to 68.1% compared with existing solutions, with tolerable runtime overhead.
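To make the performance-model idea concrete, the sketch below illustrates one plausible way to combine bandwidth contention with heterogeneity: each worker's step rate is bounded by its compute speed and by the slower of its fair share of PS bandwidth and its own PCIe bandwidth, and the weighted average batch size weights each worker's batch by its share of total throughput. All formulas, numbers, and function names here are invented for illustration; the thesis's actual analytical model is not given in this abstract.

```python
# Illustrative sketch (NOT the thesis's actual model): per-worker throughput
# under PS/ASP with contended PS bandwidth and a local PCIe cap, plus the
# throughput-weighted average batch size across a heterogeneous cluster.

def worker_throughput(compute_speed, model_size_gb, ps_bandwidth_gbps,
                      pcie_bandwidth_gbps, num_workers):
    """Steps/sec for one worker, assuming a fair-share contention model:
    communication is limited by min(fair share of PS link, own PCIe link)."""
    net_share = ps_bandwidth_gbps / num_workers      # contended PS link
    comm_bw = min(net_share, pcie_bandwidth_gbps)    # bottleneck link
    comm_time = 2 * model_size_gb / comm_bw          # push + pull gradients
    compute_time = 1.0 / compute_speed               # one local step
    return 1.0 / (compute_time + comm_time)

def weighted_avg_batch_size(workers, model_size_gb, ps_bw, n):
    """Weight each worker's batch size by its share of total throughput,
    so fast workers dominate the effective batch size under ASP."""
    rates = [worker_throughput(w["speed"], model_size_gb, ps_bw,
                               w["pcie_bw"], n) for w in workers]
    total = sum(rates)
    return sum(r / total * w["batch"] for r, w in zip(rates, workers))

# Two heterogeneous workers: a fast one and a slow one (invented numbers).
workers = [{"speed": 4.0, "pcie_bw": 8.0, "batch": 64},
           {"speed": 2.0, "pcie_bw": 8.0, "batch": 32}]
print(weighted_avg_batch_size(workers, 0.5, 10.0, len(workers)))
```

Under these invented numbers the effective batch size lands between the two workers' batch sizes but closer to the faster worker's, which is the intuition the weighted average captures.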
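The provisioning strategy can likewise be sketched as a bounded search: derive an optimistic lower bound on the worker count from the SLO, then scan candidate cluster sizes and keep the cheapest configuration whose predicted training time still meets the SLO. The contention penalty, step rates, and prices below are all made up, and this simple scan only stands in for the thesis's bounds-plus-sliding-window strategy.

```python
# Illustrative sketch (hypothetical strategy, not the thesis's algorithm):
# pick the cheapest worker count whose predicted time meets the SLO.

def predicted_time(num_workers, step_rate_per_worker, steps_needed):
    """Toy predictor: aggregate throughput scales with workers but suffers
    contention (assumed 5% penalty per extra worker, floored at 30%)."""
    contention = max(0.3, 1.0 - 0.05 * (num_workers - 1))
    agg = num_workers * step_rate_per_worker * contention
    return steps_needed / agg

def provision(slo_seconds, price_per_worker, step_rate=2.0,
              steps=10000, max_workers=64):
    # Lower bound on workers: assume perfect scaling (optimistic).
    lo = max(1, int(steps / (step_rate * slo_seconds)))
    best = None
    for n in range(lo, max_workers + 1):
        t = predicted_time(n, step_rate, steps)
        if t <= slo_seconds:                     # SLO satisfied
            cost = n * price_per_worker * t      # instance-hours * price
            if best is None or cost < best[1]:
                best = (n, cost)
    return best

# E.g., a 1000-second SLO at an invented $0.1/worker-second spot price.
print(provision(1000, 0.1))
```

The search starts at the optimistic bound because any smaller cluster cannot meet the SLO even without contention, which is the role the bounds calculation plays in pruning the candidate space.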
Keywords/Search Tags:distributed DNN training, predictable performance, spot instance provisioning, heterogeneous clusters