
Research on Performance Guarantee of Distributed DNN Training with Serverless Architectures

Posted on: 2022-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Qin
Full Text: PDF
GTID: 2518306482489464
Subject: Computer Science and Technology

Abstract/Summary:
As the paradigm of next-generation cloud computing, serverless computing further abstracts cloud resources in the form of functions. Cloud service providers are responsible for provisioning, managing, deploying, and scaling the resources required by user applications, and bill users at millisecond granularity. Users can focus on programming and pay only for the time and resources actually consumed while their applications run. Owing to this efficiency and low monetary cost, training Distributed Deep Neural Network (DDNN) workloads on serverless computing platforms is becoming a trend, as it allows users to decompose a complex model training task into a number of functions without configuring or managing cluster resources.

Although serverless computing offers users a simple resource interface (i.e., the number and memory size of functions), allocating function resources for DDNN training workloads remains challenging. The main reason is that inadequate function resource provisioning (i.e., either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance on serverless platforms. Our motivating experiments with DDNN training workloads on AWS Lambda show that such unpredictable performance is mainly caused by two key factors. First, the network I/O bandwidth of the Parameter Server (PS) can easily become the resource bottleneck. Second, the small local batch size used in DDNN training can lead to low resource utilization of functions.

To solve these performance issues, this thesis designs and implements λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving users' budget. Specifically, a lightweight analytical DDNN training performance model is built by leveraging the available PS network bandwidth and the impact of the small local batch size on function CPU utilization. Based on this model, a serverless resource allocation policy for DDNN training workloads is designed by analyzing the upper and lower bounds of the memory size and the number of functions. λDNN guarantees DDNN training performance while reducing the monetary cost of DDNN training workloads by optimizing the function resource provisioning plan.

Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN delivers predictable DDNN training performance and saves the monetary cost of function resources by up to 66.7% compared with state-of-the-art resource provisioning strategies, with an acceptable runtime overhead.
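The abstract does not spell out the performance model or the allocation policy in detail. As a rough illustration only, the following Python sketch shows the general shape of such a provisioning search: a toy analytical model that captures the two bottlenecks named above (PS network I/O and CPU under-utilization from a small local batch), and a search over (function count, memory size) pairs for the cheapest plan that meets a performance target. All coefficients (ps_bandwidth_mbps, the 32-sample saturation batch, the per-vCPU throughput, flops_per_sample) are invented for the example; the memory-proportional CPU share (about 1 vCPU at 1,769 MB) and the per-GB-second price mirror AWS Lambda's published behavior but should be treated as assumptions here, not as the thesis's actual model.

```python
def predicted_iter_time(n_funcs, mem_mb, model_size_mb, batch_size,
                        ps_bandwidth_mbps=500.0, flops_per_sample=1e7):
    """Toy per-iteration time model (illustrative coefficients only).

    Captures the two bottlenecks the thesis identifies:
      - PS network I/O: every function pushes and pulls the model through
        the shared PS bandwidth, so communication grows with n_funcs.
      - CPU utilization: Lambda allocates CPU in proportion to memory,
        and a small local batch under-utilizes that CPU share.
    """
    # Communication: all n functions exchange the model (in megabits)
    # over the shared PS bandwidth each iteration.
    comm_time = 2 * n_funcs * model_size_mb * 8 / ps_bandwidth_mbps
    # Computation: CPU share scales with memory (~1 vCPU at 1769 MB);
    # the utilization factor models the small-local-batch penalty.
    local_batch = batch_size / n_funcs
    cpu_util = min(1.0, local_batch / 32.0)          # assumed saturation point
    flops_per_sec = (mem_mb / 1769.0) * 1e9 * cpu_util  # assumed 1 GFLOPS/vCPU
    comp_time = local_batch * flops_per_sample / flops_per_sec
    return comm_time + comp_time

def cheapest_plan(target_time, model_size_mb, batch_size,
                  mem_range=(512, 10240), n_range=(1, 64),
                  price_per_gb_sec=0.0000166667):
    """Enumerate (n, memory) pairs within the given bounds and return the
    lowest-cost plan whose predicted iteration time meets the target."""
    best = None
    for n in range(n_range[0], n_range[1] + 1):
        for mem in range(mem_range[0], mem_range[1] + 1, 64):
            t = predicted_iter_time(n, mem, model_size_mb, batch_size)
            if t > target_time:
                continue
            cost = n * (mem / 1024) * t * price_per_gb_sec
            if best is None or cost < best[0]:
                best = (cost, n, mem, t)
    return best

if __name__ == "__main__":
    plan = cheapest_plan(target_time=8.0, model_size_mb=100, batch_size=1024)
    if plan:
        cost, n, mem, t = plan
        print(f"{n} functions x {mem} MB -> {t:.2f} s/iter, ${cost:.6f}/iter")
```

Per the abstract, λDNN derives upper and lower bounds on the memory size and function count analytically from its performance model; the brute-force enumeration above merely stands in for that bounded search in this sketch.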
Keywords/Search Tags: Distributed DNN training, serverless computing, predictable performance, function resource provisioning