
Research on Performance Guarantee of Distributed DNN Training with Serverless Architectures

Posted on: 2022-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Qin
Full Text: PDF
GTID: 2518306482489464
Subject: Computer Science and Technology

Abstract/Summary:
As the paradigm of next-generation cloud computing, serverless computing further abstracts cloud resources in the form of functions. Cloud service providers are responsible for provisioning, managing, deploying, and scaling the resources required by user applications, and bill users at millisecond granularity. Users can focus on programming and pay only for the time and resources actually consumed while their applications run. Owing to this efficiency and low monetary cost, training Distributed Deep Neural Network (DDNN) workloads on serverless computing platforms is becoming a trend, as it allows users to decompose a complex model training task into a number of functions without configuring or managing cluster resources.

Although serverless computing offers users a simple resource interface (i.e., the number and memory size of functions), allocating function resources for DDNN training workloads remains challenging. The main reason is that inadequate function resource provisioning (i.e., either under-provisioning or over-provisioning) easily leads to unpredictable DDNN training performance on serverless platforms. Our motivating experiments with DDNN training workloads on AWS Lambda show that such unpredictable performance is mainly caused by two key factors. First, the network I/O bandwidth of the Parameter Server (PS) can easily become the resource bottleneck. Second, the small local batch size used in DDNN training can lead to low resource utilization of functions.

To solve these performance issues, this thesis designs and implements λDNN, a cost-efficient function resource provisioning framework that provides predictable performance for serverless DDNN training workloads while saving users' budget. Specifically, a lightweight analytical DDNN training performance model is built by leveraging the available PS network bandwidth and the impact of the small local batch size on function CPU utilization. Based on this model, a serverless resource allocation policy for DDNN training workloads is designed by analyzing the upper and lower bounds of the memory size and the number of functions. λDNN guarantees DDNN training performance while reducing the monetary cost of DDNN training workloads by optimizing the function resource provisioning plan.

Extensive prototype experiments on AWS Lambda and complementary trace-driven simulations demonstrate that λDNN delivers predictable DDNN training performance and saves the monetary cost of function resources by up to 66.7% compared with state-of-the-art resource provisioning strategies, with an acceptable runtime overhead.
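The abstract does not spell out the performance model or the allocation policy in detail. As a rough illustration only, the following Python sketch shows the general shape of such a provisioning search: a toy analytical model that captures the two bottlenecks named above (PS network I/O and CPU under-utilization from a small local batch), and a search over (function count, memory size) pairs for the cheapest plan that meets a performance target. All coefficients (ps_bandwidth_mbps, the 32-sample saturation batch, the per-vCPU throughput, flops_per_sample) are invented for the example; the memory-proportional CPU share (about 1 vCPU at 1,769 MB) and the per-GB-second price mirror AWS Lambda's published behavior but should be treated as assumptions here, not as the thesis's actual model.

```python
def predicted_iter_time(n_funcs, mem_mb, model_size_mb, batch_size,
                        ps_bandwidth_mbps=500.0, flops_per_sample=1e7):
    """Toy per-iteration time model (illustrative coefficients only).

    Captures the two bottlenecks the thesis identifies:
      - PS network I/O: every function pushes and pulls the model through
        the shared PS bandwidth, so communication grows with n_funcs.
      - CPU utilization: Lambda allocates CPU in proportion to memory,
        and a small local batch under-utilizes that CPU share.
    """
    # Communication: all n functions exchange the model (in megabits)
    # over the shared PS bandwidth each iteration.
    comm_time = 2 * n_funcs * model_size_mb * 8 / ps_bandwidth_mbps
    # Computation: CPU share scales with memory (~1 vCPU at 1769 MB);
    # the utilization factor models the small-local-batch penalty.
    local_batch = batch_size / n_funcs
    cpu_util = min(1.0, local_batch / 32.0)          # assumed saturation point
    flops_per_sec = (mem_mb / 1769.0) * 1e9 * cpu_util  # assumed 1 GFLOPS/vCPU
    comp_time = local_batch * flops_per_sample / flops_per_sec
    return comm_time + comp_time

def cheapest_plan(target_time, model_size_mb, batch_size,
                  mem_range=(512, 10240), n_range=(1, 64),
                  price_per_gb_sec=0.0000166667):
    """Enumerate (n, memory) pairs within the given bounds and return the
    lowest-cost plan whose predicted iteration time meets the target."""
    best = None
    for n in range(n_range[0], n_range[1] + 1):
        for mem in range(mem_range[0], mem_range[1] + 1, 64):
            t = predicted_iter_time(n, mem, model_size_mb, batch_size)
            if t > target_time:
                continue
            cost = n * (mem / 1024) * t * price_per_gb_sec
            if best is None or cost < best[0]:
                best = (cost, n, mem, t)
    return best

if __name__ == "__main__":
    plan = cheapest_plan(target_time=8.0, model_size_mb=100, batch_size=1024)
    if plan:
        cost, n, mem, t = plan
        print(f"{n} functions x {mem} MB -> {t:.2f} s/iter, ${cost:.6f}/iter")
```

Per the abstract, λDNN derives upper and lower bounds on the memory size and function count analytically from its performance model; the brute-force enumeration above merely stands in for that bounded search in this sketch.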
Keywords/Search Tags: Distributed DNN training, serverless computing, predictable performance, function resource provisioning