
Performance Study Of Coded Computation For Distributed Machine Learning

Posted on: 2021-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: N S Li
Full Text: PDF
GTID: 2518306107497294
Subject: Software engineering
Abstract/Summary:
With the growth of models and data sets, running large-scale machine learning algorithms on a distributed cluster has become common practice. The whole algorithm and its training data are divided into several tasks, each of which runs on a different worker node; finally, the master node combines the results of all tasks to obtain the result of the whole algorithm. When a distributed cluster contains a large number of nodes, some worker nodes, called stragglers, inevitably run slower than the others due to resource contention and other causes, so the tasks assigned to them take significantly longer than those on other nodes. Compared with running replicated tasks on multiple nodes, coded computation makes efficient use of computation and storage redundancy to mitigate the effect of stragglers and communication bottlenecks in large-scale machine learning clusters. Researchers have proposed many coding schemes for straggler tolerance, but most of this work focuses on reducing the recovery threshold of the schemes; the performance overhead these schemes incur in distributed machine learning has not been investigated and analyzed. This paper studies and discusses this problem.

This paper defines two performance indexes of coded computation for distributed machine learning. The first is the task completion time of the distributed machine learning algorithm, i.e., the time required to complete the whole computing task. The second is the total machine computing time of the whole computing task, i.e., the sum of the time that all workers in the distributed computing system spend computing the task. In practical applications, the task completion time determines how long one must wait for the distributed machine learning algorithm to finish, while the total machine computing time is directly related to the cost of the computation. Therefore, this paper focuses on these two types of overhead.

This paper derives the density function of the time at which the i-th worker completes its computation task in a distributed system of n workers, and presents expressions for the task completion time and the total machine computing time when the workers' completion times are uniformly distributed. It then compares and analyzes the task completion time and the total machine computing time of three coding schemes applied to matrix multiplication under this scenario, providing a basis for scheme selection. After a coding scheme has been selected, the paper also gives a method for choosing the scheme's parameters under constrained conditions, together with a heuristic parameter-selection algorithm.
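To make the straggler-tolerance idea concrete, here is a minimal sketch of an (n, k) MDS-style coded matrix-vector multiplication, a standard scheme in the coded-computation literature. The function names and the random generator matrix are illustrative assumptions, not the thesis's specific schemes: the point is only that any k of the n worker results suffice to recover A·x, so the n − k slowest workers can be ignored.

```python
import numpy as np

def encode_and_run(A, x, n, k, rng):
    """Simulate n workers: split A into k row-blocks, form n random
    linear combinations of the blocks (an MDS-style code), and have
    each worker multiply its coded block by x."""
    blocks = np.split(A, k)                      # k row-blocks of A
    G = rng.standard_normal((n, k))              # random generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return [Ai @ x for Ai in coded], G           # one result per worker

def decode(results, indices, G):
    """Recover A @ x from the results of any k finished workers."""
    Gk = G[np.array(indices), :]                 # k x k decoding matrix
    partial = np.stack([results[i] for i in indices])
    return np.linalg.solve(Gk, partial).reshape(-1)
```

For example, with n = 5 workers and k = 3, the results of whichever three workers finish first reconstruct the full product, so up to two stragglers are tolerated.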
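Under the uniform-time assumption, the two performance indexes can be checked against closed-form order-statistic values by simulation. The model below is an illustrative assumption, not necessarily the thesis's exact model: per-worker times are Uniform(0, 1) and stragglers are cancelled the moment the k fastest workers finish, so for Uniform(0, 1) the expected i-th order statistic among n workers is i/(n + 1).

```python
import numpy as np

def performance_indexes(n, k, trials=200_000, seed=1):
    """Monte Carlo estimates of the two indexes for an (n, k) scheme:
    task completion time = instant the k-th fastest worker finishes;
    total machine computing time = time summed over all workers,
    with stragglers cancelled at the completion instant."""
    rng = np.random.default_rng(seed)
    T = rng.uniform(0.0, 1.0, size=(trials, n))    # per-worker times
    t_done = np.sort(T, axis=1)[:, k - 1]          # k-th order statistic
    total = np.minimum(T, t_done[:, None]).sum(axis=1)
    return t_done.mean(), total.mean()
```

This illustrates the trade-off the paper analyzes: lowering the recovery threshold k reduces the task completion time, but the extra redundancy raises the total machine computing time, i.e., the cost.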
Keywords/Search Tags: coding technology, distributed machine learning, distributed computing, straggler tolerance, coded computation, performance study