
Research On Data Parallel Communication Strategy For Distributed Machine Learning System

Posted on: 2022-09-25  Degree: Master  Type: Thesis
Country: China  Candidate: X L Wang  Full Text: PDF
GTID: 2518306545955419  Subject: Software engineering
Abstract/Summary:
With the rapid growth in the scale of machine learning models and training data, a single node can no longer meet the computation and storage requirements of large-scale training, so running large-scale machine learning algorithms on distributed clusters has become common practice. The key to distributed machine learning is how to partition the training data, assign training tasks, allocate computing resources, and integrate the distributed training results so as to balance training speed against training accuracy. Because the main bottleneck in large-scale machine learning is still slow training caused by the sheer volume of training data, the prevailing approach is data parallelism. The most widely used communication strategy for data-parallel training is the bulk synchronous parallel strategy, but under it the training speed is limited by the slowest computing node in the cluster, which makes model training too slow. To address this problem, the research community has proposed an alternative parameter-communication strategy, the asynchronous parallel strategy. It exploits the computing capacity of the cluster to the fullest, but it makes global parameter updates delayed and inconsistent, which degrades model convergence. In view of the shortcomings of these two communication strategies, this thesis studies communication strategies for distributed machine learning. The main contributions are as follows:

1. The synchronous and asynchronous parallel strategies are combined into a hybrid strategy. Unbalanced computation speeds across nodes are simulated in the experiments, and an algorithm divides the nodes into groups according to their computation speed: nodes in the same group train with the synchronous parallel strategy, while different groups train asynchronously with respect to one another. This grouping reduces the synchronization overhead within each group, and because each group aggregates the results of several nodes before updating the global parameters, it also mitigates the slowdown in model convergence caused by asynchronous parallelism. Comparative experiments show that, when node speeds are unbalanced, the hybrid parallel strategy achieves better performance than either the purely synchronous or the purely asynchronous strategy.

2. Because nodes in a distributed cluster suffer from resource contention, differences in machine performance, and unexpected failures, straggler nodes appear whose tasks take significantly longer than those of the other nodes. When severe stragglers are present, the hybrid synchronous-asynchronous strategy alone performs poorly. Gradient coding is therefore introduced within the groups of the hybrid parallel strategy, and a new grouping strategy is proposed. Experiments show that gradient coding reduces the impact of stragglers in distributed machine learning, further accelerating distributed training and improving the training efficiency of the whole model.
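To make the two ideas above concrete, the following is a minimal Python sketch, not the thesis implementation: the function names (group_by_speed, coded_assignment, decode_group_gradient), the toy parameter server, and the fractional-repetition form of gradient coding are all illustrative assumptions.

```python
import numpy as np

def group_by_speed(step_times, num_groups):
    """Partition worker ids into groups of similar measured speed.
    step_times: dict worker_id -> average seconds per mini-batch."""
    ranked = sorted(step_times, key=step_times.get)        # fastest first
    size = -(-len(ranked) // num_groups)                   # ceil division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

class ParameterServer:
    """Toy parameter server: every push is applied immediately, so different
    groups update the global model asynchronously with respect to one another."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, grad):
        self.w -= self.lr * np.asarray(grad)

    def pull(self):
        return self.w.copy()

def group_step(server, member_grads):
    """Synchronous step inside one group: wait for every member's gradient,
    average them, then push one aggregated update to the server."""
    server.push(np.mean(member_grads, axis=0))

def coded_assignment(group_workers, s):
    """Fractional-repetition gradient coding inside a group (assumed scheme):
    workers are tiled into sub-groups of s+1 replicas; every replica holds the
    same s+1 data partitions and sends the sum of their gradients, so up to s
    stragglers per sub-group can be ignored."""
    assert len(group_workers) % (s + 1) == 0
    subgroups = [group_workers[i:i + s + 1]
                 for i in range(0, len(group_workers), s + 1)]
    assignment = {w: list(range(k * (s + 1), (k + 1) * (s + 1)))
                  for k, sub in enumerate(subgroups) for w in sub}
    return subgroups, assignment

def decode_group_gradient(subgroups, received):
    """received: dict worker_id -> summed gradient that worker sent back.
    The group gradient is one response per sub-group, added together; a
    sub-group whose replicas all straggle is not handled in this sketch."""
    return sum(received[next(w for w in sub if w in received)]
               for sub in subgroups)
```

For example, with eight workers in a group and s = 1, coded_assignment yields four sub-groups of two replicas each, and decode_group_gradient can recover the group gradient as soon as one replica per sub-group has responded; grouping by measured speed meanwhile confines synchronous waiting to workers of comparable speed, which is the intuition behind the hybrid strategy.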
Keywords/Search Tags: synchronous parallel, asynchronous parallel, distributed machine learning, data parallel, gradient coding