
Optimal Design And Implementation Of Distributed Deep Learning Training

Posted on: 2021-04-20    Degree: Master    Type: Thesis
Country: China    Candidate: C G Qu    Full Text: PDF
GTID: 2428330611962808    Subject: Computer technology
Abstract/Summary:
In recent years, with the explosive growth of available data and the continuous improvement of related algorithms, deep learning has achieved breakthrough applications in many areas of artificial intelligence. Innovative applications such as speech recognition, autonomous driving, and image recognition have made "artificial intelligence" the "new favorite" of the current Internet era. While the development of deep learning improves people's quality of life, the datasets used for training have become increasingly large and the structure of algorithm models increasingly complex. A model based on a deep neural network may have millions of parameters if no pruning is done, so a single machine often lacks sufficient memory and computing resources. In industrial production, in order to improve training efficiency and reduce the training time of deep learning models, training tasks are generally parallelized: multiple working nodes are used to train a well-performing deep neural network model in a distributed and efficient way. Distributed parallelization, acting as an accelerator for deep learning training, can significantly improve training efficiency by exploiting multiple machines and multiple GPUs. As a result, the optimal design of distributed deep learning training has become an important research direction in artificial intelligence. Many researchers and companies have optimized and improved distributed architectures and algorithms for deep learning from different perspectives and have accumulated considerable experience with distributed training. However, in practical application research the following problems are found:

(1) In many cases, deep and complex neural networks are difficult to run in a single computing unit with limited memory, so they cannot be trained on large-scale data, and the long training period on a single machine lengthens the overall development or research cycle. Moreover, the code logic of the distributed training framework in TensorFlow differs greatly from that of the single-machine version, which makes it difficult for users to convert their single-machine code into the distributed framework structure before each distributed training run.

(2) In distributed training with TensorFlow there are two parameter update mechanisms: synchronous updating and asynchronous updating. Because TensorFlow's distributed training framework is based on the Parameter Server architecture, parameters are shared between computing nodes only through the parameter servers, which reduces the communication efficiency of the computing nodes and, especially under the synchronous update mechanism, seriously affects the effectiveness of distributed training.

(3) In distributed training in a heterogeneous environment, the asynchronous parameter update mechanism, together with the limited bandwidth and transmission efficiency between nodes, produces large gradient delays. The stability and timeliness of distributed stochastic gradient descent are seriously affected and its performance declines sharply. In practice, the efficiency and accuracy of distributed training therefore often differ considerably from the expected results (a minimal sketch of this staleness effect is given after this list).
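To make problem (3) concrete, the following minimal sketch (not taken from the thesis; all class and variable names are hypothetical) simulates how an asynchronous parameter-server loop ends up applying a gradient that was computed from parameters several updates old:

```python
# Illustrative simulation of gradient staleness under asynchronous
# parameter-server training. Plain ASGD applies the stale gradient as-is.
import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.w = np.zeros(dim)       # global parameters
        self.version = 0             # incremented on every applied update

    def pull(self):
        return self.w.copy(), self.version

    def push(self, grad, pulled_version, lr=0.1):
        staleness = self.version - pulled_version  # updates applied since the pull
        self.w -= lr * grad                        # naive ASGD ignores staleness
        self.version += 1
        return staleness

ps = ParameterServer(dim=4)
rng = np.random.default_rng(0)

# A "slow" worker pulls once, but only pushes after a "fast" worker
# has already applied several updates in the meantime.
w_slow, v_slow = ps.pull()
for _ in range(3):                                 # fast worker: pull/push immediately
    w_fast, v_fast = ps.pull()
    ps.push(rng.normal(size=4), v_fast)

delay = ps.push(rng.normal(size=4), v_slow)
print("staleness of the slow worker's gradient:", delay)   # -> 3
```

The slower the node (or the narrower the link), the larger this staleness becomes, which is exactly the effect that degrades distributed SGD in heterogeneous clusters.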
In response to the above problems, this thesis is devoted to the design and implementation of a parallel acceleration training platform for deep learning. The main contributions are as follows:

(1) For distributed TensorFlow training of deep neural network models, users normally have to rewrite their single-machine TensorFlow training code into the distributed framework structure, which is hard to do correctly, makes distributed training difficult, and may even be impossible because of TensorFlow version problems. Starting from the asynchronous update mechanism, this thesis parses and splits the native distributed TensorFlow framework into a client-server architecture: the user's single-machine TensorFlow code serves as the client, and the remaining framework structure is redeveloped into a new distributed framework, NFDT (New Framework for Distributed Tensorflow). The NFDT framework, acting as the server, establishes a communication connection with the single-machine framework acting as the client. By calling the model structure, optimizer, and loss function of the single-machine TensorFlow code and assembling the complete conditions for distributed training, single-machine TensorFlow code can quickly be run as distributed training.

(2) While building the distributed training platform, we found that under the synchronous update mechanism the effectiveness of distributed TensorFlow training is greatly reduced because it is based on the Parameter Server architecture. For the synchronous update mechanism, this thesis therefore fuses the distributed Horovod framework with the TensorFlow framework. The Horovod distributed framework is split into a similar client-server structure: the single-machine TensorFlow code serves as the client, and the remaining distributed structure is redeveloped on top of Horovod and encapsulated into HFDT (Horovod Framework for Distributed Tensorflow). The HFDT framework, acting as the server, establishes a communication connection with the single-machine framework acting as the client. By calling the model structure, optimizer, and loss function of the single-machine TensorFlow code and assembling the complete conditions for distributed training, single-machine TensorFlow training code can quickly be run as distributed training.

(3) This thesis analyzes the parameter update mechanism in heterogeneous environments. Asynchronous updating introduces delay, and the influence of highly delayed updates on the global parameters is the main reason for the drop in algorithm efficiency. A new dynamic delay-compensated asynchronous stochastic gradient descent algorithm (DDC-ASGD) is presented, which dynamically adjusts the reliability of the momentum and delayed gradient terms according to the size of the delay at each update, thereby reducing the impact of asynchronous delay (a sketch of the general update idea is given after this list). Experiments show that DDC-ASGD greatly improves model performance, solves the two problems of the momentum delay compensation algorithm DC-ASGDK, namely being limited by the number of working nodes and lacking a dynamic delay adjustment mechanism, and further reduces the impact of delay on asynchronous parallelism. Compared with the ASGD and DC-ASGDK algorithms, DDC-ASGD achieves higher accuracy and better model performance on both the Fashion-MNIST and CIFAR-10 datasets.
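The abstract does not give the exact DDC-ASGD update rule, so the following is only a minimal sketch of the general delay-compensation idea in the spirit of DC-ASGD; the momentum handling and the delay-dependent weight `alpha` below are assumptions introduced for illustration, not the thesis's formulation:

```python
# Sketch of a delay-compensated asynchronous SGD update. The diagonal
# surrogate grad*grad follows the DC-ASGD idea; alpha(staleness) and the
# momentum weighting are hypothetical placeholders.
import numpy as np

def delay_compensated_update(w_now, w_pulled, grad, velocity,
                             staleness, lr=0.01, lam=0.04, momentum=0.9):
    """One server-side update for a gradient computed from stale parameters.

    w_now     -- current global parameters
    w_pulled  -- parameters the worker used to compute `grad`
    grad      -- stochastic gradient evaluated at w_pulled
    velocity  -- momentum buffer
    staleness -- number of updates applied since the worker pulled
    """
    # First-order compensation: approximate the gradient at w_now using a
    # cheap diagonal surrogate of the Hessian (grad * grad).
    compensation = lam * grad * grad * (w_now - w_pulled)
    # Hypothetical dynamic weighting: trust delayed information less as the
    # staleness grows (this is the "dynamic" part DDC-ASGD refers to).
    alpha = 1.0 / (1.0 + staleness)
    corrected = grad + alpha * compensation
    velocity = momentum * alpha * velocity + corrected
    return w_now - lr * velocity, velocity

# Example call with random data:
rng = np.random.default_rng(0)
w, v = rng.normal(size=8), np.zeros(8)
w_new, v = delay_compensated_update(w, w - 0.05 * rng.normal(size=8),
                                    rng.normal(size=8), v, staleness=3)
```

The design intent illustrated here is simply that both the momentum term and the compensation term are scaled down as the measured delay grows, so highly stale contributions perturb the global parameters less.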
Keywords/Search Tags:deep learning, distributed training, stochastic gradient descent, heterogeneous cluster