
Research On Parallel Optimization For Deep Learning Algorithms And Applications

Posted on: 2020-01-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Shan    Full Text: PDF
GTID: 1368330611493058    Subject: Computer Science and Technology
Abstract/Summary:
Deep learning algorithms learn hierarchical representations of features in big data through multi-layered network structures. These hierarchical features allow computers to build complex concepts from simpler ones. The difference between deep learning and traditional machine learning is that deep learning extracts feature representations from big data effectively and thereby achieves high processing performance. Owing to this efficiency, deep learning algorithms have achieved great success in applications such as speech recognition, image classification, natural language processing, and video recommendation. However, as deep learning models become more sophisticated and the amount of training data grows, optimizing deep learning algorithms becomes a challenge. With the development of high-performance computing, heterogeneous computing has played an important role in the rise of deep learning, and parallel optimization algorithms that distribute deep learning training across the computing nodes of a cluster have received much attention in recent years. However, because deep learning optimization algorithms are inherently sequential and existing parallel optimization algorithms have shortcomings, the scale to which training can be parallelized is limited. This dissertation studies parallel optimization algorithms and training acceleration techniques that speed up deep learning model training and support a variety of deep learning based applications. The main work and innovations are summarized as follows.

(1) A delay compensated asynchronous Adam algorithm for deep neural networks. We propose the delay compensated asynchronous Adam (DC-Adam) algorithm for training DNNs. In particular, DC-Adam updates the parameters with the moment increment, i.e., the ratio of the first moment to the second moment, to retain the advantage of the original Adam algorithm, and it compensates each delayed gradient with the first-order term of its Taylor expansion. Since the delay compensation technique reduces the error of delayed gradients, and the moment increment further counteracts the influence of the approximate compensation, DC-Adam converges much more rapidly than ASGD on a computer cluster with a moderate number of computing nodes. We theoretically analyze the ergodic convergence rate of DC-Adam and compare it with DC-ASGD. At the same time, DC-Adam requires only a small amount of extra memory and is easy to implement, and it can be applied to the parallel training of a variety of deep neural network models.
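To make the update rule concrete, the following is a minimal sketch of one server-side DC-Adam-style step, assuming NumPy arrays and the diagonal g*g approximation of the Hessian for the compensation term; the function name, its signature, and the hyper-parameter lam are illustrative assumptions, not the dissertation's implementation.

import numpy as np

def dc_adam_step(w_server, w_stale, g_delayed, m, v, t,
                 lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.04):
    """One hypothetical DC-Adam-style parameter-server update (sketch).

    g_delayed is the gradient a worker computed at the stale parameters
    w_stale; the server compensates it toward the current parameters
    w_server with a first-order Taylor term and then applies an
    Adam-style moment-increment update. t is the 1-based step count.
    """
    # Delay compensation: first-order Taylor correction, using the
    # diagonal g*g approximation of the Hessian.
    g_comp = g_delayed + lam * g_delayed * g_delayed * (w_server - w_stale)

    # Adam moment estimates of the compensated gradient.
    m = beta1 * m + (1.0 - beta1) * g_comp
    v = beta2 * v + (1.0 - beta2) * g_comp * g_comp
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Moment increment: first moment divided by the root of the second.
    w_new = w_server - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w_new, m, v

In such a scheme, a worker would send its delayed gradient together with the stale parameter copy it used, and the server would apply this step to its current parameters asynchronously.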
(2) A two-level parallel stochastic gradient descent algorithm for deep neural networks. Although DC-Adam is efficient, the influence of the delayed gradient still exists: there remains some error between the compensated gradient and the exact gradient, so when the number of computing nodes grows beyond a certain point the trained model loses some accuracy. The synchronous SGD (SSGD) algorithm does not suffer from the gradient delay problem, but its synchronization overhead increases with the number of computing nodes, and without a proper training strategy SSGD also loses accuracy as the number of nodes increases. In order to extend training to a larger number of computing nodes, this dissertation presents a two-level parallel algorithm that accelerates the training of deep neural networks by adding group servers. The DC-Adam and SSGD algorithms are combined effectively so that the disadvantages of each are reduced, and the two-level parallel algorithm therefore extends training to a larger number of computing nodes.

(3) An asynchronous consistent stochastic gradient descent algorithm for deep neural networks. We propose the asynchronous consistent stochastic gradient descent (ACSGD) algorithm to parallelize the optimization of deep neural networks effectively. ACSGD combines the advantages of DC-Adam and variance reduction. In ACSGD, the parameters are updated by the moment increment of the consistent gradient. The consistent gradient means that the delayed gradient is first delay compensated and then variance reduced, which makes it more stable and closer to the full gradient; the moment increment adaptively determines the learning rate, which further reduces the influence of the remaining deviation on convergence (a schematic sketch of the consistent gradient appears at the end of this summary). By combining delay compensation, variance reduction, and an adaptive learning rate, ACSGD achieves much higher parallel efficiency and a faster convergence rate than ASGD when implemented on a computer cluster with a moderate number of computing nodes.

(4) Heterogeneous acceleration of CNN training with Many Integrated Core (MIC) coprocessors. Besides improving the scalability of deep learning training, this dissertation also improves the ability of a single computing node to compute gradients. For convolutional neural networks (CNNs), it presents a training acceleration method based on CPU + MIC heterogeneous computing, which exploits the parallel computing resources of modern supercomputers. Profiling the per-layer computation time of CNNs in the Caffe framework shows that the convolution layers dominate the overall computation, so the CPU + MIC heterogeneous method is used to accelerate the convolution part. To make full use of the hardware threads provided by the MIC coprocessor, two levels of thread parallelism are set up, namely data threads and MKL threads, and the thread-setting scheme is analyzed theoretically and verified in practice. The computational hotspot of CNN training, the convolution operation, is parallelized and offloaded to the MIC coprocessor for execution, using OpenMP and Intel MKL to maximize the parallel computing resources the coprocessor provides.

(5) A parallel training technique for cross-modal information retrieval. Cross-modal retrieval is an important application of deep learning algorithms. With the development of deep learning, cross-modal hashing based on deep learning has drawn much attention. Deep cross-modal hashing exploits the powerful feature extraction capability of deep neural networks in combination with cross-modal hashing, which greatly improves cross-modal retrieval accuracy. However, training a deep cross-modal hashing model is challenging, because it combines the neural networks that extract features from the two modalities and therefore has a larger parameter scale than a single neural network. By analyzing the training process of the DCMH algorithm, this dissertation applies the DC-Adam algorithm to parallelize its training, which greatly improves training speed while preserving retrieval accuracy.
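The sketch below illustrates the consistent gradient idea from contribution (3): a delayed stochastic gradient is first delay compensated toward the current server parameters and then variance reduced against a periodically refreshed snapshot model, SVRG-style. The function names, the hyper-parameter lam, and the exact SVRG-style form of the variance reduction are assumptions made for illustration, not the dissertation's precise formulation.

def consistent_gradient(grad_fn, batch, w_server, w_stale,
                        w_snapshot, full_grad_snapshot, lam=0.04):
    """Sketch of a 'consistent gradient' for an ACSGD-style update.

    grad_fn(w, batch) returns the stochastic gradient of the loss on
    `batch` at parameters `w`; full_grad_snapshot is the full gradient
    evaluated at the snapshot parameters w_snapshot. All arguments are
    assumed to be NumPy arrays or callables over NumPy arrays.
    """
    # Delayed gradient computed by a worker at stale parameters.
    g_delayed = grad_fn(w_stale, batch)

    # Step 1: first-order delay compensation toward the current
    # server parameters (diagonal g*g approximation of the Hessian).
    g_dc = g_delayed + lam * g_delayed * g_delayed * (w_server - w_stale)

    # Step 2: SVRG-style variance reduction against the snapshot model.
    g_consistent = g_dc - grad_fn(w_snapshot, batch) + full_grad_snapshot
    return g_consistent

In an ACSGD-style scheme, the consistent gradient would then feed an Adam-like moment-increment update, as in the earlier DC-Adam sketch.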
Keywords/Search Tags:Deep Learning, SGD, SSGD, ASGD, Parallel Computing, Delay Compensation, Adam, Variance Reduction, Consistent Gradient, Heterogeneous Computing, Cross-Modal Information Retrieval