
Research On Parallel Algorithm Of Convolutional Neural Network Based On GPU

Posted on: 2019-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Cui
Full Text: PDF
GTID: 2428330548995000
Subject: Computer Science and Technology

Abstract:
At present, the convolutional neural network is one of the focuses of deep learning research; with its excellent recognition performance, it has received more and more attention. Because the training process of a convolutional neural network involves a large number of parameters and tens of millions of calculations, it takes a great deal of time to obtain a usable network, so GPU acceleration is usually employed for training. However, the GPU microarchitecture is complex, the training process involves many parameters, the exchange speed with the computer's main memory is limited, and the programming often fails to exploit the characteristics of the GPU, so the training process cannot make full use of the GPU's computing performance. Aiming at the problem that the convolution training process cannot fully utilize GPU computing performance, this thesis optimizes the parallel algorithm of the convolutional neural network on a Kepler-architecture K40 GPU. The main research work of this thesis includes the following two aspects.

First, this thesis presents an improved convolution matrix multiplication algorithm. The training process of the convolutional neural network is studied and the convolution calculation formula is derived, taking convolution matrix multiplication as the object of optimization. The features of the GPU microarchitecture are then analyzed and several performance indexes are given; the key factors affecting GPU performance are shared memory, the register file, and so on. Next, the task partitioning of convolution matrix multiplication is analyzed, and the algorithm is implemented with the CUDA programming framework. Finally, the validity and correctness of the algorithm are verified by experiments.

Second, based on the improved convolution matrix multiplication algorithm, a loop unrolling method is proposed. By analyzing the conditions under which loop unrolling applies and the factors that influence the unroll factor, the unrolling condition ensures that the kernel is launched in a more favorable configuration, while the influence factors determine the size of the unroll factor. To determine an effective unroll factor, this thesis designs and implements the loop unrolling procedure and finds the optimal factor through testing. Finally, the validity and correctness of the method are verified by experiments.

Experiments show that the convolution matrix multiplication algorithm based on the improved performance indexes is effective, reaching a computational performance of 2115 GFLOPS. Combining this algorithm with the loop unrolling method raises the GPU computing performance to 2238 GFLOPS, a clear improvement over the performance before optimization. The optimized convolution matrix multiplication is applied to the convolutional neural network training process: relative to the Caffe library, the implementation in this thesis achieves an average speedup of 1.91, and relative to cuda-convnet it achieves an average speedup of 0.98. Therefore, the optimization presented in this thesis has considerable practical value.
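The abstract describes the technique only at a high level; the following CUDA sketch illustrates the general approach it names: convolution lowered to a matrix multiplication (a filter matrix times an im2col-expanded input matrix) computed by a kernel that stages tiles in shared memory and accumulates in registers. The kernel name matMulTiled, the tile width TILE = 32, and the row-major layouts are illustrative assumptions, not the author's actual implementation or K40 tuning.

#include <cuda_runtime.h>

#define TILE 32  // assumed tile width; the thesis tunes such parameters against the K40's shared memory and registers

// Tiled matrix multiplication C = A * B, where A is (M x K) and B is (K x N).
// For convolution, A would hold the filter weights and B the im2col-expanded input.
__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // per-thread accumulator kept in a register

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B, with bounds checks for edge tiles.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Multiply the two tiles; this inner loop is a natural target for unrolling.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}

A launch would use dim3 block(TILE, TILE) and a grid of ((N + TILE - 1) / TILE, (M + TILE - 1) / TILE) blocks; the shared-memory and register usage per block are exactly the kind of performance indexes the thesis analyzes when choosing the task partition.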
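The loop unrolling step can be illustrated on the inner-product loop of a kernel like the one above, shown here on contiguous arrays for simplicity. This is a minimal sketch of manual unrolling with independent accumulators; the unroll factor 4 and the helper name dotUnrolled4 are assumptions for illustration (the thesis determines the effective unroll factor experimentally), and TILE is assumed to be a multiple of the unroll factor. The directive #pragma unroll 4 would request the same transformation from the compiler.

// Manually unrolled inner-product loop, processing four multiply-adds per iteration.
__device__ float dotUnrolled4(const float* __restrict__ a,
                              const float* __restrict__ b) {
    // Four independent accumulators expose instruction-level parallelism,
    // at the cost of extra registers per thread (one of the performance
    // indexes discussed above).
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int k = 0; k < TILE; k += 4) {
        acc0 += a[k + 0] * b[k + 0];
        acc1 += a[k + 1] * b[k + 1];
        acc2 += a[k + 2] * b[k + 2];
        acc3 += a[k + 3] * b[k + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}

Larger unroll factors reduce loop overhead but increase register pressure, which can lower occupancy; this trade-off is why the thesis searches for the optimal factor by testing rather than fixing it in advance.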
Keywords/Search Tags: GPU Microarchitecture, Convolutional Neural Network, Convolution Matrix Multiplication, Loop Unrolling