Deep neural networks currently play an important role in image processing, speech recognition, and natural language processing. However, the large volume of training data makes model training slow. In the past, learning was accelerated by increasing the number of machines; today, with the growing memory capacity and computational power of GPUs (graphics processing units), training is usually performed on GPUs. Model size, however, is constrained by the limited memory of a single GPU: larger models often cannot be stored on one GPU, so neural networks with too many parameters cannot be trained there.

To address the low efficiency of deep neural network training, this paper proposes training deep neural network models in parallel on multiple GPUs. Parallel training is optimized in three respects. First, the model is divided into two parts stored on two GPUs respectively, so that the two parts can be computed in parallel. Second, the parallel training scheme is matched to the network structure: the convolutional layers use data parallelism, while the fully connected layers use model parallelism. At the same time, memory access to the training data is optimized by adding a data conversion layer to the parallel model structure to integrate or exchange data on the GPUs. Third, because the training datasets are large, a parallel mini-batch training method is used to optimize data processing. By combining data parallelism with model parallelism and relying on the strong collaborative computing capability of multiple GPUs, the parallel training of deep neural network models is accelerated.

Experiments were conducted on the MNIST, CIFAR10, and CAR datasets under the Linux operating system and the CUDA programming environment. The results show that, with comparable accuracy, the multi-GPU parallel training method improves training efficiency by 20-30% compared with Caffe and yields a smaller training loss. Finally, the parallel training method has been successfully applied to an automatic vehicle recognition system.
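To make the hybrid scheme concrete, the following is a minimal CUDA sketch of the model-parallel fully connected layer: each of two GPUs stores half of the layer's weight matrix and computes half of the output neurons, with the full input activation replicated on both devices, as it would be after the data conversion layer. All names and dimensions here (fc_forward, IN_DIM, OUT_DIM) are hypothetical illustrations, not the paper's actual implementation, which would presumably use optimized libraries such as cuBLAS/cuDNN and overlap transfers with computation.

// fc_model_parallel.cu -- illustrative sketch only, not the authors' code.
// Build: nvcc fc_model_parallel.cu -o fc_model_parallel (requires 2 GPUs)
#include <cuda_runtime.h>
#include <stdio.h>

#define IN_DIM  1024   // hypothetical input width of the FC layer
#define OUT_DIM 512    // hypothetical output width, split across 2 GPUs

// Naive kernel: each thread computes one output neuron
// y[o] = sum_i W[o][i] * x[i]
__global__ void fc_forward(const float *W, const float *x, float *y,
                           int in_dim, int out_dim) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o < out_dim) {
        float acc = 0.0f;
        for (int i = 0; i < in_dim; ++i)
            acc += W[o * in_dim + i] * x[i];
        y[o] = acc;
    }
}

int main(void) {
    const int half = OUT_DIM / 2;      // each GPU owns half the neurons
    float *W[2], *x[2], *y[2];
    float h_x[IN_DIM], h_y[OUT_DIM];
    for (int i = 0; i < IN_DIM; ++i) h_x[i] = 1.0f / IN_DIM;

    // Each GPU stores only its half of the weight matrix (model
    // parallelism); the full input activation is replicated on both
    // devices. Weights are zero-initialized for brevity; a real trainer
    // would load learned parameters.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&W[d], half * IN_DIM * sizeof(float));
        cudaMalloc(&x[d], IN_DIM * sizeof(float));
        cudaMalloc(&y[d], half * sizeof(float));
        cudaMemset(W[d], 0, half * IN_DIM * sizeof(float));
        cudaMemcpy(x[d], h_x, IN_DIM * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    // Launch both halves; kernel launches are asynchronous, so the two
    // GPUs compute their output partitions concurrently.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        fc_forward<<<(half + 255) / 256, 256>>>(W[d], x[d], y[d],
                                                IN_DIM, half);
    }

    // Gather the two partial output vectors back on the host.
    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaMemcpy(h_y + d * half, y[d], half * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    printf("y[0] = %f, y[%d] = %f\n", h_y[0], OUT_DIM - 1,
           h_y[OUT_DIM - 1]);

    for (int d = 0; d < 2; ++d) {
        cudaSetDevice(d);
        cudaFree(W[d]); cudaFree(x[d]); cudaFree(y[d]);
    }
    return 0;
}

A note on the design choice this sketch assumes: splitting the weight matrix by output neurons means the forward pass needs no cross-GPU reduction; only the two partial output vectors must be gathered and redistributed, which is presumably the kind of data integration and exchange the paper's data conversion layer performs on the GPU.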