
Research On Acceleration And Storage Optimization Of Convolutional Neural Network

Posted on: 2020-11-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S J Li
Full Text: PDF
GTID: 1368330611492972
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, deep learning has been applied deeply and widely across application domains including, but not limited to, industry, services, medicine, and the military. Artificial intelligence (AI) methods can now outperform humans in some areas, and deep learning has therefore become a research hotspot in both academia and industry. Optimizing existing deep learning algorithms so that they run efficiently on current hardware is the key to whether an algorithm can be applied in real life stably and maturely. This dissertation therefore studies acceleration and optimization methods for storage and computation in convolutional neural networks (CNNs), analyzes the network and computational characteristics of CNNs from different aspects, and analyzes, solves, and validates computation and storage optimization problems in several representative CNNs. The main contents and innovations include:

· Researching and analyzing CNN storage optimization algorithms based on block matrix decomposition (Chapter 2). We propose three efficient approaches for running convolutional ELM-LRF on a GPU platform: a blocked LU decomposition algorithm, a blocked Cholesky decomposition algorithm, and a heterogeneous blocked CPU-GPU parallel algorithm. Our work can be summarized as follows. First, all three algorithms address the problem that traditional ELM-LRF cannot solve large-scale Moore-Penrose Matrix Inversion (MPMI) problems because of the limited global memory on a GPU device. Second, the blocked Cholesky decomposition algorithm accelerates MPMI by exploiting a matrix property of the ELM-LRF model (the H'H matrix is positive definite); experimental results indicate that it achieves about a 2x speedup over the blocked LU decomposition algorithm. Third, the heterogeneous blocked CPU-GPU algorithm makes full use of the resources on a GPU node to accelerate MPMI; experimental results show that its performance is 5%-10% higher than that of the blocked Cholesky decomposition algorithm.

· Proposing a virtual mixed-memory management algorithm for large-scale convolutional neural networks (Chapter 3). We propose a deep learning memory control strategy named mixed-memory Convolutional Neural Network (mmCNN). To the best of our knowledge, our work is the first to provide a complete solution for running inference with a network of any scale on an accelerator of any memory capacity. We transfer parts of the data between host and device so that the whole network appears to run on an accelerator with unlimited memory. On this basis, the work further optimizes the memory management policy to balance data transfer against computation: asynchronous data transfer hides the additional transfer time behind computation time, so the whole system runs more efficiently. In our experiments, we run a feed-forward CNN process within an extremely small memory footprint (as low as 5 MB) on a GPU platform, saving more than 90% of memory compared with the state-of-the-art related work "vDNN". Our work improves the scalability of interactive computation on memory-limited machines and makes interactive applications such as face recognition on local mobile devices possible.

· Proposing a fast GPU acceleration algorithm for convolutional neural networks based on image combination (Chapter 4). We present two scheduling algorithms to optimize the CNN feed-forward process. The first is an efficient image combination algorithm that accelerates the feed-forward process of a CNN: it improves the feed-forward speed of the entire network and raises the utilization of GPU cards. Because the combination parameter in the algorithm directly affects system performance, we also propose a parameter-training algorithm that, given the entire network architecture and a particular experimental platform, obtains an appropriate set of parameters in a short time and provides good performance. Besides speed, the CNN feed-forward procedure faces a scalability challenge: training a CNN model consumes considerable GPU memory as network depth increases, and using a CNN to detect image targets on a GPU card with limited memory is difficult. Accordingly, we propose a light-memory-cost algorithm that can handle a large-scale CNN model while sacrificing speed only insignificantly. Experiments show that our methods work well on different platforms and achieve impressive speedups: nearly 1.7x on large images and 7x on small images.

· Proposing fully GPU-based batch multi-task cascaded convolutional networks (Chapter 5). We propose a fully GPU-based batch multi-task cascaded convolutional network in which every step is carefully designed and optimized for superior speed. In addition, we present a novel parallel memory allocation strategy that further enables our algorithm to support batch operation, so that system throughput increases significantly. In our experiments, we run the feed-forward CNN process on 480p images at 300 fps, improving inference performance by more than 600% over the state-of-the-art related work "MTCNN". Our work implements face detection far beyond real-time performance, making the application more practical and powerful in many high-throughput situations.
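The blocked Cholesky idea of Chapter 2 can be illustrated with a minimal NumPy sketch: the symmetric positive-definite matrix is factored panel by panel, so each panel could be staged through limited GPU global memory instead of requiring the whole matrix to be resident at once. This is a generic right-looking blocked factorization, not the dissertation's GPU implementation; the function name and block size are illustrative.

```python
import numpy as np

def blocked_cholesky(A, block=64):
    """Blocked Cholesky factorization A = L @ L.T for a symmetric
    positive-definite matrix (e.g. H'H in ELM-LRF), processed block
    by block so each panel fits in limited device memory."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, block):
        b = min(block, n - k)
        # Factor the diagonal block.
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        L_kk = A[k:k+b, k:k+b]
        if k + b < n:
            # Triangular solve for the panel below the diagonal block:
            # A_panel <- A_panel @ inv(L_kk).T
            A[k+b:, k:k+b] = np.linalg.solve(L_kk, A[k+b:, k:k+b].T).T
            # Symmetric rank-b update of the trailing submatrix.
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k+b:, k:k+b].T
    return np.tril(A)
```

Only one block column is active at a time, which is what allows a matrix larger than GPU memory to be factored by streaming panels between host and device.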
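The mixed-memory policy of Chapter 3 can be sketched as a residency manager: only a fixed budget of layer weights is kept "on device", weights are copied in on demand, and the least-recently-used layer is evicted. This is a simplified host-side simulation under assumed names (`MixedMemoryRunner` is not from the dissertation); a real implementation would additionally overlap the copies with computation via asynchronous transfers.

```python
import numpy as np
from collections import OrderedDict

class MixedMemoryRunner:
    """Toy sketch of mmCNN-style swapping: at most `budget` bytes of
    weights are resident at once; the least-recently-used layer is
    evicted to make room, standing in for device->host offload."""
    def __init__(self, layers, budget):
        self.layers = layers           # list of (name, weight ndarray) on "host"
        self.budget = budget           # device memory budget in bytes
        self.resident = OrderedDict()  # name -> weight currently "on device"

    def _fetch(self, name, w):
        if name in self.resident:
            self.resident.move_to_end(name)   # mark as recently used
            return
        # Evict LRU weights until the incoming tensor fits the budget.
        while sum(x.nbytes for x in self.resident.values()) + w.nbytes > self.budget:
            self.resident.popitem(last=False)
        self.resident[name] = w               # stands in for a host->device copy

    def forward(self, x):
        for name, w in self.layers:
            self._fetch(name, w)              # ensure weights are resident
            x = np.maximum(x @ self.resident[name], 0.0)  # toy ReLU layer
        return x
```

Because each layer only needs its own weights resident during its computation, the footprint stays at the budget regardless of total network size, which is the property that lets an arbitrarily deep network run in a few megabytes.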
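The image combination idea of Chapter 4 amounts to tiling several small inputs into one large canvas so a single large convolution pass replaces many small, under-utilizing passes. A minimal sketch of the packing step (the function name and the omitted inter-tile halo are illustrative assumptions):

```python
import numpy as np

def combine_images(images, grid_cols, pad=0.0):
    """Tile same-sized small images into one canvas so one convolution
    pass processes them all, improving GPU utilization. `pad` fills
    unused cells; a real version would add a halo between tiles so
    filter responses do not leak across image boundaries."""
    h, w = images[0].shape
    rows = (len(images) + grid_cols - 1) // grid_cols
    canvas = np.full((rows * h, grid_cols * w), pad, dtype=images[0].dtype)
    for i, img in enumerate(images):
        r, c = divmod(i, grid_cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = img
    return canvas
```

The grid width plays the role of the combination parameter the chapter tunes: too few tiles leaves the GPU idle, too many inflates per-pass memory, which is why a parameter-training step per platform is useful.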
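The parallel memory allocation strategy of Chapter 5 can be sketched as a preallocated buffer pool: all per-stage buffers for a batch are carved out of one slab up front, so no allocation happens inside the detection loop (on a GPU this avoids repeated cudaMalloc/cudaFree calls that would serialize a batched cascade). The class below is an assumed host-side illustration, not the dissertation's allocator.

```python
import numpy as np

class BufferPool:
    """Bump allocator over one preallocated slab, mimicking how batch
    pipeline buffers can be reserved once and reused every frame."""
    def __init__(self, total_bytes):
        self.slab = np.empty(total_bytes, dtype=np.uint8)
        self.offset = 0

    def alloc(self, shape, dtype=np.float32):
        # Round up to 64-byte alignment, as a device allocator would.
        self.offset = (self.offset + 63) & ~63
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        if self.offset + nbytes > self.slab.nbytes:
            raise MemoryError("buffer pool exhausted")
        view = self.slab[self.offset:self.offset + nbytes].view(dtype).reshape(shape)
        self.offset += nbytes
        return view
```

Resetting `offset` to zero between frames reuses the same memory for every batch, which is what keeps throughput high in a 300 fps pipeline.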
Keywords/Search Tags:Deep learning, Convolutional neural network, GPU computation optimization, GPU storage optimization, Face detection