
Inference Optimization Of Deep Neural Network Based On Roofline Performance Model

Posted on: 2022-04-10 | Degree: Master | Type: Thesis
Country: China | Candidate: Z T Zhang | Full Text: PDF
GTID: 2518306605971749 | Subject: Circuits and Systems
Abstract/Summary:
In recent years, deep neural networks have achieved great success and have been widely applied in fields such as autonomous driving, speech recognition, face recognition, object detection, and semantic segmentation. Because the graphics processing unit (GPU) has clear advantages in computing power, memory bandwidth, and energy efficiency, GPUs have become an important means of accelerating deep neural network training and inference. As deep neural networks have developed, they demand ever more memory and make both training and inference more time-consuming. In the training phase, large-scale clusters can be used, and real-time performance and computing-resource constraints generally need not be considered. In practical applications, however, the trained model is used for inference, and in many scenarios the available GPU lacks the computing power and memory resources to satisfy the memory requirements and real-time demands of neural network inference. Therefore, in application scenarios where the memory of the computing device is insufficient to support inference, or where strict limits are placed on inference latency, reducing the complexity and inference latency of a neural network while preserving its accuracy has become a research hotspot. At the same time, when a neural network runs on a GPU, the choice of kernel configuration parameters is also an important factor affecting inference performance, and efficiently finding configuration parameters that let a deep neural network run at its best in the inference stage has become an interesting research direction. This thesis starts from these two aspects, guided by the Roofline performance model, which bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity, to address the high memory requirements and long inference latency that deep neural networks face during GPU inference. The main contributions are as follows:

(1) A convolutional neural network pruning algorithm based on a genetic algorithm is proposed. The thesis introduces the algorithm in detail, from problem abstraction through algorithm design and implementation, then describes the implementation details and parameter configuration and, on several data sets, compares the performance of classic neural network models against previous work in related fields. Taking VGG16 on the CIFAR-10 data set as an example, the algorithm improves model accuracy by 0.17% while pruning 73.05% of the computation and 91.06% of the parameters, shortening inference time by 35.2%.
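The abstract gives no implementation details for the pruning search, so the following Python sketch shows one common way a genetic algorithm can drive channel pruning: an individual is a vector of per-layer keep-ratios, and fitness trades accuracy against saved computation. All names and constants here (evaluate, NUM_LAYERS, the synthetic fitness) are illustrative assumptions, not the thesis's code.

```python
import random

# Hypothetical sketch: a genetic algorithm searching per-layer pruning
# ratios. An individual is a list of keep-ratios, one per prunable layer.
NUM_LAYERS = 13          # e.g. the convolutional layers of VGG16
POP_SIZE = 20
GENERATIONS = 30
MUTATION_RATE = 0.1

def evaluate(individual):
    """Placeholder fitness. The real algorithm would prune the network
    with these keep-ratios, briefly fine-tune it, and score accuracy
    against saved FLOPs; a synthetic trade-off keeps this runnable."""
    mean_keep = sum(individual) / len(individual)
    accuracy = 0.9 - 0.3 * (1.0 - mean_keep) ** 2   # drops as pruning grows
    flops_saved = 1.0 - mean_keep
    return accuracy + 0.5 * flops_saved              # accuracy vs. FLOPs

def crossover(a, b):
    # Single-point crossover between two parents.
    point = random.randrange(1, NUM_LAYERS)
    return a[:point] + b[point:]

def mutate(ind):
    # Perturb each keep-ratio with small probability, clamped to [0.05, 1].
    return [min(1.0, max(0.05, r + random.uniform(-0.1, 0.1)))
            if random.random() < MUTATION_RATE else r for r in ind]

population = [[random.uniform(0.1, 1.0) for _ in range(NUM_LAYERS)]
              for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    scored = sorted(population, key=evaluate, reverse=True)
    elite = scored[:POP_SIZE // 4]                   # keep the best quarter
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(POP_SIZE - len(elite))]
    population = elite + children

best = max(population, key=evaluate)
print("best per-layer keep-ratios:", [round(r, 2) for r in best])
```

In a real pipeline, evaluate would prune and briefly fine-tune the network; the elitist selection plus crossover-and-mutation loop is the generic search skeleton that a pruning formulation like the thesis's would plug into.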
(2) A GPU parameter auto-tuning framework based on Bayesian optimization (GAFB) is proposed. The factors that affect GPU program performance are parameterized, and a Bayesian optimization algorithm searches for the best configuration. Bayesian optimization uses the measured runtime of each sample as a prior to guide the next sample, so good configuration parameters can be obtained from few samples. GAFB is evaluated on four classic image-processing operators and compared against other optimization algorithms, showing that it reaches satisfactory results with fewer samples. Finally, GAFB is tested on the convolutional layers of AlexNet; compared with the original parameter configuration, AlexNet's inference speed improves by 50.09%.
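To make the sampling loop concrete, here is a rough Python sketch of Bayesian optimization over CUDA launch parameters, using a Gaussian-process surrogate (scikit-learn) and an expected-improvement acquisition over a small grid of block sizes. It is not GAFB itself: measure_runtime is a synthetic stand-in for timing a real kernel, and the parameter grid and sampling budget are assumptions.

```python
import itertools
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical search space: (block_x, block_y) launch parameters.
CANDIDATES = np.array(list(itertools.product([4, 8, 16, 32], [4, 8, 16, 32])))

def measure_runtime(block):
    """Stand-in for launching the kernel with this block size and timing
    it; here a synthetic bowl with its minimum near (16, 8)."""
    bx, by = block
    return (np.log2(bx) - 4) ** 2 + (np.log2(by) - 3) ** 2 + 1.0

rng = np.random.default_rng(0)
# Seed with a few random samples: the prior observations.
idx = rng.choice(len(CANDIDATES), size=3, replace=False)
X = CANDIDATES[idx].astype(float)
y = np.array([measure_runtime(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                              normalize_y=True)
for _ in range(10):                        # sampling budget
    gp.fit(X, y)
    mu, sigma = gp.predict(CANDIDATES.astype(float), return_std=True)
    sigma = np.maximum(sigma, 1e-9)        # avoid division by zero
    best = y.min()
    # Expected improvement: prefer points likely to beat the best runtime.
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    nxt = CANDIDATES[np.argmax(ei)]
    X = np.vstack([X, nxt.astype(float)])
    y = np.append(y, measure_runtime(nxt))

print("best block size:", X[np.argmin(y)], "runtime:", y.min())
```

The role of the prior is visible in the loop: each measured runtime refits the surrogate, so the next sample is steered toward configurations predicted to beat the current best, which is why few samples suffice.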
Keywords/Search Tags: Model Compression, Auto-tuning, GPU, Bayesian Optimization, Inference, Genetic Algorithm