Research On AI Accelerated Inference Technology Based On Embedded GPU

Posted on: 2022-10-13; Degree: Master; Type: Thesis
Country: China; Candidate: J S Huang; Full Text: PDF
GTID: 2518306485956729; Subject: Computer technology
Abstract/Summary:
With the rapid development of computer vision, a variety of state-of-the-art algorithms have emerged for common visual tasks. However, while new ideas and tricks keep pushing the upper limit of accuracy, speed is often ignored. Convolutional neural networks (CNNs) usually have a large number of parameters and therefore occupy considerable storage space and computing resources. Because of constraints on power consumption, size, and similar factors, embedded devices in practical applications often have limited memory and computing power, so deep learning applications cannot be deployed well in such scenarios. During training, the model weights are updated continuously through iterative learning. In the inference stage the parameters are fixed, so the redundant parameters that help the network converge are no longer needed. Removing them often has little effect on model performance but can greatly reduce the amount of computation, and with it the memory consumption and time cost of inference. Inference speed largely determines how practical an algorithm is and whether it runs in real time when deployed on various hardware platforms. At the application level of computer vision tasks, therefore, more attention should be paid to inference speed and to the portability and reliability of the algorithm on embedded hardware platforms. Based on this analysis, this thesis studies the following:

(1) In the forward inference of a CNN, the convolutional layers and fully connected layers (implemented as convolution) account for a huge amount of computation and are the main factor in inference time. An efficient convolution implementation is therefore key to reducing that time. At present, convolution in CNNs is implemented mainly in three ways: im2col + GEMM, Winograd, and FFT. The first two are realized through matrix multiplication, so the implementation and optimization of matrix multiplication on both the CPU and the GPU are studied, with efficiency improved from two aspects: computation and memory access. On the CPU side, the optimizations are memory alignment, matrix partitioning, the use of registers to reduce memory access time, and Single Instruction Multiple Data (SIMD) instruction-set acceleration. On the GPU side, CUDA parallel computing, matrix partitioning, shared memory, and other techniques are used. Finally, the cuBLAS library is called, which increases the number of floating-point operations per second by nearly 63 times.

(2) The acceleration of the YOLOv3 object detection algorithm with TensorRT is studied. TensorRT uses operator fusion, network quantization, and other strategies to accelerate forward inference. It also provides a C++ API, so it ports well across hardware platforms. The experiments compare the accuracy and speed of TensorRT-accelerated YOLOv3 and YOLOv3-tiny on the COCO dataset and a self-built UAV dataset. On the NVIDIA Jetson TX2 embedded GPU platform, acceleration reduced the average single-image inference time of YOLOv3 and YOLOv3-tiny to 50.7% and 43.0% of the original, respectively.

(3) The lightweight backbone network MobileNetV1 is introduced into object detection, with depthwise separable convolution replacing standard convolution to accelerate and compress the model. The MobileNetV1-YOLOv3 network is then pruned: the filters of each convolutional layer are ranked by importance according to their L1 norm. Filters with low scores are considered to contribute little to the detection results, so removing them reduces the amount of computation and improves inference speed without greatly affecting detection accuracy. The network is also quantized, compressing the model through low-precision data representation; quantization is applied mainly to the standard convolutions and the depthwise separable convolutions. In addition, CUDA-accelerated image preprocessing and multi-threaded prediction are adopted. Tests on the TX2 platform show that inference speed increases from 5.0 FPS to 21.97 FPS, which is 439.4% of the original. Through network quantization, the memory occupied by the model is reduced from 97.2M to 34.5M, which is 35.49% of the original.
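As an illustration of the first convolution strategy named in item (1), the following is a minimal pure-Python sketch of how a convolution can be rewritten as one matrix multiplication. It is not the thesis implementation: the function names (`im2col`, `gemm`, `conv2d_im2col`, `conv2d_direct`) are hypothetical, and a single input channel, stride 1, and no padding are assumed for brevity. The naive `gemm` stands in for the optimized CPU, CUDA, or cuBLAS routine the abstract describes.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one column.

    Assumes stride 1 and no padding. Returns the (kh*kw) x (out_h*out_w)
    patch matrix together with the output height and width.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = [[0.0] * (out_h * out_w) for _ in range(kh * kw)]
    for i in range(out_h):
        for j in range(out_w):
            col = i * out_w + j
            for di in range(kh):
                for dj in range(kw):
                    cols[di * kw + dj][col] = image[i + di][j + dj]
    return cols, out_h, out_w


def gemm(a, b):
    """Naive matrix product C = A * B; a tuned GEMM would replace this step."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(m):
                c[i][j] += a_ip * b[p][j]
    return c


def conv2d_im2col(image, kernels):
    """Apply several equally sized kernels to one image via im2col + GEMM.

    As is usual in CNNs, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # Flatten each kernel into one row of the weight matrix.
    weights = [[v for row in kernel for v in row] for kernel in kernels]
    flat = gemm(weights, cols)  # num_kernels x (out_h*out_w)
    # Fold each output row back into an out_h x out_w feature map.
    return [[r[i * out_w:(i + 1) * out_w] for i in range(out_h)] for r in flat]


def conv2d_direct(image, kernel):
    """Reference sliding-window version, used only to check the result."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
    maps = conv2d_im2col(image, kernels)
    # Both routes should produce the same feature maps.
    print(all(maps[n] == conv2d_direct(image, kernels[n]) for n in range(2)))
```

The unfolding step duplicates each input pixel in up to kh*kw columns, so this approach trades extra memory for a single large matrix product. That trade is why the efficiency of matrix multiplication largely determines the speed of the convolution.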
Keywords/Search Tags: Matrix multiplication, Object detection, Pruning, Quantization, Embedded GPU