
Research On Acceleration Of Convolutional Neural Networks On ARM Embedded Platforms

Posted on: 2020-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Li
Full Text: PDF
GTID: 2428330590983215
Subject: Computer technology
Abstract/Summary:
With the development of deep learning technology, the accuracy of deep learning algorithms keeps improving, and replacing traditional machine learning algorithms with deep learning is now widely accepted. Convolutional neural networks (CNNs) are an important branch of deep learning in computer vision and are widely used in image processing. In an era of rapidly evolving mobile devices, mobile applications occupy most of the application market. Deploying CNNs on mobile terminals not only broadens the range of mobile applications but also effectively reduces the cost of CNN deployment. However, the computing power of a mobile device differs greatly from that of desktop and server hardware: CNN inference on a mobile device is slow and cannot meet the response-time requirements of applications. Building a high-performance mobile CNN framework is therefore key to promoting CNN deployment on the mobile side.

This thesis uses depthwise separable convolution and model quantization to reduce the parameter count and computational cost of the network, and uses the ARMv7 CPU and Mali GPU of the mobile terminal to accelerate the forward pass of the CNN. On the CPU, assembly-level Neon instructions and OpenMP multi-core parallelism accelerate the convolution and activation-function layers; on the GPU, compute kernels are dispatched to the Mali GPU through the OpenCL heterogeneous parallel framework to accelerate the convolution and activation computations. The multiple computing resources of the mobile terminal are thus fully utilized to accelerate the forward pass.

On a Firefly RK3399 development board, the forward pass of an 8-bit quantized MobileNet-SSD general object detection network was implemented with these frameworks. The Mali T860 GPU completes one forward pass in 210 ms, two Cortex-A72 cores of the ARM CPU complete it in 260 ms, and the two devices computing serially complete it in 190 ms, giving a video processing frame rate of about 5 fps. Because the Mali GPU maps fewer threads under the OpenCL framework, and OpenCL vectorization accelerates the inner product between the sliding window and the convolution kernel, the Mali T860 GPU is faster on convolution layers whose input is spatially small but has many channels. Because 128-bit Neon instructions perform four 32-bit arithmetic-logic operations at a time, the ARM CPU is faster when the convolution layer's input has a large width and many channels. Assigning the network's layers serially between the Mali GPU and the ARM CPU plays to the strengths of both computing devices and accelerates the forward pass of the CNN. The sketches below illustrate the main techniques.
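First, depthwise separable convolution factors a standard convolution into a per-channel KxK convolution followed by a 1x1 pointwise convolution, cutting the weights from C_in*C_out*K*K to C_in*K*K + C_in*C_out. A minimal C sketch, assuming a CHW layout, stride 1, and no padding; this is illustrative, not the thesis's actual implementation:

#include <stddef.h>

/* Depthwise stage: each channel is convolved with its own KxK filter.
 * The OpenMP pragma mirrors the multi-core parallelism used on the CPU. */
static void depthwise_conv(const float *in, const float *dw_w, float *out,
                           int C, int H, int W, int K)
{
    int HO = H - K + 1, WO = W - K + 1;
    #pragma omp parallel for
    for (int c = 0; c < C; ++c)
        for (int y = 0; y < HO; ++y)
            for (int x = 0; x < WO; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        acc += in[(c * H + y + ky) * W + x + kx] *
                               dw_w[(c * K + ky) * K + kx];
                out[(c * HO + y) * WO + x] = acc;
            }
}

/* Pointwise stage: a 1x1 convolution mixes the channels. */
static void pointwise_conv(const float *in, const float *pw_w, float *out,
                           int C_in, int C_out, int H, int W)
{
    for (int co = 0; co < C_out; ++co)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (int ci = 0; ci < C_in; ++ci)
                    acc += in[(ci * H + y) * W + x] * pw_w[co * C_in + ci];
                out[(co * H + y) * W + x] = acc;
            }
}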
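Model quantization stores weights and activations as 8-bit integers. The abstract does not spell out the exact scheme, so the sketch below assumes the common affine mapping real = scale * (q - zero_point):

#include <math.h>
#include <stdint.h>

/* Affine 8-bit quantization (assumed scheme, not confirmed by the thesis). */
static uint8_t quantize(float x, float scale, int32_t zero_point)
{
    int32_t q = (int32_t)lroundf(x / scale) + zero_point;
    if (q < 0)   q = 0;     /* clamp to the uint8 range */
    if (q > 255) q = 255;
    return (uint8_t)q;
}

static float dequantize(uint8_t q, float scale, int32_t zero_point)
{
    return scale * ((int32_t)q - zero_point);
}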
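On the CPU side, a 128-bit Neon register holds four 32-bit floats, so one multiply-accumulate instruction performs four operations. A sketch of the sliding-window/kernel inner product using the standard arm_neon.h intrinsics (the thesis describes assembly-level Neon code; intrinsics are used here for readability):

#include <arm_neon.h>

/* Inner product of a sliding window with a convolution kernel: four
 * 32-bit floats per vmlaq_f32. The scalar tail handles lengths not
 * divisible by 4. Lane extraction is used because the horizontal add
 * vaddvq_f32 is AArch64-only and the thesis targets ARMv7. */
static float neon_dot(const float *a, const float *b, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    float sum = vgetq_lane_f32(acc, 0) + vgetq_lane_f32(acc, 1) +
                vgetq_lane_f32(acc, 2) + vgetq_lane_f32(acc, 3);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}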
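On the GPU side, the same inner product can be vectorized with OpenCL's float4 type. A hypothetical OpenCL C kernel in this spirit; the kernel name and argument layout are assumptions, not the thesis's code:

/* Each work-item produces one output element; float4 loads vectorize the
 * inner product between the input window and the convolution weights. */
__kernel void conv_dot(__global const float4 *window,
                       __global const float4 *weights,
                       __global float *out,
                       const int n4) /* inner-product length divided by 4 */
{
    int gid = get_global_id(0);
    float4 acc = (float4)(0.0f);
    for (int i = 0; i < n4; ++i)
        acc += window[gid * n4 + i] * weights[i];
    out[gid] = acc.x + acc.y + acc.z + acc.w;
}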
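Finally, the serial CPU/GPU cooperation amounts to a per-layer device choice. A sketch of the selection rule implied by the measurements above (GPU for spatially small, channel-heavy layers; CPU otherwise); the thresholds and types are hypothetical:

#include <stdbool.h>

typedef enum { DEV_CPU_NEON, DEV_GPU_MALI } device_t;

typedef struct {
    int width, height, channels;
} layer_shape_t;

/* Pick the device per layer; 32x32 and 128 channels are illustrative
 * cutoffs, not values reported by the thesis. */
static device_t pick_device(layer_shape_t s)
{
    bool small_spatial = s.width * s.height <= 32 * 32;
    bool many_channels = s.channels >= 128;
    return (small_spatial && many_channels) ? DEV_GPU_MALI : DEV_CPU_NEON;
}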
Keywords/Search Tags: heterogeneous parallel computing, single instruction multiple data (SIMD), CPU-GPU joint computation, depthwise separable convolution, model quantization