
Design And Implementation Of Convolutional Neural Network Accelerator Based On RISC-V

Posted on: 2022-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: H Xu
Full Text: PDF
GTID: 2518306602965249
Subject: Master of Engineering
Abstract/Summary:
In recent years, convolutional neural networks (CNNs), as one of the mainstream neural network models, have been widely used in computer vision because of their powerful feature-extraction and classification capabilities. However, as application demands and accuracy requirements grow, the computational complexity of CNNs keeps increasing, placing ever-higher demands on the computing power and data bandwidth of the hardware platform. CPUs, built on the von Neumann architecture with control-flow-driven execution and separated storage and computation, spend a large amount of effort on auxiliary work and frequent data movement, which severely limits both processing performance and the parallelism available to CNN workloads. Compared with CPUs, GPUs have an advantage in large-scale parallel computation, but their power consumption is too high; they are mostly used for CNN training and do not meet the power budget of terminal devices responsible for inference. This thesis therefore proposes a heterogeneous acceleration architecture for mobile embedded platforms: a CNN accelerator coupled with an open-source RISC-V processor.

This thesis first analyzes the computation characteristics of the convolutional and pooling layers of a CNN, and then applies loop-optimization techniques to both. Loop reordering improves data reuse and reduces redundant data movement; loop tiling partitions the network and enables hybrid parallelism across input and output feature maps; and loop unrolling decomposes the computation of the convolutional and pooling layers into finer-grained parallel vector operations. A dedicated storage structure and data-mapping method are proposed to support efficient execution on the loop-optimized hardware.

For the forward propagation of data between network layers, two working modes are proposed: independent and joint. Convolutional layers with the same computation type can run in independent mode and time-multiplex the convolution acceleration unit, while the joint mode chains a convolutional layer with a pooling layer of a different computation type to reduce repeated data loading and improve computational efficiency.

To improve computation performance, the compute array of the convolution unit is implemented with ping-pong buffers and a pipelined multiply-add tree. At the same time, the network parameters are quantized to fixed point, reducing the model size and simplifying the logic and energy cost of computation and memory access. Because the on-chip cache is limited, a software-hardware co-design scheme is proposed to adapt to CNNs of different scales: software partitions the network appropriately and maps it onto the hardware accelerator, preserving the accelerator's generality. Finally, the hardware implementation of the accelerator is presented, and the accelerator is integrated into the RISC-V processor's SoC system through the AXI4 and APB protocols.

The five convolutional layers and three pooling layers of the classic AlexNet network are used as test data in a behavioral simulation of the complete accelerator. The average utilization of the compute array is 81.44%, and the joint mode reduces the forward-propagation computation cycle count by 1.427%. Implemented on Xilinx's ZCU102 (XCZU9EG-2FFVB1156) FPGA platform, the design reaches a maximum operating frequency of 110 MHz; the compute array of the convolution unit consumes 315 DSPs, the on-chip cache consumes 57 BRAM blocks, and the peak computing power of the accelerator is 56.32 GOPS.
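The loop tiling described above can be illustrated with a minimal software sketch: the loops over output feature maps (M) and input feature maps (N) are split into tiles of size Tm and Tn, mirroring how a fixed-size compute array processes a large layer piece by piece. The dimensions and tile sizes here are illustrative only, not the thesis's actual accelerator parameters.

```python
def conv2d_naive(x, w):
    """x: [N][H][W] input maps, w: [M][N][K][K] kernels -> [M][H-K+1][W-K+1]."""
    N, H, W = len(x), len(x[0]), len(x[0][0])
    M, K = len(w), len(w[0][0])
    Ho, Wo = H - K + 1, W - K + 1
    y = [[[0.0] * Wo for _ in range(Ho)] for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for i in range(Ho):
                for j in range(Wo):
                    for ki in range(K):
                        for kj in range(K):
                            y[m][i][j] += w[m][n][ki][kj] * x[n][i + ki][j + kj]
    return y

def conv2d_tiled(x, w, Tm=2, Tn=2):
    """Same computation, with the m/n loops blocked into Tm x Tn tiles."""
    N, H, W = len(x), len(x[0]), len(x[0][0])
    M, K = len(w), len(w[0][0])
    Ho, Wo = H - K + 1, W - K + 1
    y = [[[0.0] * Wo for _ in range(Ho)] for _ in range(M)]
    for m0 in range(0, M, Tm):        # tile over output feature maps
        for n0 in range(0, N, Tn):    # tile over input feature maps
            # inside a tile, these bounded loops are the candidates
            # for hardware unrolling into parallel vector operations
            for m in range(m0, min(m0 + Tm, M)):
                for n in range(n0, min(n0 + Tn, N)):
                    for i in range(Ho):
                        for j in range(Wo):
                            for ki in range(K):
                                for kj in range(K):
                                    y[m][i][j] += w[m][n][ki][kj] * x[n][i + ki][j + kj]
    return y
```

The tiled version produces the same result as the naive one; its value in hardware is that each (Tm, Tn) tile touches only a bounded working set, which is what lets a partitioned network fit the limited on-chip cache.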
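The pipelined multiply-add tree mentioned above reduces a set of products with a balanced tree of adders rather than a sequential chain, so the reduction depth grows logarithmically with the number of inputs. The sketch below models only the tree's dataflow in software; the actual pipelining and operand widths are hardware details not specified here.

```python
def madd_tree(products):
    """Reduce a non-empty list of products with a pairwise adder tree.

    Each while-iteration models one tree level: adjacent pairs are summed
    in parallel hardware, so n inputs need about log2(n) levels instead of
    n - 1 sequential additions.
    """
    level = list(products)
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels with a zero operand
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

In a pipelined implementation each tree level becomes a pipeline stage, so a new group of products can enter the tree every cycle once the pipeline is full.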
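The fixed-point quantization step can be sketched as mapping real-valued parameters to signed integer codes with a fixed number of fractional bits, saturating on overflow. The 8-bit word width and 6 fractional bits below are illustrative assumptions; the thesis does not state its exact quantization configuration here.

```python
def to_fixed(value, frac_bits=6, word_bits=8):
    """Quantize a real value to a signed fixed-point integer code."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))          # e.g. -128 for 8-bit words
    hi = (1 << (word_bits - 1)) - 1       # e.g. +127
    code = round(value * scale)
    return max(lo, min(hi, code))         # saturate instead of wrapping

def from_fixed(code, frac_bits=6):
    """Recover the real value represented by a fixed-point code."""
    return code / (1 << frac_bits)
```

With 6 fractional bits the quantization step is 1/64 (about 0.016), which bounds the per-parameter rounding error while letting the datapath use narrow integer multipliers instead of floating-point units.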
Keywords/Search Tags:Convolutional Neural Network, RISC-V, Hardware accelerator, Optimization of Loop Operation, Parallel acceleration