
Design And FPGA Implementation Of Convolutional Neural Network Acceleration Module

Posted on: 2021-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Mei    Full Text: PDF
GTID: 2428330614967721    Subject: Engineering
Abstract/Summary:
With the development of Artificial Intelligence (AI), Convolutional Neural Network (CNN) algorithms have been widely applied in image processing, classification and recognition, target tracking, and other fields. For the hardware acceleration of CNNs, this thesis proposes a design scheme for CNN acceleration modules. Through the design of the corresponding data paths and hardware architecture on an FPGA development platform, the acceleration modules achieve high efficiency, high performance, and low BRAM resource consumption.

First, given the characteristics of common CNNs, the design focuses on accelerating the convolutional layers. In exploring the hardware design space of the convolutional layers, loop unrolling and parallel computing are analyzed, and the impact of data reuse on memory access is summarized. By analyzing how different partitioning methods affect the amount of External Memory Access (EMA) data, the output height and the output channel are selected as the convolution partitioning dimensions. A method for calculating the total amount of EMA data under different data reuse schemes is proposed, providing theoretical support for the design of caching and data reuse in the CNN acceleration structure.

Second, to accommodate the different types of network layers in common CNNs, a unified acceleration design scheme is proposed. In the data path, hierarchical caching and a double-buffer design are adopted to optimize EMA and reduce on-chip memory usage. A dedicated storage format stores input-channel and output-channel data preferentially, and the weights and input feature maps are reused at different stages of internal data transfer. In the hardware architecture, computation is parallelized across the input and output channels and mapped onto a multiply-add array, improving adaptability to different convolution sizes; a sketch of this channel-parallel array appears below. In the convolution acceleration module, a high-efficiency computing array combining a multiply-add tree with a systolic array is designed, and a 5-stage pipeline is introduced to improve acceleration performance. In the pooling acceleration module, the pooling operation is separated into row-wise and column-wise passes to reduce on-chip memory usage. A control module performs data reuse under the convolution partitioning and schedules the convolution and pooling modules to complete the acceleration work together. Experiments on the Xilinx VC707 development platform show that the proposed acceleration structure is versatile across common CNNs: when running the convolutional layers of AlexNet, it achieves a 5.3-fold performance improvement over Eyeriss, and it compares favorably with other FPGA acceleration schemes in DSP slice performance efficiency and BRAM resource consumption.
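To make the channel-parallel compute array concrete, the following minimal C sketch shows one step of the multiply-add array: Tm output-channel lanes each reduce Tn input-channel products through an adder tree and accumulate the result. The tile sizes Tn and Tm and all identifiers are illustrative assumptions, not interfaces taken from the thesis.

    /* One step of a channel-parallel multiply-add array (illustrative
     * sketch; Tn, Tm and all names are assumptions, not the thesis's). */
    #define Tn 8   /* input-channel parallelism  */
    #define Tm 16  /* output-channel parallelism */

    /* Each of the Tm output lanes sums Tn products through an adder
     * tree and accumulates into its running partial sum. */
    void mac_array(const float ifm[Tn], const float wgt[Tm][Tn],
                   float acc[Tm])
    {
        for (int m = 0; m < Tm; m++) {      /* unrolled across lanes */
            float sum = 0.0f;
            for (int n = 0; n < Tn; n++) {  /* adder-tree reduction  */
                sum += ifm[n] * wgt[m][n];
            }
            acc[m] += sum;
        }
    }

In an HLS flow, both loops would plausibly be fully unrolled and the enclosing sliding-window loop pipelined, so that each lane consumes one Tn-wide input vector per cycle once the pipeline fills; the 5-stage pipeline described above would then correspond to register stages inserted through the multiply-add tree.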
Finally, a specialized CNN accelerator design scheme is proposed. To optimize external memory access, a block cache design and data reuse strategy suited to the specific network are selected. In the optimized design, efficiency is improved by adding data processing on the first convolutional layer and by designing an independent fully-connected module, and fusing the pooling module into the convolution module reduces the total amount of data transferred from external memory. An acceleration system is built on the Xilinx VC707 development board, and experiments show performance improvements of 1.56 times and 2.57 times after these optimizations. Comparing the parallel and pipelined designs of the accelerator shows that when the pipelined acceleration is performed in stages on VGG-16, it outperforms the alternatives in DSP slice performance efficiency and BRAM consumption. Compared with the embedded GPU TX2, performance and energy efficiency improve by 1.7 times; compared with related FPGA-based solutions, the design achieves a 1.4-fold improvement in DSP slice performance efficiency and a 5.5-fold improvement in BRAM performance efficiency while saving more than 48% of BRAM resources.
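Both designs lean on double buffering to hide external memory latency behind computation. The following minimal C sketch shows the ping-pong idea; the buffer depth, tile count, and load/compute helpers are illustrative stand-ins, not the thesis's actual interfaces.

    #include <string.h>

    #define TILE_WORDS 1024

    /* Stand-in for a DMA burst from external DDR into an on-chip BRAM bank. */
    static void load_tile(float dst[TILE_WORDS], const float *ext, int t)
    {
        memcpy(dst, ext + (long)t * TILE_WORDS, sizeof(float) * TILE_WORDS);
    }

    /* Stand-in for the convolution/pooling work on one cached tile. */
    static void compute_tile(const float src[TILE_WORDS])
    {
        (void)src;
    }

    void process_tiles(const float *ext, int num_tiles)
    {
        static float buf[2][TILE_WORDS];   /* two BRAM banks, ping-pong */
        if (num_tiles <= 0)
            return;
        load_tile(buf[0], ext, 0);         /* prologue: fill first bank */
        for (int t = 0; t < num_tiles; t++) {
            if (t + 1 < num_tiles)
                load_tile(buf[(t + 1) & 1], ext, t + 1); /* prefetch next tile */
            compute_tile(buf[t & 1]);      /* compute on the current bank */
        }
    }

In software the load and compute calls run back to back, but in hardware the two BRAM banks let the fetch of tile t+1 overlap the computation on tile t, so external memory transfers are hidden behind useful work rather than stalling the compute array.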
Keywords/Search Tags: CNN hardware acceleration, external memory access optimization, high-efficiency multiply-accumulate array, pipeline