
Design And FPGA Implementation Of Convolutional Neural Network Acceleration Module

Posted on: 2021-02-04    Degree: Master    Type: Thesis
Country: China    Candidate: Z W Mei    Full Text: PDF
GTID: 2428330614967721    Subject: Engineering
Abstract/Summary:
With the development of Artificial Intelligence (AI), Convolutional Neural Network (CNN) algorithms have been widely applied in image processing, classification and recognition, target tracking, and other fields. For the hardware acceleration of CNNs, this thesis proposes a design scheme for CNN acceleration modules. Through the design of the corresponding data paths and hardware architecture on an FPGA development platform, the acceleration modules achieve high efficiency, high performance, and low BRAM resource consumption.

First, given the characteristics of common CNNs, the design focuses on accelerating the convolutional layers. In exploring the hardware design space of the convolutional layers, loop unrolling and parallel computing are analyzed, and the impact of data reuse on memory access is summarized. By analyzing how different partitioning methods affect the amount of External Memory Access (EMA) data, the output height and the output channel are selected as the convolution partitioning dimensions. A method for calculating the total amount of EMA data under different data reuse schemes is proposed, providing theoretical support for the design of caching and data reuse in the CNN acceleration structure.

Second, to accommodate the different types of network layers in common CNNs, a unified acceleration design scheme is proposed. In the data path, hierarchical caching and a double-buffer design are adopted to optimize EMA and reduce on-chip memory usage. A dedicated storage format stores input-channel and output-channel data preferentially, and the weights and input feature maps are reused at different stages of internal data transfer. In the hardware architecture, computation is parallelized across the input and output channels and mapped onto a multiply-add array, improving adaptability to different convolution sizes; a sketch of this channel-parallel array appears below. In the convolution acceleration module, a high-efficiency computing array combining a multiply-add tree with a systolic array is designed, and a 5-stage pipeline is introduced to improve acceleration performance. In the pooling acceleration module, the pooling operation is separated into row-wise and column-wise passes to reduce on-chip memory usage. A control module performs data reuse under the convolution partitioning and schedules the convolution and pooling modules to complete the acceleration work together. Experiments on the Xilinx VC707 development platform show that the proposed acceleration structure is versatile across common CNNs: when running the convolutional layers of AlexNet, it achieves a 5.3-fold performance improvement over Eyeriss, and it compares favorably with other FPGA acceleration schemes in DSP slice performance efficiency and BRAM resource consumption.
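To make the channel-parallel compute array concrete, the following minimal C sketch shows one step of the multiply-add array: Tm output-channel lanes each reduce Tn input-channel products through an adder tree and accumulate the result. The tile sizes Tn and Tm and all identifiers are illustrative assumptions, not interfaces taken from the thesis.

    /* One step of a channel-parallel multiply-add array (illustrative
     * sketch; Tn, Tm and all names are assumptions, not the thesis's). */
    #define Tn 8   /* input-channel parallelism  */
    #define Tm 16  /* output-channel parallelism */

    /* Each of the Tm output lanes sums Tn products through an adder
     * tree and accumulates into its running partial sum. */
    void mac_array(const float ifm[Tn], const float wgt[Tm][Tn],
                   float acc[Tm])
    {
        for (int m = 0; m < Tm; m++) {      /* unrolled across lanes */
            float sum = 0.0f;
            for (int n = 0; n < Tn; n++) {  /* adder-tree reduction  */
                sum += ifm[n] * wgt[m][n];
            }
            acc[m] += sum;
        }
    }

In an HLS flow, both loops would plausibly be fully unrolled and the enclosing sliding-window loop pipelined, so that each lane consumes one Tn-wide input vector per cycle once the pipeline fills; the 5-stage pipeline described above would then correspond to register stages inserted through the multiply-add tree.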
Finally, a specialized CNN accelerator design scheme is proposed. To optimize external memory access, a block cache design and data reuse strategy suited to the specific network are selected. In the optimized design, efficiency is improved by adding data processing on the first convolutional layer and by designing an independent fully-connected module, and fusing the pooling module into the convolution module reduces the total amount of data transferred from external memory. An acceleration system is built on the Xilinx VC707 development board, and experiments show performance improvements of 1.56 times and 2.57 times after these optimizations. Comparing the parallel and pipelined designs of the accelerator shows that when the pipelined acceleration is performed in stages on VGG-16, it outperforms the alternatives in DSP slice performance efficiency and BRAM consumption. Compared with the embedded GPU TX2, performance and energy efficiency improve by 1.7 times; compared with related FPGA-based solutions, the design achieves a 1.4-fold improvement in DSP slice performance efficiency and a 5.5-fold improvement in BRAM performance efficiency while saving more than 48% of BRAM resources.
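Both designs lean on double buffering to hide external memory latency behind computation. The following minimal C sketch shows the ping-pong idea; the buffer depth, tile count, and load/compute helpers are illustrative stand-ins, not the thesis's actual interfaces.

    #include <string.h>

    #define TILE_WORDS 1024

    /* Stand-in for a DMA burst from external DDR into an on-chip BRAM bank. */
    static void load_tile(float dst[TILE_WORDS], const float *ext, int t)
    {
        memcpy(dst, ext + (long)t * TILE_WORDS, sizeof(float) * TILE_WORDS);
    }

    /* Stand-in for the convolution/pooling work on one cached tile. */
    static void compute_tile(const float src[TILE_WORDS])
    {
        (void)src;
    }

    void process_tiles(const float *ext, int num_tiles)
    {
        static float buf[2][TILE_WORDS];   /* two BRAM banks, ping-pong */
        if (num_tiles <= 0)
            return;
        load_tile(buf[0], ext, 0);         /* prologue: fill first bank */
        for (int t = 0; t < num_tiles; t++) {
            if (t + 1 < num_tiles)
                load_tile(buf[(t + 1) & 1], ext, t + 1); /* prefetch next tile */
            compute_tile(buf[t & 1]);      /* compute on the current bank */
        }
    }

In software the load and compute calls run back to back, but in hardware the two BRAM banks let the fetch of tile t+1 overlap the computation on tile t, so external memory transfers are hidden behind useful work rather than stalling the compute array.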
Keywords/Search Tags: CNN hardware acceleration, external memory access optimization, high-efficiency multiply-accumulate array, pipeline