| Convolutional neural networks (CNNs) are widely used in computer vision. However, CNN models are both computation-intensive and memory-intensive, which makes them challenging to implement efficiently on embedded devices with limited resources. This paper proposes a CNN acceleration architecture developed with a high-level synthesis (HLS) tool. Our contributions are as follows. (1) To avoid the cumbersome and complicated process of developing FPGA-based CNN accelerators in a hardware description language, we propose an efficient HLS-based FPGA accelerator that supports multiple CNN models. Three representative networks, VGG-16, ResNet-18, and SqueezeNet, were selected for acceleration. (2) The performance of the HLS-based accelerator is improved through multi-level optimization. Loop unrolling and loop tiling are applied. Input data reuse based on loop interchange avoids repeated external memory accesses. A FIFO-based convolution input data bus reuses the input data, improving both memory access efficiency and data reuse. Two fully pipelined processing units (FPUs), the Data-Fetching Unit (DFU) and the Calculation and Accumulation Unit (CAU), are employed in the convolution module to improve computational efficiency. The convolution module is pipelined using ping-pong buffers: its three stages, data input, computation, and storage, have no mutual dependence and can run concurrently, so computation time overlaps with the data transfer overhead between DRAM and the FPGA's on-chip buffers. For the branch structure of ResNet, input data is reused to reduce redundant data transmission. (3) At the system level, data transmission efficiency is improved by a streaming architecture based on multiple DMA channels. Data is represented in fixed-point format to improve computation and storage efficiency. Several CNN models are accelerated on the Xilinx ZC706 platform. Experimental results show that the accelerator achieves a throughput of 227.8 GOPS for ResNet-18 and 162.9 GOPS for VGG-16. For ResNet-18, the resource efficiency is 3.08 GOPS/kLUT and the energy efficiency is 167.5 GOPS/W. |
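
The loop tiling, loop-interchange input reuse, and fixed-point representation named in the abstract can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions, not the accelerator's HLS code: the function names, the tile sizes `t_out` and `t_in`, and the Q8.8 format (`FRAC_BITS = 8`) are all hypothetical, since the abstract does not specify them. Each slice `in_tile` stands in for one transfer from DRAM to an on-chip buffer.

```python
# Software model only; names, tile sizes, and the Q8.8 format are assumptions.
FRAC_BITS = 8  # assumed number of fractional bits (not stated in the abstract)


def to_fixed(x):
    """Quantize a float to a signed fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def conv_direct(inp, wgt):
    """Reference convolution (stride 1, no padding) on integer data.

    inp: [C_in][H][W], wgt: [C_out][C_in][K][K] -> out: [C_out][H-K+1][W-K+1]
    Products of two Q8.8 values carry 2 * FRAC_BITS fractional bits.
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, k = len(wgt), len(wgt[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(c_out)]
    for co in range(c_out):
        for ci in range(c_in):
            for y in range(oh):
                for x in range(ow):
                    for ky in range(k):
                        for kx in range(k):
                            out[co][y][x] += inp[ci][y + ky][x + kx] * wgt[co][ci][ky][kx]
    return out


def conv_tiled(inp, wgt, t_out=2, t_in=2):
    """Same convolution with channel tiling.

    The input-channel tile loop is placed outermost (loop interchange), so each
    input tile is fetched once and reused by every output-channel tile.
    Returns the output and the number of modeled input-tile fetches.
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, k = len(wgt), len(wgt[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(c_out)]
    loads = 0
    for ci0 in range(0, c_in, t_in):            # input-channel tiles (outer loop)
        in_tile = inp[ci0:ci0 + t_in]           # models one DRAM -> on-chip transfer
        loads += 1
        for co0 in range(0, c_out, t_out):      # output-channel tiles reuse in_tile
            for co in range(co0, min(co0 + t_out, c_out)):
                for dci, plane in enumerate(in_tile):
                    kern = wgt[co][ci0 + dci]
                    for y in range(oh):
                        for x in range(ow):
                            acc = 0
                            for ky in range(k):
                                for kx in range(k):
                                    acc += plane[y + ky][x + kx] * kern[ky][kx]
                            out[co][y][x] += acc  # accumulate partial sums across input tiles
    return out, loads
```

Because the arithmetic is done on integers, the tiled result matches the direct result exactly regardless of accumulation order. With the input-tile loop outermost, the number of modeled fetches depends only on the number of input-channel tiles, not on the number of output-channel tiles.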