In recent years, convolutional neural networks (CNNs) have been widely adopted in computer vision tasks such as image classification, object detection, and scene segmentation due to their high accuracy. The classification accuracy of a CNN generally increases with network depth; however, as the network deepens, the model grows larger and the amount of computation required increases dramatically, making a pure software implementation of CNN inference very time-consuming. Various hardware accelerators have emerged to improve the computational performance of CNN models and to meet the real-time and low-power requirements of embedded devices. Among them, the field-programmable gate array (FPGA) has become an ideal platform for hardware acceleration owing to its powerful parallel computing capability, high energy efficiency, and high flexibility. In this paper, the Tiny YOLO algorithm, which has a typical CNN network structure, is implemented in hardware for acceleration, and an accelerator architecture based on a fine-grained image-tiling strategy is proposed. In the hardware design, a padding scheme applied to the line-buffer structure is proposed, which avoids the time-redundancy and space-redundancy problems of the software solution. To further improve the performance of the accelerator, this paper increases computational parallelism by changing the calculation order of the first layer, improves data-transfer efficiency by optimizing the line-buffer structure, and reduces system latency through the ping-pong technique and a fully pipelined design. The experimental results show that this design achieves 270.16 GOP/s at a 150 MHz working frequency. Compared with a CPU implementation, the speedup is 6x; compared with a GPU implementation, the performance-to-power ratio is 9x higher; more importantly, the accelerator shows a 1.3x~1.7x speedup over state-of-the-art FPGA-based designs.