| The volume of image data has grown rapidly as artificial intelligence has advanced and intelligent mobile terminals have become more widespread. Traditional manual interpretation methods can no longer meet the practical needs of image recognition. Convolutional neural networks (CNNs), an emerging approach to image recognition in the field of artificial intelligence, have achieved major breakthroughs in artificial intelligence tasks, and a growing number of scholars are conducting research based on them. An emerging trend in both academia and industry is the deployment of CNNs on edge devices such as smartphones, drones, and artificial intelligence of things (AIoT) devices. However, edge devices usually operate in environments with limited resources and power, where high performance and low energy dissipation are both strongly desired. It is therefore of great significance to improve the performance, power, and area (PPA) of CNN circuits. The systolic array has been the crucial architecture for accelerating CNNs since the success of Google’s TPU (Tensor Processing Unit). However, the traditional systolic array requires complex peripheral circuits to guide fine-grained input features and weights to the designated processing elements, and its loading/offloading delays are usually large. In this work, we propose a high-throughput, low-delay dual-line systolic array to accelerate CNNs. With a line-by-line, vector-style systolic dataflow, the peripheral circuits are considerably simplified and the loading/offloading delays are greatly reduced. Compared with the traditional neural network accelerators, CPUs and GPUs, the field-programmable gate array (FPGA) has the advantages of small size, low power consumption, high parallel computing capability, and low requirements for hardware platform configuration, so the FPGA serves as the hardware platform on which the acceleration strategies are implemented. In addition, to take full advantage of the INT8 computation of the DSPs (digital signal processors) in the FPGA, the dual-line systolic array is developed, which doubles the computation throughput. Finally, the proposed accelerator is deployed on a PYNQ-Z2 board to accelerate the VGG16 neural network; the peak throughput of the convolution layers reaches 107.21 GOPS, exceeding all previous works on the same hardware platform. |
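
The abstract does not spell out how one DSP yields two INT8 results, so the following is only an illustrative sketch of one commonly used idea: packing two signed 8-bit weights into a single wide multiplier operand so that one multiplication by a shared activation produces both products. The function name `packed_int8_multiply` and the 18-bit field offset are assumptions chosen for illustration; they are not taken from this work and are not tied to a particular DSP slice.

```python
def packed_int8_multiply(a, w_hi, w_lo, shift=18):
    """Compute (a * w_hi, a * w_lo) with a single wide multiplication.

    Illustrative model only: `shift` is an assumed field offset that must be
    wide enough to hold one signed INT8 x INT8 product (17 bits or more).
    """
    for value in (a, w_hi, w_lo):
        if not -128 <= value <= 127:
            raise ValueError("operands must be signed 8-bit integers")

    # Pack both weights into one operand: w_hi in the upper field, w_lo below.
    packed = (w_hi << shift) + w_lo

    # One multiplication yields (a * w_hi) * 2**shift + (a * w_lo).
    product = packed * a

    # Recover the lower product by sign-extending the low `shift` bits.
    low = product & ((1 << shift) - 1)
    if low >= 1 << (shift - 1):
        low -= 1 << shift

    # Removing the lower product leaves the upper one, shifted by `shift`.
    high = (product - low) >> shift
    return high, low


if __name__ == "__main__":
    # Two weights share one activation, as two lines of a systolic array might.
    print(packed_int8_multiply(7, -3, 5))
```

In this model the two recovered products correspond to two output lines fed by the same input, which is one plausible reading of how a dual-line arrangement could double throughput per DSP; the actual hardware mapping is described in the body of the work, not here.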