In recent years, with the rapid development of artificial intelligence technology, convolutional neural network (CNN) algorithms have been widely used in embedded application scenarios such as image classification and face recognition. However, as application requirements grow more complex, factors such as inference speed, power consumption, and large parameter counts have become the main bottlenecks restricting the deployment of CNNs on embedded mobile devices with limited resources and power budgets. Because network models iterate rapidly, the efficiency of deploying a variety of models on hardware also needs to improve. How to deploy network models efficiently on embedded mobile devices and achieve effective hardware acceleration has therefore become a current research hotspot. Compared with the central processing unit (CPU) and the graphics processing unit (GPU), the field-programmable gate array (FPGA) has become a mainstream hardware platform for mobile devices owing to its high parallelism, hardware programmability, and low power consumption.

This paper addresses two aspects: the accelerator architecture and efficient model deployment. Using a software-hardware co-design methodology and deep learning compiler technology to couple software and hardware efficiently, it proposes a CNN accelerator based on the ZYNQ platform and realizes a complete image classification application system. The FPGA hardware is written in Verilog: each convolution module is designed with reference to the hardware of the NVIDIA Deep Learning Accelerator (NVDLA), and the circuits are optimized on that basis. By making each convolution computing module independent and pipelined, the accelerator can schedule the modules according to the structures of different network models, allowing a variety of networks to run on the embedded device with a good acceleration
effect. The ARM processor controls the system and runs the software programs. The FPGA and the ARM exchange data over the AXI bus protocol, completing the design of the entire accelerator system-on-chip (SoC) hardware. On the software side, to improve model deployment efficiency and cope with the limited resources of embedded devices, the Tengine AI inference framework is built on the embedded platform and used to perform operator fusion and int8 quantization on the network models, which greatly reduces model memory footprint and hardware resource usage. C++ code connects the Tengine framework to the hardware accelerator, and an application programming interface (API) is designed so that users can invoke the accelerator directly by supplying a network model and images, achieving efficient deployment of network models on the accelerator hardware.

An image classification platform is built on the ZCU104 development board with camera peripherals: pictures captured by the camera are automatically passed into the accelerator system to perform classification tasks. Classification experiments on eight physical objects verify that the platform has a certain stability. Three networks, LeNet-5, AlexNet, and ResNet-18, are trained with the TensorFlow framework and deployed on the accelerator for board-level verification. Analysis of the experimental results shows that the accelerator's inference speed is about 10 times that of the CPU and differs little from that of the GPU, while its power consumption is only about 1/9 of the GPU's; its energy efficiency is 8 times that of the GPU platform and 61 times that of the CPU platform. Because the image data undergo int8 quantization, the image classification accuracy of the accelerator is about 3% lower than
that of the CPU and GPU, which is within a reasonable range and verifies the correctness of the accelerator design. Compared with related accelerators in the literature, this design also has certain advantages in terms of speed and energy efficiency.