In convolutional neural networks (CNNs), lightweight network models adopt depthwise separable convolution in place of standard convolution, which greatly reduces the number of parameters and the computational cost of the network with only a limited loss of accuracy. This facilitates the deployment of CNNs on embedded platforms with limited storage and computing resources, and in applications with strict real-time requirements such as autonomous driving, drones, and mobile computing. The study of CNN accelerators for depthwise separable convolution is therefore of great significance.

In this thesis, a computing engine with configurable computing modes and a feature map buffer array with configurable data bandwidth are designed. The computing engine mainly consists of 14 rows and 32 columns of PE units together with data selectors. The data flow of the input and output feature maps is steered by data-selection control signals, realizing both a systolic computing mode and a column-parallel computing mode. The systolic mode reuses the input feature map across pointwise convolution and standard convolution, reducing the power consumption of memory access. The column-parallel mode computes different channels of depthwise convolution in parallel, improving the throughput of the computing array when calculating depthwise convolution. The feature map buffer array mainly consists of 14 rows and 32 columns of buffer units and data selectors. By controlling the read/write enable signals of the buffer units and the selection signals of the data selectors, different buffer units can be read and written in parallel, so that input feature maps of different bandwidths can be supplied to the computing engine in the two modes and output feature maps of different bandwidths produced by the engine can be stored.

Finally, based on the configurable computing engine and the feature map buffer array, this thesis designs a CNN accelerator for depthwise
separable convolution. The accelerator circuit is logically synthesized in an SMIC 40 nm process. The accelerator operates at 1.1 V and 200 MHz, with 8-bit feature maps and weights. The circuit power consumption reported by DC synthesis is 286.9 mW, the overall accelerator area is 6.14 mm², the peak throughput is 179.2 GOPS, and the energy efficiency is 624.6 GOPS/W. The accelerator is then implemented on a Xilinx ZYNQ7100 FPGA development board; running forward inference of MobileNetV2 with ImageNet as the dataset, it achieves a classification speed of 124.5 frames per second.
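The parameter and computation savings of depthwise separable convolution claimed above can be illustrated with a quick count. The sketch below is illustrative only: the layer dimensions are assumptions for the example, not taken from the thesis.

```python
# Compare a standard k x k convolution with depthwise separable convolution
# (depthwise k x k followed by pointwise 1 x 1), counting parameters and
# multiply-accumulate operations (MACs). Biases are ignored for simplicity.

def conv_costs(h, w, c_in, c_out, k):
    """Standard k x k convolution over an h x w x c_in feature map."""
    params = k * k * c_in * c_out
    macs = h * w * params  # one k*k*c_in dot product per output element
    return params, macs

def dws_conv_costs(h, w, c_in, c_out, k):
    """Depthwise k x k conv plus pointwise 1 x 1 conv."""
    dw_params = k * k * c_in   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1 x 1 conv mixes channels
    params = dw_params + pw_params
    macs = h * w * dw_params + h * w * pw_params
    return params, macs

if __name__ == "__main__":
    # Illustrative layer shape (assumption, not from the thesis).
    h = w = 56
    c_in, c_out, k = 128, 128, 3
    std_p, std_m = conv_costs(h, w, c_in, c_out, k)
    dws_p, dws_m = dws_conv_costs(h, w, c_in, c_out, k)
    # The well-known ratio is 1/c_out + 1/k^2 (~0.119 here).
    print(f"param ratio: {dws_p / std_p:.4f}")
    print(f"MAC ratio:   {dws_m / std_m:.4f}")
```

For this shape the depthwise separable layer needs roughly 12% of the parameters and MACs of the standard convolution, matching the well-known ratio 1/c_out + 1/k².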
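The reported peak throughput and energy efficiency are consistent with the 14 × 32 PE array and 200 MHz clock stated above. A sketch of the arithmetic, assuming the common convention that one multiply-accumulate counts as two operations:

```python
# Sanity-check the reported figures from the array geometry given in the text.
PE_ROWS, PE_COLS = 14, 32   # 14 x 32 PE array
FREQ_HZ = 200e6             # 200 MHz operating frequency
OPS_PER_MAC = 2             # multiply + accumulate (assumed convention)

peak_gops = PE_ROWS * PE_COLS * OPS_PER_MAC * FREQ_HZ / 1e9
print(f"peak throughput: {peak_gops:.1f} GOPS")           # 179.2 GOPS

power_w = 0.2869            # 286.9 mW from DC synthesis
print(f"energy efficiency: {peak_gops / power_w:.1f} GOPS/W")  # 624.6 GOPS/W
```

Both computed values reproduce the figures quoted in the abstract, so the peak number corresponds to every PE issuing one MAC per cycle.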