| A convolutional neural network (CNN) is a multi-layer perceptron inspired by the biological visual system, and it is particularly well suited to extracting image features. Deploying CNNs on embedded devices, however, requires low latency and low power consumption. At present, CNN applications on embedded devices run mainly on CPUs, which cannot exploit the parallelism of CNNs to meet latency requirements. FPGAs offer abundant resources, programmability, and low power consumption, and they can fully exploit CNN parallelism through custom circuits to achieve acceleration. Based on the characteristics of the CNN model, this paper proposes an FPGA-based method for accelerating CNNs and selects the VGG16 model for hardware acceleration analysis and implementation. First, the activation layer of the network model is adjusted and the model data are quantized. Second, by restructuring the convolution iterations and analyzing data dependencies, the convolutional layer is accelerated through parallel computing and pipeline optimization; the input/output feature caches and the weight cache of the convolutional layer are optimized with ping-pong operation to further improve acceleration performance. Third, a dedicated hardware implementation of the max pooling layer is proposed to reduce resource consumption and latency. Fourth, this paper analyzes the theoretical hardware resource usage of the proposed acceleration method and proposes a method for estimating the usage of cache, computation, and bandwidth resources. Following the proposed method, we built a hardware acceleration system on the MZ7100 FA board. Synthesis experiments verify the correctness of the resource estimation method. In addition, the CNN weights are trained and quantized on a GPU. Finally, the software and hardware programs are deployed to the board for on-board testing. Overall, the proposed acceleration method achieves 24.9 ms/image, corresponding to 308 GOP/s, at a power consumption of only 12.1 W. |
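
The three headline figures (24.9 ms/image, 308 GOP/s, 12.1 W) can be combined into a few derived quantities that make comparison with other accelerators easier. The sketch below is only illustrative arithmetic, not part of the paper's method: it assumes that all three figures describe the same steady-state run, that images are processed one at a time, and that "GOP" means 10^9 operations. The function name `derived_metrics` and its field names are hypothetical.

```python
def derived_metrics(latency_ms, throughput_gops, power_w):
    """Combine latency, throughput, and power into derived figures.

    Assumes (not stated in the abstract) that all three inputs were
    measured on the same steady-state run, one image at a time.
    """
    latency_s = latency_ms / 1000.0
    return {
        # Frames per second if images are processed back to back.
        "images_per_s": 1.0 / latency_s,
        # Energy per inference: power (W) times time (s) gives joules.
        "energy_per_image_j": power_w * latency_s,
        # Energy efficiency in GOP/s per watt.
        "gops_per_watt": throughput_gops / power_w,
        # Operation count per image implied by throughput and latency.
        "implied_gop_per_image": throughput_gops * latency_s,
    }


if __name__ == "__main__":
    # Figures quoted in the abstract.
    metrics = derived_metrics(latency_ms=24.9, throughput_gops=308.0, power_w=12.1)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the reported figures correspond to roughly 40 images per second, about 0.3 J per image, and about 25 GOP/s per watt.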