
Design And Implementation Of Convolutional Neural Network Acceleration Based On FPGA

Posted on: 2019-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: M Lai
Full Text: PDF
GTID: 2428330590492463
Subject: Software engineering
Abstract/Summary:
In recent years, Convolutional Neural Networks (CNNs) have developed rapidly and are widely applied in fields such as artificial intelligence and machine learning, and the demand for CNN acceleration is growing accordingly. In the multi-layer nested loops of a CNN, the same data are used by many independent operations, and loading these data repeatedly places great pressure on data storage and read/write bit widths. The keys to CNN acceleration are therefore to optimize the CNN loop operations, to increase the degree of parallelism among independent operations, and to increase data reuse. A Field Programmable Gate Array (FPGA) offers abundant arithmetic processing units and reconfigurable built-in modes for highly parallel computing, which makes it a new choice for CNN acceleration.

In this paper, loop optimization techniques are applied to the multi-layer nested loops of a CNN. Adjacent convolution operations share a large amount of overlapping input data within the kernel window; loading these data only once and shifting them through register groups markedly improves data reuse. Parallel computing is performed along the height and width of the feature map and across the convolution kernels, with the degrees of parallelism denoted PARA_X, PARA_Y, and PARA_KERNEL respectively. The operations of the pooling layer and the fully-connected layer are also optimized around these three parameters, so that the arithmetic processing units on the FPGA are fully reused.

Because the parallel parameters are limited by the on-chip resources of the FPGA, a parallel evaluation model is proposed to analyze and describe the constraints among DSP usage, BRAM storage capacity, and BRAM bit width. The model yields recommended values of the parallel parameters that meet a target demand, and its accuracy and rationality are fully verified in the evaluation.

To reduce the difficulty of custom extension of the CNN acceleration scheme, this paper provides high-level language interfaces for generating Verilog code automatically. The underlying hardware design modules are divided and abstracted into a set of templates, which support the reconfiguration of hardware modules with different parallel parameters. Users can obtain the Verilog code without modifying the hardware design manually: the interfaces rebuild the modules according to the parallel parameters and the structure of the CNN model, then generate a complete hardware design automatically. In the evaluation, all test groups are generated through these high-level language interfaces; because the details of the hardware design are hidden, the scheme is much easier to use and more efficient. The test CNN model contains 2 convolution layers, 2 pooling layers, and 1 fully-connected layer; its maximum data size is 200704, and the maximum kernel number of a convolution layer is 256. The best BRAM utilization reaches 91% or more, and the processing time is 11 ms.
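The parallel evaluation model described above can be illustrated with a small sketch. The thesis does not give its exact formulas, so the cost model below is an assumption for illustration only: each parallel lane is charged one DSP multiply-accumulate per kernel tap, and each lane is charged one BRAM read port of `data_width` bits. The function name `recommend_parallel_params` and all parameter names are hypothetical, not taken from the thesis.

```python
from itertools import product

def recommend_parallel_params(dsp_total, bram_blocks, bram_width_bits,
                              fmap_h, fmap_w, n_kernels,
                              kernel_size=3, data_width=16):
    """Enumerate (PARA_X, PARA_Y, PARA_KERNEL) combinations and return the
    one with the highest throughput that fits the FPGA resource budget.
    The DSP and BRAM cost formulas are an assumed, simplified model."""
    best = None
    for px, py, pk in product(range(1, fmap_w + 1),
                              range(1, fmap_h + 1),
                              range(1, n_kernels + 1)):
        # Assumed DSP cost: one MAC per lane per kernel tap.
        dsp_needed = px * py * pk * kernel_size * kernel_size
        # Assumed BRAM bandwidth cost: one data word per input lane
        # plus one per kernel lane, read every cycle.
        bits_needed = (px * py + pk) * data_width
        if dsp_needed > dsp_total:
            continue
        if bits_needed > bram_blocks * bram_width_bits:
            continue
        throughput = px * py * pk  # MACs issued per cycle
        if best is None or throughput > best[0]:
            best = (throughput, (px, py, pk))
    return best[1] if best else None
```

Under this model, asking for a recommendation on a device with 900 DSP slices and 100 BRAM blocks of 36-bit width, for a 14x14 feature map with 64 kernels, returns a parameter triple whose product saturates the DSP budget (100 parallel MACs at 9 taps each). The real model in the thesis additionally accounts for BRAM storage capacity, which this sketch omits.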
Keywords/Search Tags:Convolutional Neural Network, Optimization of Loop Operation, Hardware Acceleration, FPGA