
Research And Implementation Of FPGA Accelerated Convolutional Neural Network Training

Posted on: 2019-01-18    Degree: Master    Type: Thesis
Country: China    Candidate: X S Wei    Full Text: PDF
GTID: 2428330572450302    Subject: Measuring and Testing Technology and Instruments
Abstract/Summary:
The Convolutional Neural Network (CNN) is a deep learning model with a multi-layered structure that extracts complex features from high-dimensional data through large-scale training. Compared with inference, the training process of a convolutional neural network has a more complicated processing flow and a far larger computational load, and it also involves transmitting large amounts of data and buffering intermediate data. As convolutional neural networks grow in scale to solve more abstract and complex problems, traditional general-purpose computing platforms with a serial operation mode can no longer meet the needs of network training. A Field Programmable Gate Array (FPGA), with its large number of logic and arithmetic units, has outstanding advantages in performance, parallel computing, power consumption, and size, which makes it well suited to accelerating the training of convolutional neural networks.

This paper studies the training process of convolutional neural networks in depth, analyzes the intra-layer and inter-layer parallelism of that process, and surveys current hardware architectures for reference. On this basis, it presents a new hardware training framework built on the Zynq series FPGA architecture to accelerate CNN training. The PS (Processing System) of the Zynq chip serves as the control center of the entire framework, while the PL (Programmable Logic) is designed as the training-computation core responsible for accelerating the training process. This core consists of a forward engine, a backward engine, and a hidden-data queue, so that forward propagation and back propagation can run simultaneously during training. According to the characteristics of the training process, the calculation modules in the forward and backward engines are designed separately.

To reduce the amount of intermediate data that must be buffered during training, this paper proposes a data encoding method that compresses the intermediate data, cutting its storage footprint to roughly 4% of the original. At the same time, based on the mathematical characteristics of back propagation, a method is proposed to eliminate invalid calculations in the backward pass and improve the utilization of computing resources. To reduce latency, the framework adopts a point-based data transmission method and optimizes the computation of the convolution layer, which effectively lowers the output delay of each module.

To verify the performance of the Zynq-based training framework, we simulate and implement it with the Xilinx FPGA development kit. Taking the LeNet-5 network as an example, a training model is built with the proposed framework and implemented at board level on the ZC706 evaluation board. Tests of the hardware implementation show that, at 32-bit floating-point precision, the framework reaches a computational performance of 33.6 GOPS when training LeNet-5; a training iteration over 100 mini-batches takes 7.9 ms, and the power consumption is below 5 W. Compared with CPU and GPU platforms, the proposed framework delivers 6.8 times the performance of the CPU platform and 9.7 times the energy efficiency of the GPU platform. Therefore, the training framework designed in this paper achieves higher computational performance at lower power consumption and accelerates the training of convolutional neural networks more efficiently.
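The abstract does not specify the encoding scheme used to shrink the buffered intermediate data, but one common approach consistent with the reported ~4% footprint is to store a 1-bit activation mask per element instead of the full 32-bit activation, since the backward pass of a ReLU layer needs only the sign pattern of its input. The sketch below illustrates that idea in numpy; the function names and the packed-bit layout are illustrative assumptions, not the thesis's actual design.

```python
# Hedged sketch: buffering a 1-bit ReLU mask for the backward pass instead
# of the full 32-bit activations. This is an illustration of the general
# compression idea only; the thesis's actual encoding is not specified.
import numpy as np

def relu_forward(x):
    """Forward ReLU; also return a packed 1-bit mask of the active units."""
    mask = x > 0
    packed = np.packbits(mask.ravel())  # 1 bit per element instead of 32
    return np.where(mask, x, 0.0), packed

def relu_backward(grad_out, packed, shape):
    """Backward ReLU using only the packed mask: dx = grad_out where x > 0."""
    n = int(np.prod(shape))
    mask = np.unpackbits(packed)[:n].reshape(shape).astype(bool)
    return np.where(mask, grad_out, 0.0)

x = np.random.randn(4, 8).astype(np.float32)
y, packed = relu_forward(x)
dx = relu_backward(np.ones_like(y), packed, x.shape)
# Storage for the buffered intermediate: 1 bit per element vs. 32 bits,
# i.e. roughly 3% of the original, on the order of the ~4% reported above.
```

Masking the upstream gradient with the same stored bits also shows why many backward-pass multiplications are "invalid" (guaranteed zero) and can be skipped, which is the kind of reduction the proposed backpropagation optimization targets.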
Keywords/Search Tags:Convolutional Neural Network, Training, Field Programmable Gate Array, Hardware accelerator, Parallelism