
The Research On Parallel Architecture For FPGA-based Convolutional Neural Networks

Posted on: 2014-10-30  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Z J Lu  Full Text: PDF
GTID: 1268330425966963  Subject: Computer application technology
Abstract/Summary:
With the great progress of IC design and manufacturing technology, FPGAs (Field-Programmable Gate Arrays) with high speed and high capacity have developed rapidly, and far more logic resources are now integrated on a single chip. Mainstream FPGA devices contain large amounts of programmable logic, interconnect resources, and on-chip storage, and FPGAs with integrated DSP hard IP blocks can support high-performance multiplication. These features make FPGAs an important choice for accelerating compute-intensive applications. Among the most studied compute-intensive applications, convolutional neural networks, an important class of multilayer neural networks, occupy a central position in research and are of great value for solving pattern-recognition problems.

The parallel architecture of the CNN (Convolutional Neural Network) is the fundamental part of a unified CNN computing framework. Building on existing work, this thesis studies CNN parallel architecture systematically and makes the following contributions:

CNN computation exhibits multiple types of parallelism. One of the main problems of CNN parallel computing is how to design the parallel structure and exploit the parallelism according to these types. Based on a "Host + FPGA" computing framework, the position of the parallel computing units and their interfaces to other function units can be decided. A configurable CNN parallel computing structure is proposed, in which the connections between the input/output feature maps and the CNN computing units are controlled by cross-connection switches. Practical applications show that the proposed structure can be configured into different computing structures according to different architectural characteristics, which fully exploits the intra-layer parallelism and raises performance.

Under FPGA resource constraints, implementing a completely parallel structure is not feasible.
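The abstract gives no implementation detail for the switch fabric; the following minimal Python sketch (all names hypothetical) only illustrates the idea that a configuration table routes input feature maps to parallel computing units, so the same hardware can be reconfigured for layers with different connection patterns.

```python
# Hypothetical software model of the cross-connection switch idea:
# a configuration table routes input feature maps to parallel
# computing units. All names are illustrative, not from the thesis.

def route(feature_maps, config):
    """Return, for each computing unit, the input maps it sees.

    feature_maps: dict  map_id -> data
    config:       dict  unit_id -> list of map_ids (switch settings)
    """
    return {unit: [feature_maps[m] for m in maps]
            for unit, maps in config.items()}

# A layer with 3 input maps feeding 2 parallel units.
maps = {0: "m0", 1: "m1", 2: "m2"}
cfg_a = {0: [0, 1], 1: [1, 2]}   # one switch setting
cfg_b = {0: [0], 1: [2]}         # reconfigured for another layer

print(route(maps, cfg_a))  # {0: ['m0', 'm1'], 1: ['m1', 'm2']}
```

Reconfiguring only the `config` table, not the datapath, is what lets one structure serve layers with different fan-in.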
Only part of the convolution kernels can be implemented in parallel, and how to map a large number of convolution loop iterations onto limited computing elements remains an open problem. This thesis presents an intra-layer CNN loop model, splits the intra-layer computation according to different parallel computing structures, and schedules the multiple iterations of the convolution loops. The execution time of each computing structure can be obtained from the given cost functions, which characterize the different computation-splitting methods within the resource constraints; this information can then be used to choose among CNN parallel structures.

CNN computing performance is strongly affected by the computing efficiency of the convolution kernels. A key problem of 2D convolution is how to design a good data-buffering structure: the hardware cost and performance of a 2D convolution implementation depend mainly on the data buffer design, and existing internal buffer structures have deficiencies in real applications. To make better use of storage area, an area-optimized data buffer is proposed, which uses a register-rotation strategy to exploit data reuse in the convolution; experimental results show that the proposed buffer makes full use of the storage bandwidth and area of on-chip memories. To exploit the inherent parallelism of CNNs and raise output throughput, a memory-bandwidth-optimized structure is also proposed, which uses a fixed-bandwidth broadcast strategy and a single-data-stream-driven pipeline to make full use of on-chip shift-register resources. Experimental results show that this structure reduces the off-chip memory bandwidth requirements and improves the output throughput.

One of the key problems of designing a CNN is how to decide the number of feature maps in each layer for different CNN applications.
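The loop-splitting idea above can be sketched as a small search over tiling factors under a processing-element budget; the cost function below is an illustrative stand-in (all parameters and formulas hypothetical), not the thesis's actual cost model.

```python
# Illustrative cost model: given an intra-layer convolution loop
# nest, estimate cycles for different ways of splitting the loops
# across a fixed budget of multipliers (PEs). All numbers and
# names are hypothetical.
import math

def cycles(out_maps, in_maps, out_h, out_w, k, par_out, par_kernel):
    """Estimated cycles when `par_out` output maps are computed in
    parallel and `par_kernel` kernel taps are evaluated per cycle."""
    taps = k * k
    per_pixel = in_maps * math.ceil(taps / par_kernel)
    return math.ceil(out_maps / par_out) * out_h * out_w * per_pixel

def best_split(out_maps, in_maps, out_h, out_w, k, pe_budget):
    """Search (par_out, par_kernel) pairs that fit the PE budget
    and return the cheapest as (cycles, par_out, par_kernel)."""
    best = None
    for po in range(1, out_maps + 1):
        for pk in range(1, k * k + 1):
            if po * pk > pe_budget:
                continue  # this split does not fit on chip
            c = cycles(out_maps, in_maps, out_h, out_w, k, po, pk)
            if best is None or c < best[0]:
                best = (c, po, pk)
    return best

print(best_split(out_maps=6, in_maps=3, out_h=28, out_w=28,
                 k=5, pe_budget=32))  # (11760, 6, 5)
```

The point is the selection mechanism: different splits that fit the same resource budget can differ widely in estimated execution time, and the cost function makes that comparison explicit.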
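As a software analogue of the data reuse that a line-buffer scheme provides for 2D convolution, the following sketch (structure and names hypothetical, not the thesis's register-rotation design) reads each pixel once and assembles every k x k window from buffered rows instead of re-reading external memory.

```python
# Illustrative model of line-buffer data reuse: the deque holds the
# last k image rows (hardware would use k-1 line buffers plus the
# incoming row), and every k x k window is built from buffered data.
from collections import deque

def sliding_windows(image, k):
    """Yield all k x k windows of `image`, reading each pixel once."""
    rows = deque(maxlen=k)           # oldest row drops out automatically
    for row in image:
        rows.append(row)
        if len(rows) == k:
            for x in range(len(row) - k + 1):
                yield [r[x:x + k] for r in rows]

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 test image
wins = list(sliding_windows(img, 3))
print(len(wins))    # 4 windows for a 4x4 image with k=3
print(wins[0])      # [[0, 1, 2], [4, 5, 6], [8, 9, 10]]
```

Without such buffering, each pixel of a k x k convolution would be fetched up to k*k times; buffering k rows on chip reduces off-chip traffic to one read per pixel, which is the effect the proposed structures optimize.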
The thesis explores the design space of feature map numbers in existing CNN applications and designs a feature-map-number-configurable C model, which is trained and tested on the NetBatch computing platform. Experimental results show that a lower bound on the feature map numbers can be found by this method, which provides useful topology information for the RTL design.
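The exploration loop can be sketched as follows. The stand-in accuracy function below is purely hypothetical (the thesis trains and tests C models on NetBatch); only the shape of the search for a lower bound on feature map counts is illustrated.

```python
# Sketch of feature-map design-space exploration: enumerate
# per-layer feature map counts and keep the smallest configuration
# that still meets a quality target. The accuracy function is a
# hypothetical stand-in for "train and test this configuration".

def accuracy_pct(n_maps_l1, n_maps_l2):
    # Stand-in model: accuracy (in percent) grows with map counts
    # and saturates; a real flow would train and evaluate a network.
    return min(99, 50 + 3 * n_maps_l1 + 2 * n_maps_l2)

def lower_bound(target_pct=90, max_maps=16):
    """Smallest (by total map count) configuration meeting the target,
    returned as (total_maps, maps_layer1, maps_layer2)."""
    feasible = [(l1 + l2, l1, l2)
                for l1 in range(1, max_maps + 1)
                for l2 in range(1, max_maps + 1)
                if accuracy_pct(l1, l2) >= target_pct]
    return min(feasible) if feasible else None

print(lower_bound())  # (14, 12, 2)
```

The returned configuration plays the role of the "lower boundary" mentioned above: the cheapest topology that still meets the application's quality requirement, which is the number the RTL design then has to support.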
Keywords/Search Tags: Convolutional Neural Networks, FPGA, Parallel Architecture, Computation Partition, Data Buffering