Research On Energy-Efficient Implementation Of Convolutional Neural Network Based On Heterogeneous FPGA Cluster

Posted on: 2022-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Hu
GTID: 2518306602965259
Subject: Master of Engineering
Abstract/Summary:
With the continuous development of artificial intelligence, autonomous driving, and related fields, data-driven high-performance computing is becoming increasingly important. The convolutional neural network (CNN) is one of the most popular techniques in artificial intelligence, and it places ever higher demands on the computing performance and power consumption of processor hardware. FPGA-based CNN accelerators have shown excellent performance in real-time inference tasks such as autonomous driving, but the resource requirements of CNNs are growing far faster than the resources integrated into a single FPGA. Limited on-chip resources therefore restrict model parallelism, which in turn limits further gains in performance and energy efficiency. As a result, it is difficult for a single FPGA to flexibly meet the high-throughput and energy-efficiency requirements of accelerating CNNs of different computing scales in various scenarios.

To address this, this thesis proposes a method based on a heterogeneous FPGA cluster that flexibly meets the energy-efficiency and high-throughput requirements of CNN acceleration. Exploiting the characteristics of the heterogeneous FPGA cluster, it adopts a scalable inter-board pipeline structure for parallelism to achieve high throughput. Because CNN structures are complex, it is difficult to distribute tasks reasonably and evenly across the FPGA cluster, and an uneven distribution reduces both throughput and energy efficiency. To deploy a CNN reasonably onto the FPGAs, this thesis proposes a general task assignment method that searches for the optimal deployment of each layer of the CNN model on each FPGA. In addition, an FPGA design may suffer from a mismatch between bandwidth and computing power, which lowers the utilization of computing resources and ultimately hinders energy-efficiency improvement. This thesis accurately analyzes the relationship between computing power and bandwidth and, by solving the resulting formulas, reaches the peak of the Roofline model to improve resource efficiency and energy efficiency.

The main contents of this thesis are as follows:

1. A scalable pipeline architecture based on heterogeneous FPGA clusters is proposed to achieve high throughput. The global cache of each FPGA alternately buffers input and output data to reduce memory accesses. Aurora high-speed serial links are used for communication between FPGAs to reduce inter-board communication latency. Each heterogeneous FPGA chip integrates an ARM processor and FPGA fabric into an SoC, and each SoC is a node of the cluster. The ARM part is called the PS (processing system) side, and the FPGA part is called the PL (programmable logic) side. The cluster is composed of a master node and several slave nodes. The master node continuously sends and receives CNN tasks, and the PL side of each node hosts a CNN accelerator IP that completes part of the CNN computation. The PS controls the PL to realize the heterogeneous computing mode of the FPGA cluster.

2. Unlike traditional task assignment methods, this thesis proposes using HLS latency estimates for task assignment, together with a general task assignment method that uses bisection (binary search) to deploy all layers of the CNN evenly across the cluster and so improve throughput and energy efficiency. This method finds a solution much faster than exhaustive search, and to the author's knowledge this is the first use of bisection to solve this task assignment problem.

3. The parallel acceleration design of the CNN accelerator IP is carried out, and a method for optimizing that design is proposed. By accurately analyzing the relationship between bandwidth and computing power, this thesis determines whether bandwidth or computing power is the factor limiting performance. By matching the two, the optimal point of the Roofline model is obtained, which improves resource efficiency and energy efficiency.

With AlexNet, VGG-16, and MobileNet as CNN tasks, heterogeneous FPGA clusters with 1 to 6 nodes were built for the experiments. On the AlexNet task assignment problem, the proposed task assignment method improves throughput by 10.5% and energy efficiency by 19.65% over the traditional method. The heterogeneous FPGA cluster implementing CNN inference in this thesis achieves a throughput of up to 159.3 GOPS, an energy efficiency of up to 9.84 GOPS/W, and a DSP resource efficiency of up to 6.38E-4 GOPS/DSPs/Freq. In terms of energy efficiency, in particular for VGG-16 with 16-bit fixed-point precision, the design is 43.3% higher than a GPU, 59.5% higher than a single-FPGA design, and 18.8% higher than a previous FPGA cluster design. The results show that the proposed design achieves energy-efficient CNN inference and meets the design goals of this thesis.
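
The abstract does not give the details of the bisection-based task assignment, so the following is only a minimal sketch of one common way such a method can work, not the thesis's actual algorithm. It assumes that consecutive layers are mapped to FPGAs in a fixed pipeline order, that `latency[i][k]` is a hypothetical HLS-estimated cycle count for layer `i` on FPGA `k`, and that the goal is to minimize the slowest pipeline stage. The function names and the latency numbers are illustrative and do not come from the thesis.

```python
from typing import List, Optional, Tuple


def try_assign(latency: List[List[int]], limit: int) -> Optional[List[int]]:
    """Greedy feasibility check for one candidate stage-latency limit.

    Each FPGA, in pipeline order, takes as many consecutive layers as fit
    within `limit` cycles. Returns the number of layers per FPGA, or None
    if some layers are left over.
    """
    n_layers = len(latency)
    n_fpgas = len(latency[0])
    counts = []
    layer = 0
    for k in range(n_fpgas):
        load = 0
        taken = 0
        while layer < n_layers and load + latency[layer][k] <= limit:
            load += latency[layer][k]
            layer += 1
            taken += 1
        counts.append(taken)
    return counts if layer == n_layers else None


def balance_layers(latency: List[List[int]]) -> Tuple[int, List[int]]:
    """Bisect on the bottleneck stage latency (in cycles).

    Assumes a non-empty latency table with positive integer entries.
    Returns the smallest feasible bottleneck and the layers per FPGA.
    """
    lo = 0
    hi = sum(row[0] for row in latency)  # everything on FPGA 0 always fits
    while lo < hi:
        mid = (lo + hi) // 2
        if try_assign(latency, mid) is not None:
            hi = mid       # feasible: try a tighter limit
        else:
            lo = mid + 1   # infeasible: relax the limit
    return lo, try_assign(latency, lo)


# Illustrative numbers only: 4 layers, 2 FPGAs, the second one faster.
example = [[4, 2], [3, 2], [2, 1], [5, 3]]
print(balance_layers(example))  # (6, [1, 3])
```

In this example the first FPGA takes one layer (4 cycles) and the second takes the remaining three (6 cycles), so the pipeline bottleneck is 6 cycles. Each feasibility check is linear in the number of layers, and the bisection needs only a logarithmic number of checks, which is the kind of speed advantage over exhaustive enumeration that the abstract describes.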
Keywords/Search Tags:CNN, FPGA cluster, Heterogeneous computing, Energy efficiency, Throughput