
Research On Heterogeneous Reconfigurable Dataflow Accelerator For Big Data Applications

Posted on: 2021-07-04    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J Z Shen    Full Text: PDF
GTID: 1488306548492564    Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the rapid development of artificial intelligence has attracted widespread attention worldwide. The continuous iteration of deep learning algorithms, represented by convolutional neural networks (CNNs) and graph convolutional neural networks (GCNs), has dramatically increased the computing performance and energy consumption demands placed on computer systems. However, due to the stagnation of Moore's Law and the limitations of the von Neumann architecture, existing general-purpose computer systems cannot efficiently execute deep learning algorithms. Customized hardware, represented by FPGAs, offers a new way to address these problems. On the one hand, the reconfigurability of FPGAs allows them to adapt to rapidly evolving algorithms, overcoming the poor adaptability of ASIC solutions to emerging algorithms; on the other hand, the high energy efficiency of FPGAs in accelerating deep learning algorithms has made them popular among researchers. As major technology companies at home and abroad deploy FPGA devices in data centers to build heterogeneous acceleration clusters, the advantages of the CPU+FPGA architecture for big data applications have gradually been confirmed, and CPU+FPGA heterogeneous computing shows very good prospects.

This dissertation is based on a CPU-multi-FPGA heterogeneous architecture and targets typical deep learning and big data applications, focusing on key technologies such as accelerator architecture, algorithm-to-hardware mapping, dataflow accelerators, performance models, and distributed acceleration schemes. Our contributions are as follows:

· FPGA acceleration for 3D convolutional neural networks. We propose a template-based architecture and design methodology for accelerating 2D and 3D CNNs. First, to reduce algorithmic complexity, we adopt the Winograd fast algorithm and extend it to support 3D CNNs. Second, we extract the operators common to the Winograd-based 2D and 3D convolutional layers and design a series of reconfigurable computation templates around them. Finally, we describe the templates and the acceleration engine in a high-level synthesis language and use HLS tools to generate the accelerator's RTL code, enabling rapid, template-based accelerator generation. Because 2D and 3D CNNs differ, design space exploration methods developed for 2D CNN accelerators may no longer apply to 3D CNN accelerators. To solve this problem, we propose a unified performance analysis model and use a single design space exploration method to determine the optimal design parameters for both 2D and 3D CNN accelerators. Experimental results show that our VGG accelerator matches the performance of state-of-the-art neural network accelerators with lower resource overhead. In addition, our C3D accelerator achieves a 13x performance improvement over a CPU and, in terms of energy efficiency, a 60x and 30x improvement over a CPU and a GPU, respectively.

· Mapping entire 2D/3D convolutional neural networks onto the FPGA platform. To address the loss of computational efficiency caused by differences in layer sizes, we further propose a pipelined multi-accelerator solution. Its main feature is that all inter-layer data are stored on the FPGA chip, which increases inter-layer data reuse and effectively reduces off-chip memory access overhead, further improving throughput and energy efficiency over our prior work. To reduce the on-chip storage overhead of inter-layer data, we also propose a layer fusion determination algorithm, which divides the network layers into two sets according to the FPGA's on-chip storage capacity: fusion layers and non-fusion layers. By changing the loop order of two successive convolutional layers within a fusion pair, data reuse can be significantly improved. To resolve load imbalance among the pipelined accelerators, we propose a simple and efficient load balancing scheduling scheme that further improves computational efficiency. Experimental results show that, compared to our previously proposed accelerator design, the pipelined multi-accelerator solution achieves up to a 2.3x performance improvement, and a 64x and 5x computational performance improvement over a CPU and a GPU, respectively.

· Parallel acceleration of 3D CNN-based medical image recognition. We propose a set of acceleration methods for lung nodule detection on a CPU-multi-FPGA heterogeneous computing platform, taking a 3D CNN-based lung nodule detection application as the target. We first analyze the parallelism of the algorithm in depth and extract its kernel workloads, LNS-net and LNC-net, then propose "model parallel" and "data parallel" mapping schemes tailored to the characteristics of the two networks. In view of the sparse computation in LNS-net's deconvolution layers, we improve on the 3D CNN accelerator proposed in Chapter 2, saving hardware resources and improving resource utilization. In addition, we design two types of interconnection between FPGA nodes, implementing a flexible FPGA node topology, and test the system on a custom FPGA acceleration board. The experimental results show that the proposed heterogeneous system scales well, that the acceleration solutions for LNS-net and LNC-net achieve higher computational throughput and energy efficiency than a CPU and a GPU, and that the system reaches state-of-the-art detection accuracy.

· Parallel acceleration of deep graph convolutional neural networks. We propose a distributed parallel acceleration scheme for deep GCNs. We select a typical deep graph convolutional network, DAGCN, as the target, analyze its computation pattern and sparsity, and experimentally verify the positive relationship between network accuracy and network depth. We propose an efficient network mapping scheme: the host CPU is responsible for aggregating the outputs of the network layers, while the FPGAs are responsible for the kernel computational task, i.e., the graph convolution layers. Similar to the pipelined parallel acceleration scheme above, we implement multiple graph convolution acceleration engines in each FPGA node to accelerate multiple graph convolution layers in parallel, and organize all acceleration engines as a deep pipeline across all FPGA nodes to improve computational throughput. To determine the optimal design parameters and maximize performance, we develop mathematical models to evaluate the accelerators' performance and resource consumption. Compared with state-of-the-art FPGA-based graph neural network accelerators, our accelerator achieves comparable computational performance and efficiency.
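The Winograd fast algorithm mentioned above trades multiplications for cheap additions, which is why it suits DSP-limited FPGAs. As a hedged illustration of the idea (our own minimal 1D sketch, not the dissertation's HLS templates), the basic F(2, 3) building block computes two outputs of a 3-tap convolution with 4 multiplications instead of the naive 6:

```python
def winograd_f23(d, g):
    """F(2, 3): two outputs of a 3-tap filter g over 4 inputs d,
    using 4 multiplications (m1..m4) instead of the naive 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Naive sliding-window reference for comparison."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(len(d) - 2)]
```

For example, `winograd_f23([1, 2, 3, 4], [1, 1, 1])` returns `[6.0, 9.0]`, matching the direct convolution. Nesting this transform along each spatial axis gives the 2D and 3D variants the dissertation builds on.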
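The layer fusion idea in the second contribution can be sketched as follows: each output tile of the second convolutional layer pulls only the slice of the first layer's output it depends on, so the full inter-layer feature map is never written off-chip. This is a 1D toy version under our own assumptions (tile size, valid-mode convolution); the dissertation's scheme operates on real 2D/3D layers:

```python
K = 3  # filter taps (illustrative)

def conv1d(x, w):
    """Valid-mode 1D convolution reference."""
    return [sum(x[i + k] * w[k] for k in range(K)) for i in range(len(x) - K + 1)]

def fused_conv(x, w1, w2, tile=4):
    """Two fused convolutional layers: the intermediate tile y1_tile
    stays 'on chip' and the full layer-1 output is never materialized."""
    n_out = len(x) - 2 * (K - 1)  # length after two valid convolutions
    y2 = []
    for t in range(0, n_out, tile):
        width = min(tile, n_out - t)
        # layer-1 needs (width + K - 1) values, hence this input slice
        x_slice = x[t : t + width + 2 * (K - 1)]
        y1_tile = conv1d(x_slice, w1)   # intermediate tile, kept local
        y2.extend(conv1d(y1_tile, w2))  # layer 2 consumes it immediately
    return y2
```

The fused result is identical to running the two layers back to back; only the memory traffic for the intermediate feature map changes, which is the point of the fusion-layer set in the text.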
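The load balancing scheduling scheme for the pipelined accelerators is not specified in detail here; one common formulation, given as a hedged sketch, is to partition the per-layer workloads into contiguous pipeline stages so that the heaviest stage (which bounds pipeline throughput) is as light as possible. A binary search over the stage capacity solves this; all names are illustrative:

```python
def min_max_stage_load(loads, n_stages):
    """Split `loads` (per-layer work, in pipeline order) into at most
    `n_stages` contiguous stages, minimizing the maximum stage load."""
    def stages_needed(cap):
        # Greedily pack layers into stages of capacity `cap`.
        count, acc = 1, 0
        for w in loads:
            if acc + w > cap:
                count, acc = count + 1, 0
            acc += w
        return count

    lo, hi = max(loads), sum(loads)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= n_stages:
            hi = mid  # feasible: try a tighter bound
        else:
            lo = mid + 1
    return lo
```

For instance, layers with loads `[4, 2, 3, 5, 1]` split across 2 stages give a best worst-stage load of 9 (stages `[4, 2, 3]` and `[5, 1]`).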
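The unified performance model and design space exploration in the first contribution follow a pattern that can be shown in miniature: enumerate candidate tiling factors, estimate latency and resource use with an analytical model, and keep the fastest design that fits the budget. The cost model below (one multiplier per tiling lane, cycle count from tile counts) and all parameter names are our illustrative assumptions, not the dissertation's actual model:

```python
from math import ceil

def explore(M, N, R, C, K, dsp_budget):
    """Toy DSE for a tiled convolution layer: M/N output/input channels,
    R x C output feature map, K x K kernel. Returns (cycles, Tm, Tn)."""
    best = None
    for tm in range(1, M + 1):          # output-channel tiling factor
        for tn in range(1, N + 1):      # input-channel tiling factor
            if tm * tn > dsp_budget:    # assume one DSP per (tm, tn) lane
                continue
            # each cycle processes tm x tn multiply-accumulates
            cycles = ceil(M / tm) * ceil(N / tn) * R * C * K * K
            if best is None or cycles < best[0]:
                best = (cycles, tm, tn)
    return best
```

For example, `explore(64, 32, 28, 28, 3, dsp_budget=512)` selects a tiling with `Tm * Tn <= 512` whose estimated latency is 28224 cycles. A real model would also bound BRAM and bandwidth, which is exactly why the dissertation needs a unified model covering both 2D and 3D layers.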
Keywords/Search Tags: Deep learning, big data applications, CPU+FPGA hybrid computing, 3D convolutional neural networks, lung nodule detection, graph convolutional neural network