
Architecture Research on an FPGA-Centric Cluster Based on Accelerator-Level Parallelism

Posted on: 2021-05-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Q Wang
Full Text: PDF
GTID: 1368330602497394
Subject: Microelectronics and Solid State Electronics
Abstract/Summary:
High-performance computing (HPC) has become the main driving force for development in many fields of science and technology, and supercomputers for HPC have become an important infrastructure. The mainstream of HPC research is to explore more parallelism opportunities in computation to speed up algorithms, and the hardware technology of supercomputers has developed along the same lines. In recent decades, high-performance processor hardware has progressed from the initial Bit-Level Parallelism (BLP), which increased the processor's bit width, to the Instruction-Level Parallelism (ILP) of out-of-order and superscalar execution, and then to multi-core CPUs and hardware accelerators represented by GPUs and Xeon Phi (Thread-Level Parallelism (TLP) and Data-Level Parallelism (DLP)). Historical experience shows that when researchers find it difficult to exploit parallelism at a fine granularity, they turn to searching for more coarse-grained parallelism.

Today it has become increasingly difficult to design next-generation high-performance processors along the path of Thread-Level Parallelism and Data-Level Parallelism, so the industry must find parallelism opportunities at a coarser granularity. Professor Mark D. Hill of the University of Wisconsin-Madison and Professor Vijay Reddi of Harvard University proposed the concept of Accelerator-Level Parallelism (ALP): multiple sub-tasks of an application are executed simultaneously, in parallel, on multiple customized accelerators. Since the basic unit of parallelism in ALP is a customized hardware accelerator, it offers better performance and lower power consumption than the general-purpose processors of TLP and DLP.

As an emerging reconfigurable hardware accelerator, the FPGA has gained more and more attention in the field of HPC. The high parallelism, high customizability, and low power consumption of FPGAs make them well suited for use as highly customized accelerators. Moreover, FPGAs integrate several high-performance serial transceivers that can provide multi-port, high-bandwidth, low-latency communication. Based on these characteristics, this dissertation takes an FPGA-Centric cluster as its platform and explores the architecture design of a reconfigurable cluster suitable for HPC under the idea of accelerator-level parallelism. For typical HPC applications such as grid computing and artificial-intelligence training, hardware accelerators suited to the FPGA-Centric cluster are designed, and the scalability across multiple accelerator nodes is explored. The main research work and innovations of this dissertation include:

(1) This dissertation proposes an implementation of an FPGA-to-FPGA direct interconnection network in an FPGA-Centric cluster. In this design, the high-bandwidth, low-latency serial transceivers integrated in the FPGA chip construct the physical-layer interconnection, and the router logic is instantiated in each FPGA accelerator node in a distributed manner. The framework supports two communication models, message passing and streaming, to provide communication services between accelerators. The design also supports a more advanced collective communication mode, including Multicast, Reduction, and other communication functions widely used in HPC.

(2) Based on the aforementioned streaming communication model of the FPGA-Centric cluster, this dissertation proposes FPDeep, a scalable method for accelerating deep neural network (DNN) training. FPDeep uses layer parallelism and model parallelism to deploy DNN training tasks across the cluster in a distributed manner, solving the poor scalability caused by increasing batch size when DNN training is performed on a traditional supercomputer. Furthermore, FPDeep includes an algorithm for achieving a balanced decomposition of the computation workload and model weights among FPGA computing nodes. This algorithm takes advantage of the high bandwidth and low communication latency between accelerators in the FPGA-Centric cluster and realizes efficient use of the cluster's computing resources. Experimental results show that FPDeep scales nearly linearly on an FPGA-Centric cluster containing 100 FPGAs, and its energy efficiency is 6.4 times higher than that of the most advanced GPU cluster.

(3) Based on the aforementioned message-passing communication model of the FPGA-Centric cluster, this dissertation proposes FP-AMR, a method for accelerating adaptive mesh refinement (AMR) computing. FP-AMR offloads operations involving dynamic data structures, such as particle mapping, mesh refinement, and mesh coarsening, that must be processed by the CPU in the traditional method to the FPGA side for execution. Using the low-latency, high-bandwidth FPGA-to-FPGA communication of the cluster, FP-AMR bypasses the CPU side and directly completes information synchronization between cluster nodes. Since these operations must be performed many times in each time step of the AMR algorithm, FP-AMR can significantly improve overall system performance. Taking the AP3M algorithm in cosmological dynamics simulation as an example, this dissertation shows how to deploy an adaptive mesh refinement algorithm on the FPGA-Centric cluster based on FP-AMR; FP-AMR achieves 21-23 times the performance of traditional multi-core CPU implementations.
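The balanced decomposition in contribution (2) assigns consecutive DNN layers to FPGAs so that no single node becomes a pipeline bottleneck. The dissertation's actual algorithm is not reproduced here, but the core scheduling problem it addresses, splitting an ordered list of per-layer compute costs into contiguous groups while minimizing the most heavily loaded node, can be sketched as follows (a toy illustration under a single scalar cost model; the function name and cost abstraction are assumptions, not FPDeep's interface):

```python
def min_max_load(costs, k):
    """Partition `costs` (per-layer compute costs, kept in layer order)
    into at most k contiguous groups, minimizing the largest group's
    total cost. Returns (peak_load, partitions).

    Uses binary search on the answer: a candidate peak load `cap` is
    feasible if a greedy left-to-right packing needs <= k groups.
    """
    def groups_needed(cap):
        groups, running = 1, 0
        for c in costs:
            if running + c > cap:   # current node full; open a new one
                groups += 1
                running = c
            else:
                running += c
        return groups

    lo, hi = max(costs), sum(costs)  # peak load is bounded by these
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(mid) <= k:
            hi = mid                 # feasible: try a smaller peak
        else:
            lo = mid + 1             # infeasible: peak must grow

    # Recover one concrete layer-to-node assignment meeting the bound.
    parts, current, running = [], [], 0
    for c in costs:
        if running + c > lo:
            parts.append(current)
            current, running = [c], c
        else:
            current.append(c)
            running += c
    parts.append(current)
    return lo, parts
```

For example, spreading layers with costs [4, 1, 3, 2, 5, 2] across three nodes yields groups [4, 1], [3, 2], [5, 2] with a peak load of 7. A contiguous split is essential in this setting: layer parallelism forwards activations only between neighboring nodes, which is what lets the scheme exploit the cluster's direct FPGA-to-FPGA links.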
Keywords/Search Tags: Reconfigurable Computing, High Performance Computing, FPGA-Centric Cluster, Convolutional Neural Network Training, Adaptive Mesh Refinement