Research On Programming Models And Optimizations For Petascale CPU-GPU Heterogeneous Computing Systems

Posted on: 2014-03-01   Degree: Doctor   Type: Dissertation
Country: China   Candidate: F Wang   Full Text: PDF
GTID: 1108330479479647   Subject: Computer Science and Technology
Abstract/Summary:
The petascale era has arrived for supercomputer systems, driven by the unceasing demands of scientific computing. The key technologies built for petascale systems are the foundation of future exascale systems, which are not far off. Because of the feature-size limits of CMOS technology and the merely incremental progress in power and cooling technology, a homogeneous architecture built only from CPUs is very hard to scale further once it has reached petascale. A heterogeneous architecture that uses GPUs as accelerators, by contrast, offers better performance per watt, which makes it one of the most promising technologies for building exascale systems. The Tianhe-1A system, built by the National University of Defense Technology for the National Supercomputer Center in Tianjin, uses NVIDIA Fermi GPUs and achieved a sustained 2.566 PFLOPS (10^15 floating-point operations per second), which made it the fastest supercomputer on the TOP500 list of November 2010. This CPU-GPU heterogeneous architecture combines general-purpose and specialized processing elements to provide increasingly powerful systems. However, programming and optimizing for it differ greatly from traditional homogeneous systems and are the key to delivering its computational capability.

Targeting the problems of programming and optimization on heterogeneous systems, this thesis studies programming models and optimization methods for a petascale CPU-GPU heterogeneous supercomputer. Its contributions can be summarized as follows:

1. The MPI/OpenMP/Streaming hybrid programming model is introduced to the petascale CPU-GPU heterogeneous architecture for the first time and eventually applied at full system scale.
Three work partition schemes, namely node-based, CPU-based, and GPU-based partition, are proposed to improve load balance in heterogeneous systems. Furthermore, seven qualitative measures (simple programmability, performance scalability, memory scalability, memory-accessing hierarchy, scheduling flexibility, heterogeneity adaptivity, and focus on node) are proposed to evaluate current programming models for future exascale node-level programming. A programming prototype is also provided to demonstrate how to share a GPU among several processes.

2. An adaptive optimization framework is presented to balance the workload distribution across the GPUs and CPUs in heterogeneous systems. All tasks are put in a queue and executed in sequence. Each task is divided into a CPU part and an accelerator part according to a variable named division-ratio. The initial value of division-ratio is computed from the theoretical peak performance of the CPU and the accelerator. When a task finishes, the measured performance of the CPU and the accelerator is used to update division-ratio, and the updated value is used to partition the next task. This adaptive adjustment achieves good load balance between the host and the accelerator and improves the compute efficiency of the whole heterogeneous system.

3. A recursive double-buffered software pipeline technique based on a finite state machine is proposed to improve computation efficiency on heterogeneous architectures. Every task has three phases: input, execution, and output. Based on an analysis of the execution and cost models of heterogeneous architectures, a recursive double-buffering software pipeline is implemented using a finite state machine. One dedicated CPU thread controls the overlap between the input, execution, and output phases of different tasks.
The experimental results show that the proposed method hides the communication overhead effectively. It also smooths the fluctuating performance of GPU-based libraries. The average performance improvement is 7.61% across several BLAS3 DGEMM experiments with different workloads.

4. Hybrid-LINPACK is proposed and implemented on petascale CPU-GPU heterogeneous systems. First, a heterogeneous BLAS library that suits both CPU and GPU is implemented. Combining this BLAS library with the MPI/OpenMP/Streaming hybrid programming model and with High Performance LINPACK (HPL 2.0) for homogeneous systems, we propose an efficient LINPACK for hybrid heterogeneous systems. The major optimizations include workload balancing between CPU and GPU, improved communication performance between CPU and GPU, parallel SWAP algorithms, transfer optimization between compute nodes, and some traditional HPL parameter-selection techniques. These techniques make relatively full use of the computational power and communication capacity of the given heterogeneous architecture. For a single computation element of Tianhe-1, our implementation achieves 70.1% of peak performance and a speedup of 3.3x over the LINPACK implementation provided by AMD. Tianhe-1 and Tianhe-1A ranked No. 5 (November 2009) and No. 1 (November 2010), respectively, on the TOP500 lists, and both achieved the best results by supercomputers from our country on those lists.
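
The division-ratio scheme in contribution 2 can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not give the update formula, so the sketch assumes that division-ratio is the fraction of each task sent to the accelerator and that the update sets it to the accelerator's share of the measured throughput. All function names, the peak figures, and the simulated rates are hypothetical.

```python
def initial_ratio(cpu_peak, gpu_peak):
    """Accelerator share of a task, derived from theoretical peak rates.

    Assumes both peaks are positive, so the ratio lies strictly in (0, 1).
    """
    return gpu_peak / (cpu_peak + gpu_peak)


def updated_ratio(cpu_work, cpu_time, gpu_work, gpu_time):
    """Accelerator share recomputed from measured throughput (assumed rule)."""
    cpu_rate = cpu_work / cpu_time
    gpu_rate = gpu_work / gpu_time
    return gpu_rate / (cpu_rate + gpu_rate)


def run_queue(task_sizes, cpu_peak, gpu_peak, measure):
    """Process tasks in order, re-splitting each one with the latest ratio.

    `measure(cpu_work, gpu_work)` must return (cpu_time, gpu_time).
    Returns a list of (ratio_used, task_time) pairs, where task_time is
    the slower of the two parts because both parts run concurrently.
    """
    ratio = initial_ratio(cpu_peak, gpu_peak)
    history = []
    for size in task_sizes:
        gpu_work = size * ratio
        cpu_work = size - gpu_work
        cpu_time, gpu_time = measure(cpu_work, gpu_work)
        history.append((ratio, max(cpu_time, gpu_time)))
        ratio = updated_ratio(cpu_work, cpu_time, gpu_work, gpu_time)
    return history


def simulated_measure(cpu_work, gpu_work):
    """Stand-in for real timing: hypothetical sustained rates below peak."""
    cpu_actual_rate = 80.0   # work units per second (made-up value)
    gpu_actual_rate = 320.0  # work units per second (made-up value)
    return cpu_work / cpu_actual_rate, gpu_work / gpu_actual_rate


if __name__ == "__main__":
    for ratio, task_time in run_queue([1000.0] * 3, 100.0, 900.0, simulated_measure):
        print(f"division-ratio={ratio:.3f}  task time={task_time:.4f}s")
```

In this simulated setup the peak-based starting ratio overloads the accelerator on the first task, and the measured rates correct the split for the tasks that follow. A real system would replace `simulated_measure` with wall-clock timing of the CPU and GPU kernels and might damp the update to tolerate noisy measurements.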
Keywords/Search Tags: Heterogeneous parallel system, GPU, Petascale, Programming Model, Parallel Optimization