
Study Of Porting And Optimization Of GTC-P On Large Scale System Using OpenACC

Posted on: 2019-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Wei
GTID: 2428330590467389
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of accelerators such as GPUs, accelerator-based heterogeneous computing has grown in popularity in high performance computing. As cluster architectures become more complex, running an application on different architectures often requires different versions of the code, which poses a great challenge to developers. OpenACC is a directive-based parallel programming model that provides performance on, and portability across, a wide variety of platforms, including GPUs, multicore CPUs, and many-core processors. GTC-P is a discovery-science-capable, real-world application code based on the Particle-In-Cell (PIC) algorithm and is well established in the HPC area. Basic versions of this code have demonstrated performance portability on TOP500 supercomputers with different architectures, including Titan and Mira [1]. It is also included in the benchmark suite of the US Department of Energy's NERSC supercomputing center [2].

We use OpenACC to port and optimize GTC-P, starting from the OpenMP version of the code, and evaluate its performance portability on multiple platforms and on a large-scale system. With further optimization, including data locality optimization, thread mapping optimization, and inserting CUDA code, we achieve a 4.2× speedup over the OpenMP code on a single node. The OpenACC version achieves over 90% of the performance of the CUDA version with only about 300 LOC. We perform a scaling evaluation on Titan with up to 4096 nodes and compare the results with the CUDA version. The evaluation shows that OpenACC still achieves scalability comparable to CUDA on such a large-scale system.

The main contributions of this study are as follows. First, we implement and optimize the first OpenACC version of GTC-P. After further optimization, including data locality, thread mapping, and CUDA optimization, the OpenACC version achieves a 4.2× speedup. We observe that atomic operations have a great impact on performance, and we propose two different optimization methods to reduce the cost of atomics, one for x86 multicore and one for GPU. Second, as far as we know, this is the first time OpenACC has been used to port and evaluate an application on such a large-scale system. We adjust the algorithm to reduce GPU memory usage through redundant computation, which enables us to simulate larger test cases. We scale the OpenACC code to up to 4096 nodes on Titan. OpenACC is shown to deliver impressive productivity and performance with respect to portability and scalability.
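To illustrate why atomic operations matter in a PIC code, the following is a minimal sketch of a charge-deposition (scatter) loop written with OpenACC directives. It is not taken from GTC-P or from the thesis: the function name `deposit_charge`, the 1D grid, the unit particle charge, and the linear weighting are all assumptions chosen for brevity. Several particles can fall in the same cell, so the grid updates must be atomic, and contention on those updates is the kind of cost the abstract refers to.

```c
#include <stddef.h>

/* Hypothetical 1D charge deposition with linear weighting (illustrative
 * only, not the GTC-P kernel). pos[i] is a particle position in grid
 * units; each particle carries unit charge and splits it between the
 * two nearest grid points. Particles outside [0, ngrid - 1) are skipped. */
void deposit_charge(const double *pos, int nparticles,
                    double *rho, int ngrid)
{
    /* Without an OpenACC compiler the pragmas are ignored and the loop
     * runs serially with the same result. */
    #pragma acc parallel loop copyin(pos[0:nparticles]) copy(rho[0:ngrid])
    for (int i = 0; i < nparticles; ++i) {
        int cell = (int)pos[i];              /* left grid point */
        double frac = pos[i] - (double)cell; /* distance past it */

        if (pos[i] < 0.0 || cell + 1 >= ngrid) {
            continue;                        /* out of range: skip */
        }

        /* Many particles may hit the same cell concurrently, so each
         * update is atomic; this is where contention arises. */
        #pragma acc atomic update
        rho[cell] += 1.0 - frac;

        #pragma acc atomic update
        rho[cell + 1] += frac;
    }
}
```

For example, calling `deposit_charge` on positions `{0.25, 0.25, 1.5}` with a zero-initialized grid of four points accumulates two updates into `rho[0]` and three into `rho[1]`. Common ways to reduce such contention include sorting particles by cell or giving each thread a private copy of the grid, but the abstract does not say which two methods the thesis uses.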
Keywords/Search Tags:high performance computing, OpenACC, PIC, parallel computing, CUDA