
Research On The Key Techniques Of Large Scale Parallel Computing On Accelerator-based Heterogeneous Systems For Applications

Posted on: 2015-04-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Chai
Full Text: PDF
GTID: 1108330479979656
Subject: Computer Science and Technology
Abstract/Summary:
Large-scale scientific and engineering computing has become an essential method in modern scientific research. Supercomputers have entered the era of petascale floating-point performance, yet a series of challenging real-world applications already demand exascale computing power. Because accelerators such as GPUs and MICs offer better performance per watt, building accelerator-based heterogeneous supercomputers, such as the GPU-enhanced Tianhe-1A and the MIC-enhanced Tianhe-2, has become an important trend in HPC on the path from petascale to exascale. Domain application software is what allows the computing capability of exascale systems to be exploited. How to develop real-world domain applications quickly while exploiting the power of heterogeneous systems efficiently has therefore become a major challenge for the parallel computing discipline. Although China is able to build excellent supercomputers, its ability to develop domain application software remains at a relatively low level, far behind that of advanced countries.

To combine research and engineering, this paper selects three representative real-world large-scale scientific and engineering computing applications, with Tianhe-1A and Tianhe-2 as the research platforms. From the perspective of application software development, we study the key technologies of massively parallel computing on GPU/MIC-enhanced heterogeneous systems. Our work is summarized as follows:

1. No existing parallel implementation of Bayesian phylogenetic inference is capable of simultaneously and fully utilizing both CPUs and GPUs for heterogeneous computation.
To solve this problem, this paper presents a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, called oMC3, which is based on MrBayes and combines the MPI, OpenMP, and CUDA programming models. The novelty of the algorithm is its ability to use CPU cores simultaneously with GPUs for the computations, while a simple but effective workload division scheme ensures a fair split of work between the two types of hardware. Numerical experiments on Tianhe-1A show improved performance and good scalability for oMC3. This is the first time that MrBayes has been scaled to thousands of CPU cores and hundreds of GPUs, which provides a reference for studying MrBayes on large-scale heterogeneous systems. The work is of general interest because the hybrid programming techniques it discusses can be adopted by other bioinformatics applications.

2. No quantitative study currently exists of the performance of GPU clusters with more than 20 GPUs when applied to cardiac electrophysiology. This paper studies the applicability of GPU clusters to high-resolution simulations of cardiac electrophysiology. Using two cell models and two numerical ODE solvers, we design and implement tissue-level cardiac simulations on GPU clusters. The overall parallel strategy is based on domain decomposition of the 3D data grid and uses MPI+CUDA hybrid programming. The upper level is multi-process parallelism between nodes, where the CPU is responsible for inter-node MPI communication. The lower level is multi-threaded parallelism across the many CUDA cores of each GPU. The GPU kernel implementation of the PDE-ODE solver takes into account the thread granularity on the GPU and the locality of memory accesses. Experiments were carried out on Tianhe-1A, using up to 128 GPUs.
We quantitatively analyze the attainable computational capacity of GPU clusters for three different combinations of models and solution methods. We believe that this investigation, the first to use more than one hundred GPUs for such computations, can provide cardiac researchers with a realistic estimate of the actual capability of GPU-enhanced clusters.

3. Numerical simulation of subcellular Ca2+ dynamics with a resolution down to one nanometer requires enormous computational power, which has so far made such simulations prohibitive. This paper presents a solution for simulating Ca2+ dynamics toward nanometer resolution on a CPU-MIC heterogeneous supercomputer (Tianhe-2) and obtains real simulation results. Multi-level parallelism is achieved through domain decomposition of the 3D spatial grid. To overcome the challenge of programming Intel's new MIC architecture, we adopt a series of optimizations such as SIMD vectorization, cache blocking, and register reuse. The data subdomain within one node is divided into a boundary region and an internal region, which are processed in a pipelined fashion to coordinate the CPU and the MIC effectively. Using up to 12288 Intel MIC coprocessors (4096 compute nodes) on Tianhe-2, we achieve 1.27 Pflop/s in double precision with good scalability, which brings us much closer to nanometer resolution. We also obtain results from simulations of sarcomeres in healthy cells for a simulated time of 24 ms at 3 nm resolution, providing a basis for biomedical research.

4. Having multiple MICs within one compute node, compared with one coprocessor per node, presents an additional challenge to using supercomputers based on Intel's many-integrated-core architecture. This paper presents a novel framework, called MOCS, to facilitate programming heterogeneous systems with multiple MICs per node for applications with a stencil computing pattern on structured grids.
MOCS consists of an abstract framework for the hybrid programming model, performance tuning strategies for load balancing and communication optimization, and specific development steps. Two low-level APIs of the Intel MIC software stack are adopted: COI and SCIF. In MOCS, the CPUs act as the communication center and the multiple MICs act as the computing center. Hybrid programming with MPI+OpenMP+COI+SCIF is used, and load balancing between the host CPUs and the multiple MICs is taken into account. A hierarchical pipeline strategy is presented to improve inter-/intra-node communication. In experiments on Tianhe-2, we use MOCS to guide the programming and quantify its benefits when solving a real-world 3D reaction-diffusion problem that consists of 7-point stencil computations and additional numerical operations. The results show that MOCS can exploit the multi-level parallelism in clusters with multiple MICs per node, hide inter-/intra-node communication cost, and efficiently engage CPUs and MICs in computing simultaneously. MOCS also provides a reference for programming other HPC applications and systems with other accelerators, such as clusters with multiple GPUs per node.

In summary, based on real-world large-scale applications, this paper presents several solutions for programming GPU/MIC-enhanced heterogeneous systems efficiently, which are validated on the Tianhe-1A and Tianhe-2 supercomputers. Our work may be of theoretical significance and practical value for research on large-scale heterogeneous parallel computing and for domain scientific research.
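
The 7-point stencil and the split of a subdomain into a boundary region and an internal region, both mentioned in the abstract, can be illustrated with a small sketch. It is not the dissertation's implementation, which uses MPI, OpenMP, COI and SCIF on Tianhe-2. It is plain single-process Python, and the function names, the explicit diffusion update, and the coefficient `alpha` are assumptions chosen for illustration. The sketch also assumes one common arrangement: boundary cells are updated first, so that in a real code their values could be exchanged with neighbours while the internal cells are still being computed.

```python
def make_grid(n, fill=0.0):
    """n x n x n block; index 0 and n-1 in each dimension form a halo layer."""
    return [[[fill for _ in range(n)] for _ in range(n)] for _ in range(n)]


def laplacian_7pt(u, i, j, k):
    """7-point stencil: the six face neighbours minus six times the centre."""
    return (u[i - 1][j][k] + u[i + 1][j][k]
            + u[i][j - 1][k] + u[i][j + 1][k]
            + u[i][j][k - 1] + u[i][j][k + 1]
            - 6.0 * u[i][j][k])


def is_boundary(cell, n):
    """An owned cell is a boundary cell if it touches the halo layer."""
    return any(c == 1 or c == n - 2 for c in cell)


def update(u, cells, alpha, out):
    """Explicit diffusion step (an assumed update rule) on the given cells."""
    for i, j, k in cells:
        out[i][j][k] = u[i][j][k] + alpha * laplacian_7pt(u, i, j, k)


def diffuse_split(u, alpha):
    """One time step, computed as a boundary pass followed by an internal pass."""
    n = len(u)
    out = [[row[:] for row in plane] for plane in u]  # copy; halo stays as is
    owned = range(1, n - 1)
    cells = [(i, j, k) for i in owned for j in owned for k in owned]
    boundary = [c for c in cells if is_boundary(c, n)]
    internal = [c for c in cells if not is_boundary(c, n)]
    update(u, boundary, alpha, out)  # in a real code: start halo exchange here
    update(u, internal, alpha, out)  # the bulk of the work overlaps the exchange
    return out, len(boundary), len(internal)


if __name__ == "__main__":
    grid = make_grid(6)
    grid[2][2][2] = 1.0  # single point source
    new_grid, n_boundary, n_internal = diffuse_split(grid, 0.125)
    print(n_boundary, n_internal, new_grid[2][2][2])
```

Both passes read only the old grid `u` and write to `out`, so the result does not depend on the order of the two passes. That independence is what makes it possible to overlap the internal computation with communication of the boundary values.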
Keywords/Search Tags: GPU/MIC heterogeneous system, Large-scale parallel computing, Hybrid programming model, Tianhe supercomputer series, Bayesian inference, Simulation of cardiac electrophysiology