Parallel Implementation And Performance Optimization For Refactoring GROMACS On The Sunway Many-core Architecture

Posted on:2019-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Yu

GTID:2428330542494231

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of many-core processor microarchitecture, more and more computing and storage resources are integrated on chip, which increases the complexity of microprocessor architecture. These plentiful on-chip resources and diverse hardware structures make it increasingly difficult for the HPC community to port applications onto many-core processors. The Sunway TaihuLight system, the fastest supercomputer in the world, is equipped with the domestic heterogeneous many-core processor SW26010. This processor is based on a master-slave structure and contains a total of 260 heterogeneous cores. Each chip has a peak performance of 3.06 TFlops. To utilize the plentiful resources of the Sunway TaihuLight system, existing applications on commercial platforms have to be refactored and optimized to fit the specific Sunway many-core architecture.

GROMACS is one of the most popular open-source software packages for molecular dynamics (MD) simulation. It is efficient and widely used in new material design, chemical process simulation and biological medicine. In this thesis, we refactor and optimize GROMACS on the Sunway TaihuLight system. By solving a series of challenging problems that arise during parallelization and optimization on the Sunway many-core architecture, we make full use of the computing resources and provide guidance for improving both applications and the system architecture of the domestic many-core processor. The main work and contributions of this thesis are as follows.

(1) To fit the specific Sunway many-core architecture, we adopt a medium-grained task partition strategy and a suitable parallel scheme for the compute-intensive kernel. By exploiting the parallelism between the master core and the slave cores, we implement a task-level parallel mode based on a three-stage pipeline. This solves the load imbalance and data dependency problems exposed during parallelization of the compute-intensive kernel without introducing additional execution time.

(2) To overcome the memory bandwidth limitation and make full use of the computing resources, we introduce successive optimization strategies, including efficient use of the scratchpad memory, DMA, a software-emulated cache and a hybrid parallel algorithm. By exploiting the locality of memory access and the parallelism between slave cores, we reuse the runtime data of the compute-intensive kernel and effectively hide the memory access overhead of the slave cores. We present a detailed analysis of the implementation and benefits of each optimization strategy.

(3) We compare the refactored GROMACS code, which uses both the management processing element (MPE) and the computing processing element (CPE) cluster, with the official serial GROMACS code running on the MPE. Using 64 CPEs, we achieve up to a 27x speedup for the main compute-intensive kernel and a 6x speedup for the whole application within a single core group (CG). At large node counts, we achieve approximately a 2x improvement in the peak simulation performance of GROMACS.
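The software-emulated cache mentioned in contribution (2) can be illustrated with a small model. The sketch below is not the thesis implementation: it is a minimal direct-mapped read cache in plain Python, where a list stands in for main memory, a list slice stands in for a DMA transfer into scratchpad memory, and the names `SoftwareCache`, `LINE_SIZE` and `NUM_LINES` are hypothetical. It only shows why repeated accesses with good locality need fewer bulk transfers than one transfer per element.

```python
from dataclasses import dataclass, field
from typing import List

LINE_SIZE = 8   # elements fetched per simulated DMA transfer (assumed value)
NUM_LINES = 4   # cache lines held in the simulated scratchpad (assumed value)


@dataclass
class SoftwareCache:
    """Direct-mapped read cache over a list that models main memory."""
    main_memory: List[float]
    tags: List[int] = field(default_factory=lambda: [-1] * NUM_LINES)
    lines: List[List[float]] = field(
        default_factory=lambda: [[] for _ in range(NUM_LINES)]
    )
    dma_transfers: int = 0  # number of simulated bulk fetches so far

    def read(self, index: int) -> float:
        line_id = index // LINE_SIZE   # which memory line holds the element
        slot = line_id % NUM_LINES     # direct-mapped: one possible slot per line
        if self.tags[slot] != line_id:
            # Miss: fetch the whole line in one bulk copy (stand-in for DMA).
            start = line_id * LINE_SIZE
            self.lines[slot] = self.main_memory[start:start + LINE_SIZE]
            self.tags[slot] = line_id
            self.dma_transfers += 1
        # Hit (or freshly filled line): serve the element from the scratchpad copy.
        return self.lines[slot][index % LINE_SIZE]


def sum_through_cache(cache: SoftwareCache, indices: List[int]) -> float:
    """Accumulate values at the given indices, reading through the cache."""
    return sum(cache.read(i) for i in indices)


if __name__ == "__main__":
    memory = [float(i) for i in range(64)]
    cache = SoftwareCache(memory)
    # Neighbouring indices fall in the same line and share one transfer.
    total = sum_through_cache(cache, [0, 1, 2, 3, 8, 9, 10, 11])
    print(total, cache.dma_transfers)
```

In this model, the eight reads in the demo touch only two lines, so they cost two simulated transfers instead of eight. A real scratchpad cache on the slave cores would also have to handle write-back and line sizing against the available scratchpad capacity, which this sketch leaves out.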

Keywords/Search Tags:

Parallel implementation, Performance optimization, GROMACS, Memory bandwidth competition, Sunway TaihuLight system

Related items

1	The Research Of High Performance Algorithm For GROMACS Based On Sunway TaihuLight
2	Design And Implementation Of Heterogeneous Parallel Algorithms On The Sunway Taihulight
3	Parallel Deep Learning Training System On Sunway TaihuLight
4	The Design And Optimization Of High-performance Molecular Dynamics Algorithms On The Sunway TaihuLight Supercomputer
5	Research On Directive-based Parallel Language For Sunway Taihulight Supercomputer And Design Of The Compiling Optimization
6	Porting And Optimizing GTC-P Code On Sunway TaihuLight Supercomputer
7	An Accelerated Ray Tracing Algorithm For The Sunway Taihulight
8	I/O Resource Monitoring And Diagnosis System For The Sunway TaihuLight
9	Parallel Algorithm Analysis And Optimization Of Plasma Structure Preserving Large-scale Simulation On Sunway Platform
10	Porting And Optimization Of OpenFOAM On The Sunway Taihulight Supercomputer