
Parallel Implementation And Performance Optimization For Refactoring GROMACS On The Sunway Many-core Architecture

Posted on: 2019-09-03  Degree: Master  Type: Thesis
Country: China  Candidate: Y Yu  Full Text: PDF
GTID: 2428330542494231  Subject: Computer system architecture
Abstract/Summary:
With the rapid development of many-core processor microarchitecture, more and more computing and storage resources are integrated on a single chip, which increases the complexity of the processor architecture. These plentiful on-chip resources and diverse hardware structures make it increasingly difficult for the HPC community to port applications to many-core processors. The Sunway TaihuLight system, which has ranked as the fastest supercomputer in the world, is equipped with the domestic heterogeneous many-core processor SW26010. This processor is based on a master-slave structure and contains a total of 260 heterogeneous cores; the peak performance of each chip reaches 3.06 TFlops. To utilize the plentiful resources of Sunway TaihuLight, existing applications developed for commercial platforms have to be refactored and optimized to fit the Sunway many-core architecture. GROMACS is one of the most popular open-source software packages for molecular dynamics (MD) simulation; it is efficient and widely used in new material design, chemical process simulation and biomedicine. In this thesis, we refactor and optimize GROMACS on the Sunway TaihuLight system. By solving a series of challenging problems that arise during parallelization and optimization on the Sunway many-core architecture, we make full use of the computing resources and provide guidance for improving both the application level and the system architecture of the domestic many-core processor. The main contributions of this thesis are as follows.

(1) To fit the Sunway many-core architecture, we adopt a medium-grained task partition strategy and a suitable parallel scheme for the compute-intensive kernel. By exploiting the parallelism between the master core and the slave cores, we implement a task-level parallel mode based on a three-stage pipeline, which resolves the load imbalance and data dependency problems exposed during parallelization of the compute-intensive kernel without introducing additional execution time.

(2) To overcome the challenging bandwidth limitation and make full use of the computing resources, we introduce successive optimization strategies, including efficient use of the scratchpad memory, DMA, a software-emulated cache and a hybrid parallel algorithm. By exploiting the locality of memory accesses and the parallelism between slave cores, we reuse the runtime data of the compute-intensive kernel and effectively hide the memory access overhead of the slave cores. The implementation and benefits of each optimization strategy are analyzed in detail.

(3) We compare the refactored GROMACS code, which uses both the MPE (master core) and the CPE (slave core) cluster, with the official serial GROMACS code running on the MPE. Using 64 CPEs, we achieve up to a 27x speedup for the main compute-intensive kernel and a 6x speedup for the whole application within a single core group (CG). At large node scale, we achieve approximately a 2x improvement in the peak simulation performance of GROMACS.
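
The software-emulated cache named in contribution (2) can be illustrated with a minimal sketch in C. This is not the thesis implementation: the direct-mapped layout, the names `sw_cache`, `sw_cache_init` and `sw_cache_read`, and the sizes `LINE_ELEMS` and `NUM_LINES` are all assumptions chosen for illustration, and a plain `memcpy` stands in for the DMA transfer from main memory into a slave core's scratchpad.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_ELEMS 8   /* elements per cache line (assumed size) */
#define NUM_LINES  16  /* lines held in the scratchpad (assumed size) */

/* Hypothetical direct-mapped software cache over a read-only float array. */
typedef struct {
    float data[NUM_LINES][LINE_ELEMS]; /* stands in for scratchpad storage */
    long tag[NUM_LINES];               /* main-memory line in each slot, -1 if empty */
    const float *main_mem;             /* backing array; length must be a multiple of LINE_ELEMS */
    unsigned long hits, misses;
} sw_cache;

static void sw_cache_init(sw_cache *c, const float *main_mem) {
    assert(main_mem != NULL);
    for (int i = 0; i < NUM_LINES; i++) {
        c->tag[i] = -1;
    }
    c->main_mem = main_mem;
    c->hits = 0;
    c->misses = 0;
}

/* Return main_mem[index], fetching the whole enclosing line on a miss. */
static float sw_cache_read(sw_cache *c, size_t index) {
    long line = (long)(index / LINE_ELEMS);
    int slot = (int)(line % NUM_LINES);
    if (c->tag[slot] != line) {
        /* Miss: copy one full line. On real hardware this would be a DMA get. */
        memcpy(c->data[slot], c->main_mem + (size_t)line * LINE_ELEMS,
               sizeof c->data[slot]);
        c->tag[slot] = line;
        c->misses++;
    } else {
        c->hits++;
    }
    return c->data[slot][index % LINE_ELEMS];
}
```

With these assumed sizes, reading indices 3 and then 5 costs one line fetch followed by a hit, because both fall in the same line. Reading index 131 afterwards maps to the same slot and evicts that line. The sketch shows why locality of access matters: each hit avoids a separate main-memory transfer.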
Keywords/Search Tags:Parallel implementation, Performance optimization, GROMACS, Memory bandwidth competition, Sunway TaihuLight system