
Task-Graph Parallelization and Optimization of the tend_lin Application on the Sunway TaihuLight Supercomputer

Posted on: 2019-01-04; Degree: Master; Type: Thesis
Country: China; Candidate: Q Guo; Full Text: PDF
GTID: 2428330578970591; Subject: Computer system architecture
Abstract/Summary:
In the TOP500 ranking of the world's supercomputers, the Sunway TaihuLight, the latest generation of China's independently developed supercomputers, has taken first place four consecutive times and was the first supercomputer with a peak performance above 100 PFLOPS. Its SW26010 heterogeneous many-core processor integrates four core groups with 260 cores in total; each core group contains one management processing element (MPE, the master core) and 64 computing processing elements (CPEs, the slave cores). An application must be ported to the slave cores to exploit the full performance of the machine. Sunway many-core processors support two programming approaches: the many-core programming model Sunway OpenACC and the accelerated thread library (Athread).

The Atmospheric General Circulation Model (AGCM) of the Chinese Academy of Sciences is the most complex component of the earth system model. Solving physical problems described by partial differential equations is a typical HPC workload. AGCM consists of two parts, the physical processes and the dynamical framework. The dynamical framework accounts for most of the computation, and the adaptation process (tend_lin) is its computational hot spot. To explore task-graph parallelism in grid applications, the compiler group of the Institute of Computing Technology, Chinese Academy of Sciences, developed a task-graph parallel scheduling system for grid applications and ported it to the Sunway TaihuLight to support task-graph parallelization on the domestic many-core processor. This thesis studies task-graph parallelization of the tend_lin application on the Sunway TaihuLight.

First, we port and optimize tend_lin with Sunway OpenACC, offloading the parallelizable computation loops to the slave cores of the Sunway processor through OpenACC compiler directives. During porting, data transfer relies on four data-management methods, including transfer by loop index, whole-array transfer, and transposed transfer. For performance, we eliminate discrete direct accesses to main memory through global loads (GLD) and replace them with bulk DMA accesses. For correctness verification, the results of each computation loop are written to files and compared one by one.

Second, we parallelize tend_lin as a DAG (task graph) on top of Athread and choose a heterogeneous mapping for each kind of task. Computation loops of the stencil stage, whose data footprints are relatively large, are mapped to slave-core tasks; polar computation, smoothing filtering, and other loops with relatively small data footprints are mapped to master-core tasks. The slave-core code is derived by transforming array declarations, loop lower and upper bounds, and array references. DMA transfers are implemented by calling the athread_get/athread_put interfaces of the Athread library to move data in bulk. The local store of each slave core is 64 KB. The data footprint of a stencil-stage computation loop can exceed 64 KB, so a secondary partition is applied along the second-highest dimension. In the task registration part, arrays are registered with globally consistent addresses; the DMA transfers are performed in the task execution part. To address load imbalance among the computation loops, we change the affinity strategy. Originally, each for-task group had its thread affinity set automatically in cyclic(1) round-robin order over threads 0 to num_threads-1. In the modified strategy, all for tasks are chained together and affinity is assigned continuously across for-task groups.

Finally, the performance of the two parallel methods is tested. The results show that the Athread-based task-graph version outperforms the OpenACC version, and that task-graph parallelization can effectively improve the performance of the program.
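As a rough illustration of the secondary partition along the second-highest dimension, the C sketch below processes one plane of shape [nj][ni] in row tiles small enough for a 64 KB local store. It is only a sketch under stated assumptions, not the thesis code: the names rows_per_tile and scale_plane_tiled are hypothetical, half of the 64 KB is assumed to be available for two buffers, a pointwise scaling replaces the real stencil (so no halo rows are handled), and memcpy stands in for the athread_get/athread_put DMA calls so the example runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LDM_BYTES   (64u * 1024u)      /* local store size stated in the abstract */
#define TILE_BUDGET (LDM_BYTES / 2u)   /* assumption: half is usable for tile buffers */
#define BUF_DOUBLES (TILE_BUDGET / 2u / sizeof(double))

/* Rows of length ni that fit in `budget` bytes when `buffers` equally sized
 * double buffers must be resident at once. Returns 0 if one row does not fit. */
size_t rows_per_tile(size_t ni, size_t buffers, size_t budget)
{
    size_t row_bytes = ni * sizeof(double) * buffers;
    return row_bytes == 0 ? 0 : budget / row_bytes;
}

/* Processes one plane in[nj][ni] -> out[nj][ni], splitting the j dimension
 * (the second-highest dimension of a [k][j][i] array) into tiles.
 * Returns 0 on success, -1 if a single row exceeds the buffer budget. */
int scale_plane_tiled(const double *in, double *out,
                      size_t nj, size_t ni, double factor)
{
    /* On the real machine these buffers would live in the slave core's local store. */
    static double buf_in[BUF_DOUBLES];
    static double buf_out[BUF_DOUBLES];

    size_t tile = rows_per_tile(ni, 2, TILE_BUDGET);
    if (tile == 0)
        return -1;

    for (size_t j0 = 0; j0 < nj; j0 += tile) {
        size_t rows = (nj - j0 < tile) ? nj - j0 : tile;
        size_t count = rows * ni;
        assert(count <= BUF_DOUBLES);

        /* Stand-in for a bulk DMA read (athread_get on the real platform). */
        memcpy(buf_in, in + j0 * ni, count * sizeof(double));

        for (size_t n = 0; n < count; ++n)
            buf_out[n] = factor * buf_in[n];

        /* Stand-in for a bulk DMA write (athread_put on the real platform). */
        memcpy(out + j0 * ni, buf_out, count * sizeof(double));
    }
    return 0;
}
```

With ni = 128, one row of both buffers takes 2048 bytes, so rows_per_tile returns 16 and a plane with nj = 40 is handled in tiles of 16, 16, and 8 rows. A real stencil would additionally need halo rows on each tile.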
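The change in affinity strategy can be illustrated with a small counting model. The C sketch below is one reading of the abstract, not the scheduler developed in the thesis: it assumes every for task costs one unit of work, interprets "continuous" affinity as letting each for-task group continue from the thread where the previous group stopped, and uses hypothetical function names.

```c
#include <assert.h>
#include <stddef.h>

/* Original strategy as described: every for-task group restarts at thread 0
 * and assigns its tasks cyclic(1), so task i of a group goes to thread
 * i % num_threads. load[t] receives the number of tasks given to thread t. */
void assign_cyclic_per_group(const int *group_sizes, size_t num_groups,
                             int num_threads, int *load)
{
    assert(num_threads > 0);
    for (int t = 0; t < num_threads; ++t)
        load[t] = 0;
    for (size_t g = 0; g < num_groups; ++g)
        for (int i = 0; i < group_sizes[g]; ++i)
            load[i % num_threads] += 1;
}

/* Modified strategy (assumed reading): all for tasks form one chain, and each
 * group continues the round-robin where the previous group stopped. */
void assign_continuous(const int *group_sizes, size_t num_groups,
                       int num_threads, int *load)
{
    assert(num_threads > 0);
    int next = 0;
    for (int t = 0; t < num_threads; ++t)
        load[t] = 0;
    for (size_t g = 0; g < num_groups; ++g)
        for (int i = 0; i < group_sizes[g]; ++i) {
            load[next] += 1;
            next = (next + 1) % num_threads;
        }
}

/* Largest per-thread task count; a proxy for the finish time of the slowest thread. */
int max_load(const int *load, int num_threads)
{
    int m = 0;
    for (int t = 0; t < num_threads; ++t)
        if (load[t] > m)
            m = load[t];
    return m;
}
```

For four groups of three tasks on four threads, the per-group strategy gives threads 0 to 2 four tasks each and leaves thread 3 idle, while the continuous strategy gives every thread three tasks. Under the unit-cost assumption, this shows how restarting at thread 0 for every group can cause imbalance when group sizes are not multiples of num_threads.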
Keywords/Search Tags: Sunway TaihuLight supercomputer, atmospheric general circulation model, task graph parallelization, DMA transfer, porting