| With the development of computer microprocessors towards many-core processors and the continuous emergence of large-scale clusters, hybrid parallel programming based on heterogeneous platforms will become the mainstream in large-scale parallel applications. On multi-core systems, traditional parallel programming technology cannot be applied efficiently. Given the special architecture of multi-core clusters, it is of general significance to study the corresponding programming models and parallel programming techniques in order to achieve higher performance. In addition, for supercomputing users, the programming workload caused by migrating programs between different architectures is huge, which directly affects work efficiency. Application optimization on the Sunway TaihuLight supercomputer shows that the coding patterns used in program porting are strongly template-like, so the basic code can be generated automatically by summarizing a formatted template and applying code conversion technology. This thesis makes the following contributions to the implementation and application of hybrid parallel programming models under different architectures. 1. The architecture of the Sunway heterogeneous many-core processor and the slave-core code suited to it are studied in detail. Several optimization methods are proposed, such as passing parameters through a structure, local static variables, and slave-core partition parallelism, and several interface functions are studied, such as the inter-slave-core communication interface function and the master-slave asynchronous hybrid parallel interface function. Experimental comparison verifies that these optimization methods and interface functions substantially improve program performance.
2. To address the difficulty of Athread coding and to improve the efficiency of many-core coding in the Sunway heterogeneous many-core environment, an Athread code generation tool that automatically converts serial kernels into Athread parallel code was designed and developed. The tool builds on a three-tier program template, in which the main program calls the master program and the master program in turn calls the slave program, and uses the Rust language for lexical and syntactic analysis. On this basis, a method that automatically converts a source program into Athread-format code is proposed, with several useful optimization methods integrated. Interface functions that help program optimization are also added to further improve performance and reduce the coding workload of program porting. Finally, a prototype tool that converts Fortran and C code to Athread code is designed and implemented. Experimental results show that the Athread code produced by the generation tool achieves a higher speedup than the OpenACC* accelerated program. In particular, for multiple kernels the speedup can be about 15%, which shows that the generation tool is valuable in practical applications. This method avoids the vast majority of coding errors and greatly improves the efficiency of many-core development for researchers.
3. To improve the execution efficiency of programs that use hybrid parallel programming models on the Shanhe supercomputing platform, several common implementations of hybrid parallel programming models and the corresponding optimization methods for this platform are first described. The CPU architecture used by the cluster is then analyzed, and several hybrid parallel running modes suited to the platform are proposed. Using this platform as the test bed, two typical benchmark programs, one computing-intensive and one communication-intensive, are tested with different combinations of processes and threads at scales of thousands and tens of thousands of cores. The results show that the computing-intensive program with irregular memory access achieves its best execution time with 8 MPI processes per node and 7 threads per process, about 20% better than with MPI alone. The communication-intensive program with irregular memory access achieves its best execution time with 28 MPI processes per node and 2 threads per process, about 10% better than with MPI alone. These hybrid parallel running modes can serve as a useful reference for users of the Shanhe supercomputing platform. |
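
The process and thread combinations compared in item 3 can be enumerated with a short sketch. It assumes 56 cores per node, a figure inferred here from the two reported best layouts (8 × 7 and 28 × 2) rather than stated in the abstract. The names `hybrid_layouts` and `CORES_PER_NODE` are hypothetical and used only for illustration.

```python
def hybrid_layouts(cores_per_node):
    """Return (mpi_processes_per_node, threads_per_process) pairs
    whose product exactly fills one node."""
    return [
        (procs, cores_per_node // procs)
        for procs in range(1, cores_per_node + 1)
        if cores_per_node % procs == 0
    ]


# Assumption: 56 cores per node, inferred from 8 x 7 and 28 x 2.
CORES_PER_NODE = 56

for procs, threads in hybrid_layouts(CORES_PER_NODE):
    print(f"{procs:2d} MPI processes x {threads:2d} threads per process")
```

Under this assumption, the pure MPI baseline corresponds to the layout with one thread per process, and the two best-performing layouts reported above, (8, 7) and (28, 2), are among the candidates listed.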