Stencil is a fixed computation pattern that updates elements over multi-dimensional data structures. It is widely used in scientific computing and usually constitutes the computation-intensive kernel or hot loop of HPC applications. Jacobi, a typical class of floating-point applications, is one of the most popular and classic types of stencils. It is therefore of great research significance to optimize the performance of stencil applications for specific architectures. There are two parallel computation methods in stencil applications, namely “Cells in Parallel” within the same iteration and “Iterations in Parallel” between different iterations. Most optimization work by researchers focuses on this important performance feature of stencil applications, using both Data-level Parallelism and Thread-level Parallelism.

In this study, we used ARM SVE instructions to implement vectorization and the OpenMP programming model to develop coarse-grained thread-level parallelism for four popular Jacobi stencil codes: 2D5P, 2D9P, 3D7P, and 3D27P. Since stencils are typically memory-bound applications, the programs contain a large number of repetitive and costly memory access operations with relatively low computational density. The performance of SVE is often limited by memory bandwidth, and modifying assembly code with SVE instructions is tedious and complex. At the same time, the threads in simple OpenMP parallel codes switch frequently among CPU cores, which leads to resource competition among threads and a low hit rate in the cores' data caches, introducing additional parallelization overhead. In this paper, the above problems are studied and solved on the domestic processor platform. The main work and innovations include:

(1) This paper designs an optimization strategy for data reuse in multiple dimensions based on Data-level Parallelism using SVE instructions. The strategy reuses the data loaded in two consecutive iterations and replaces frequent memory access operations by manipulating the registers in which the reusable data are stored, which greatly reduces the number of repeated memory accesses and computations, improves the actual computational efficiency of stencils, and achieves a significant performance improvement. For stencil applications with less reusable data, such as 2D5P or 3D7P, this paper chooses loop unrolling in multiple dimensions as the optimization strategy to further expand the range of reusable data, effectively improving the computational density and the performance optimization effect.

(2) Based on the development of coarse-grained Thread-level Parallelism using OpenMP, this paper designs an optimization strategy that binds threads to CPU cores and data blocks in two dimensions. This effectively reduces resource competition among multiple threads and improves the data-cache hit rate, reducing the parallel overhead of simple OpenMP parallel code and yielding a further performance improvement.

(3) Building on the optimization work with Data-level Parallelism and Thread-level Parallelism, this paper extracts and summarizes the optimized codes for different types of stencils, designs a flexible and configurable library of optimized code templates, and implements a complete performance optimization framework on the domestic multi-core processor platform. The framework is implemented as a library function that parameterizes basic performance features of stencil applications as input parameters of its APIs, so that programmers can automatically map stencil codes to the optimized code templates by calling an API directly to generate efficient SVE assembly code.

In this paper, we test and analyze the performance of the different optimization strategies on the new-generation “Phytium” 64-core processor platform. The experiments show that, in the work on Data-level Parallelism, when the vector length is increased from 128 to 2048 bits, the optimized code achieves a maximum performance improvement of 2.88x over the straightforward SVE implementation, 8.91x over the Neon code, and 16.31x over the scalar code; in the work on Thread-level Parallelism, the optimized parallel code using 64 threads achieves a speed-up of up to 2.09x over the simple OpenMP parallel code and up to 34.02x over the serial code, with significant optimization effects.