| The von Neumann architecture remains the mainstream design in today's computer systems. The memory-access "bottleneck" is still insurmountable, and the operations related to it can be divided into sparse-matrix operations and Stencil computations. In this study, we investigate these two scenarios in depth on the Sunway heterogeneous many-core architecture and optimize the existing algorithms. Our main contribution is the development of new techniques for improved performance, as follows. First, we propose a method that combines the blocked Jacobi and Cholesky algorithms. The blocking approach geometrically decouples the matrix into blocks and completely eliminates the data dependencies between the parallel matrix iterations. Additionally, we use an RMA-based double-cache many-core optimization mechanism for matrix multiplications on our Sunway system. When the matrix dimension reaches 2,354,928, experiments using OpenFOAM benchmarks demonstrate that our solution solves the problem 9.15 times faster than the Incomplete Cholesky decomposition method and 8.3 times faster than the Geometric-Algebraic Multi-Grid method. Second, after extensive research on memory access in unstructured grids, we develop a discrete-memory-access optimization algorithm for our Sunway systems. We also use a message-queue technique and the on-chip communication mechanism of the slave cores to increase performance. To further enhance memory-access performance, a non-blocking data allocation approach is also implemented. The results demonstrate that our method achieves an average memory bandwidth of 70% of the theoretical value, and that the target discrete memory-access process in various kernels is accelerated by a factor of up to 45, with an average of 10. Our method also demonstrates its adaptability and robustness across a range of domains and applications. Finally, we combine our program with the characteristics of Stencil computation and conduct a thorough analysis of the Stencil application on Sunway
systems, which includes: (1) an adaptive four-level parallel framework based on the Sunway architecture using master-slave merging, which according to test results outperforms the conventional three-level framework in terms of memory bandwidth and speeds up the master-slave process by a factor of 12 to 65; (2) a partial-block parallelism and dynamic cache scheduling algorithm based on the Sunway on-chip communication mechanism via RMA, which achieves high utilization of space and time resources; and (3) a mixed-precision method combining half, single, and double precision, which improves overall performance while validating the results. Together, these three advances deliver a 7.53-times speedup of the overall application with 70.58% of the parallelized program optimized, a 6.8-times lead over the ideal, and achieve 99.29% parallel efficiency on 27,988,480 cores in the global 500-metre-resolution case on our new-generation Sunway system. |
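
The mixed-precision idea in item (3) can be illustrated with a minimal sketch. It is not the implementation used on Sunway: the 3-point averaging stencil, the helper names (`round_to`, `stencil_step`, `run`, `choose_precision`), and the tolerance-based selection rule are all assumptions made for illustration. Reduced precision is emulated by rounding stored values through Python's `struct` formats (`'e'` for half, `'f'` for single, `'d'` for double), and the result is validated against a double-precision reference.

```python
import struct


def round_to(value, fmt):
    """Round a Python float (a double) to the storage precision named by fmt:
    'e' = IEEE half, 'f' = IEEE single, 'd' = double (unchanged)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def stencil_step(u, fmt="d"):
    """One sweep of a 3-point averaging stencil (hypothetical kernel).
    Arithmetic runs in double; only the stored result is rounded to fmt."""
    out = list(u)  # the two boundary values are kept fixed
    for i in range(1, len(u) - 1):
        out[i] = round_to(0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1], fmt)
    return out


def run(u0, steps, fmt):
    """Apply `steps` sweeps with every stored value held in precision fmt."""
    u = [round_to(x, fmt) for x in u0]
    for _ in range(steps):
        u = stencil_step(u, fmt)
    return u


def max_abs_error(a, b):
    """Largest pointwise difference between two grids of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


def choose_precision(u0, steps, tol):
    """Validation step: return the cheapest format whose result stays
    within tol of the double-precision reference, else fall back to 'd'."""
    reference = run(u0, steps, "d")
    for fmt in ("e", "f"):
        if max_abs_error(run(u0, steps, fmt), reference) <= tol:
            return fmt
    return "d"


# A unit spike in the middle of a 17-point grid, smoothed for 20 sweeps.
u0 = [0.0] * 8 + [1.0] + [0.0] * 8
ref = run(u0, 20, "d")
err_half = max_abs_error(run(u0, 20, "e"), ref)
err_single = max_abs_error(run(u0, 20, "f"), ref)
print("half error:", err_half)
print("single error:", err_single)
print("chosen format for tol=1e-6:", choose_precision(u0, 20, 1e-6))
```

The selection rule trades accuracy for storage: a loose tolerance allows the cheapest format, and a tight one forces a fallback towards double precision. A production scheme would more likely choose the precision per kernel or per field rather than for a whole run.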