
Parallel Design And Optimization Of SpMV On ARM Multi-core Platform

Posted on: 2022-10-02  Degree: Master  Type: Thesis
Country: China  Candidate: Y F Zhang  Full Text: PDF
GTID: 2518306731487884  Subject: Computer Science and Technology
Abstract/Summary:
Sparse matrix-vector multiplication (SpMV) is one of the core subroutines in numerical computation. The solution of large-scale systems of linear equations is one of its major applications, and the solution is usually obtained by an iterative method. SpMV, a key step in solving systems of linear equations, may be performed thousands of times during the solution process. However, the complexity of the underlying hardware and the load imbalance caused by the sparsity of the matrices can lead to memory bottlenecks, making it challenging to optimize SpMV performance.

ARMv8-A is an architecture for high-performance computing introduced by ARM. An increasing number of HPC researchers and companies have been drawn to the ARMv8-A architecture because it supports 64-bit instruction sets, which improves double-precision floating-point arithmetic capability, and supports single-instruction-multiple-data (SIMD) operations via NEON, ARM's SIMD instruction extension. To improve the performance of SpMV on an ARM multi-core processor platform, this paper makes the following contributions:

Because the nonzero elements of a sparse matrix are stored discontinuously, the corresponding entries of the vector x can only be loaded into the vector registers sequentially, which degrades the efficiency of the SIMD unit. We present the aligned storage formats ACSR and AELL, based on the CSR and ELL formats, which align the data to the SIMD registers of ARM processors. We then analyze the impact of SIMD instruction latency, cache accesses, and cache misses on SpMV with the different formats, and establish the relationship between instruction latency, the number of cache accesses, and the zero-element filling rate of the aligned storage formats. In the experiments, our SpMV algorithms based on ACSR and AELL improve the efficiency of memory access and the utilization of the vector registers, achieving a 1.18x speedup over CSR-based SpMV, a 1.56x speedup over the SpMV in PETSc, and a 1.21x speedup over ELL on the Kunpeng 920 processor. Moreover, the deviations between the theoretical and experimental results for instruction latency and cache accesses are 10.26% and 10.51% for ACSR and 5.68% and 2.91% for AELL, respectively.

To address the memory bottleneck caused by the NUMA architecture of ARM multi-core platforms, we adopt a block-partitioning strategy and a NUMA affinity strategy. The sparse matrices are partitioned into fine-grained blocks, and the memory is then redistributed across the NUMA nodes according to the mapping between the blocks and the processor's computing cores, which reduces the memory-access latency caused by data migration during computation. In the experiments, the block-partitioning strategy achieves an average speedup of 2.66x, and the average speedup over the SpMV in PETSc is 2.44x on the Kunpeng 920 processor.

In summary, this paper improves the efficiency of SIMD arithmetic and of memory access under the NUMA architecture on an ARM multi-core platform, which increases the memory bandwidth and processor utilization of SpMV computations. The work is also applicable to mainstream domestic ARM multi-core processors.
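To make the alignment idea concrete, the following is a minimal sketch of an ACSR-style SpMV kernel for AArch64 NEON. It assumes each row's nonzeros have been padded with explicit zeros (and a valid dummy column index) so that the row length is a multiple of the NEON register width of two doubles; the structure and field names are illustrative assumptions, not taken from the thesis.

```c
/* Minimal sketch of an aligned-CSR (ACSR-style) SpMV kernel for AArch64 NEON.
 * Assumption: every row is padded with zero values (and a dummy column index 0)
 * so its nonzero count is a multiple of 2, the doubles per 128-bit register. */
#include <arm_neon.h>
#include <stddef.h>

typedef struct {
    size_t        nrows;
    const size_t *row_ptr;  /* padded row offsets, length nrows + 1            */
    const size_t *col_idx;  /* padded column indices                           */
    const double *val;      /* padded values (zero-filled where padded)        */
} acsr_t;

/* y = A * x, processing two nonzeros per iteration with fused multiply-add. */
static void spmv_acsr(const acsr_t *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->nrows; ++i) {
        float64x2_t acc = vdupq_n_f64(0.0);
        for (size_t j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j += 2) {
            float64x2_t av = vld1q_f64(&A->val[j]);       /* contiguous vector load   */
            float64x2_t xv = vdupq_n_f64(x[A->col_idx[j]]);
            xv = vsetq_lane_f64(x[A->col_idx[j + 1]], xv, 1); /* x is still gathered  */
            acc = vfmaq_f64(acc, av, xv);                 /* acc += av * xv           */
        }
        y[i] = vaddvq_f64(acc);                           /* horizontal sum of lanes  */
    }
}
```

With this padding, the value array can be loaded in whole vector chunks, although the entries of x must still be gathered through the column indices.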
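For the NUMA affinity strategy, the sketch below shows one way to place a row block's data on the NUMA node whose cores will process it, using Linux's libnuma (link with -lnuma). The block boundaries and the block-to-node mapping are assumptions for illustration; the thesis's actual partitioning heuristic is not reproduced here.

```c
/* Minimal sketch of NUMA-aware placement of a row block, assuming Linux + libnuma. */
#include <numa.h>
#include <string.h>
#include <stddef.h>

/* Copy the values of one row block [first, last) onto the given NUMA node, so the
 * SpMV threads pinned to that node read local memory instead of migrating data
 * across nodes during the computation. Caller checks numa_available() >= 0 first. */
static double *place_block_on_node(const double *val, size_t first, size_t last, int node)
{
    size_t  n     = last - first;
    double *local = numa_alloc_onnode(n * sizeof(double), node);
    if (local)
        memcpy(local, val + first, n * sizeof(double));  /* pages now reside on `node` */
    return local;                                        /* release with numa_free()   */
}
```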
Keywords/Search Tags:ARM Multi-core processors, NUMA, Parallel computing, SIMD, Sparse matrix storage formats, SpMV