
Research On Parallel Optimization Of BLAS Based On The New Generation Of Sunway Many-core Processor

Posted on: 2022-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xu
GTID: 2518306773497654
Subject: Automation Technology
Abstract/Summary:
The BLAS library provides a large number of important basic linear algebra routines, which are widely used in high-performance scientific computing and artificial intelligence software. Software and hardware vendors have optimized BLAS to varying degrees for the architectural characteristics of their own platforms. However, because of differences in architecture and platform characteristics, the open-source versions and the BLAS libraries developed for other processor architectures suffer from compatibility problems and insufficient performance on the new generation of Sunway many-core processor. As a result, the performance of the processor cannot be fully utilized, and upper-layer applications with extreme performance demands cannot be well supported. Implementing a high-performance BLAS library on the new generation of Sunway many-core processor is therefore of great significance for exploiting the full performance of the processor, for supporting large-scale numerical computation and related research and applications on these platforms, and for enriching the software ecosystem of Sunway supercomputing platforms.

The main contributions of this thesis are as follows:

1. Based on the characteristics of Level 1 and Level 2 BLAS functions and on the memory access and computing characteristics of the Sunway many-core processor SW26010 Pro, this thesis proposes a task blocking and balancing strategy so that the load is balanced across the slave cores. An optimization method based on DMA memory access and SIMD computation is designed to improve both memory access speed and computation speed. The thesis analyzes memory access and computation performance for different data sizes and puts forward a test scheme for overlapping memory access latency with computation, which achieves a good double-buffering effect. Parallel versions of the Level 1 and Level 2 BLAS functions are implemented and optimized. Experiments show that the optimized parallel Level 1 and Level 2 functions reach 90% of the hardware's peak bandwidth.

2. The slave core array of the Sunway many-core processor has a startup delay and is started on demand. For small data sizes, the parallel versions of the Level 1 and Level 2 BLAS functions perform poorly and may not even match the serial versions. Moreover, when the data cannot be divided evenly among the slave cores, the slave cores cannot be fully utilized. In view of these characteristics, this thesis analyzes how the serial and parallel versions of the Level 1 and Level 2 functions actually perform at different data sizes, designs multiple function versions for different data sizes, and proposes an adaptive scheduling algorithm. With this algorithm, the master core dynamically dispatches the appropriate function version according to the data size, so that the performance of the processor is exploited further.

3. For the core computation of GEMM, this thesis proposes optimization methods at the C language level and at the assembly language level, and designs two high-performance GEMM kernels, 32 × 32 and 64 × 64, which apply to a wide range of scenarios and provide strong support for larger-scale GEMM computation. At the C language level, vector registers, SIMD vectorization and loop unrolling are combined so that the number of loop iterations is reduced, or some loops are eliminated entirely, without changing the amount of computation. At the assembly language level, the shortcomings and patterns of the compiler's register allocation are identified and summarized, and a register reallocation method is proposed. For instruction-level optimization, instruction fusion and instruction reordering are proposed, which combine multiple instructions into one, reduce the number of instructions issued and improve multi-level pipeline parallelism. Experiments show that the computational efficiency of the optimized GEMM kernels is significantly improved: the 32 × 32 kernel reaches 89.4% and the 64 × 64 kernel reaches 94.3%.
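The balanced task blocking of contribution 1 and the size-based dispatch of contribution 2 can be illustrated with a minimal sketch in portable C. It is not the thesis's implementation: the worker count `NUM_WORKERS`, the cutoff `PARALLEL_THRESHOLD`, and all function names are hypothetical, the abstract gives no concrete values, and the "parallel" path here simply walks the per-worker ranges in sequence instead of launching slave cores with DMA and SIMD.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values: the abstract states neither a core count
   nor a size threshold. */
enum { NUM_WORKERS = 64, PARALLEL_THRESHOLD = 4096 };

/* Half-open range [begin, end) of elements assigned to one worker. */
typedef struct { size_t begin, end; } range_t;

/* Split n elements over `workers` so that chunk sizes differ by at
   most one: the first n % workers workers get one extra element. */
range_t balanced_range(size_t n, size_t workers, size_t id)
{
    assert(workers > 0 && id < workers);
    size_t base = n / workers;
    size_t extra = n % workers;
    range_t r;
    r.begin = id * base + (id < extra ? id : extra);
    r.end = r.begin + base + (id < extra ? 1 : 0);
    return r;
}

/* Serial AXPY: y[i] += alpha * x[i]. */
void axpy_serial(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Stand-in for a parallel version: processes each worker's range in
   turn. On real hardware each range would run on its own core. */
void axpy_blocked(size_t n, double alpha, const double *x, double *y)
{
    for (size_t id = 0; id < NUM_WORKERS; ++id) {
        range_t r = balanced_range(n, NUM_WORKERS, id);
        axpy_serial(r.end - r.begin, alpha, x + r.begin, y + r.begin);
    }
}

/* Size-based dispatch: small inputs stay on the serial path to avoid
   startup cost. Returns 1 if the blocked path was used, else 0. */
int axpy_adaptive(size_t n, double alpha, const double *x, double *y)
{
    if (n < PARALLEL_THRESHOLD) {
        axpy_serial(n, alpha, x, y);
        return 0;
    }
    axpy_blocked(n, alpha, x, y);
    return 1;
}
```

Calling `balanced_range(10, 4, id)` for `id` 0 through 3 yields chunks of 3, 3, 2 and 2 elements, so no worker carries more than one element above any other. In practice a cutoff like `PARALLEL_THRESHOLD` would be measured per function rather than fixed, which is the role the abstract assigns to its performance analysis at different data sizes.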
Keywords/Search Tags: Sunway Many-core Processor, Parallel Computing, BLAS, SIMD, Assembly Optimization