Font Size: a A A

Optimization Of Irregular Batched Matrix Multiplications On New Generation Shenwei Many-core Processor

Posted on:2023-07-15Degree:MasterType:Thesis
Country:ChinaCandidate:L XuFull Text:PDF
GTID:2558306905993939Subject:Computer system architecture
Abstract/Summary:
As the key operation in high-performance and intelligent computing,the batched matrix multiplication has two characteristics of dense computing and batched computing,and its implementation and optimization on emerging many-core processors are critically important.Based on the new generation Shenwei many-core processor,this paper deeply analyzes the computing characteristics and memory access behavior of batched matrix multiplications and designs corresponding parallel algorithms for various irregular matrix multiplications.And the memory access and computing are further optimized to achieve their balance.Finally,algorithms are selected and scheduled dynamically by algorithm selector and task scheduler to make full use of the powerful computing capacity of many-core processors.The main contributions of this paper include the following aspects:1.We propose a two-layer algorithm framework including the execution layer and decision layer.The execution layer designs four algorithms for single CPE,single cluster,four clusters,and entire core group,respectively,according to the processor architecture to adapt to various scales and shapes of matrices.And there are the forward and backward sending,cyclic shift,horizontal and vertical dual broadcast on-chip parallel algorithms and data sharing strategies at the execution layer for different computational resources each algorithm utilizes,which effectively improves computational parallelism and data reusability.The decision layer is responsible for algorithm selecting and task scheduling.The algorithm selector based on multi-class SVM achieves an effective selection of algorithms,with an accuracy of over 95.3%.And the dynamic task scheduler keeps the balance among compute units and solves the long tail problem of static task scheduling.2.The balance of computing and memory access is achieved via data prefetch and adaptive block algorithm.To bridge the performance gap between computing and memory access,a hybrid prefetch strategy of DMA and RMA is employed,which effectively hides memory access and communication overhead,and improves the overall performance by more than 80%.The operational intensity of the algorithm is related to the loop order and its block sizes,linking computing and memory access.To smooth the performance jitter of irregular matrix multiplications,a block search engine of coordinate descent is proposed aimed at operational intensity,which adaptively gives the optimal loop order and its block sizes.Besides,the Pack-Pad-Adjust algorithm combines transposed padding with data prefetch to hide the transpose cost and save LDM space.3.We design a novel compute kernel towards the new generation Shenwei instruction set.The instruction selection,register block and instruction scheduling are implemented by manual assembly to pursue high performance.The matrix multiplication and scalar multiplication in the micro-kernel are implemented by the vlenma and vlenmas instructions with the outer product form.And the optimal block sizes are selected according to the computational parallelism and operational intensity of registers.Through the analysis of instruction dependence,the instruction scheduling technique based on register buffering is used to expand the safe intervals between instructions to achieve dual issue of compute and memory access instructions,improving the overall performance by more than 380%.This paper achieves highly efficient irregular batched matrix multiplications on the new generation Shenwei processor through the above work.The performance of the entire core group algorithm can reach 2 TFLOPS,which is close to the theoretical peak performance,and each algorithm can maintain its performance advantage under most matrix sizes.
Keywords/Search Tags:parallel algorithm, Shenwei many-core processor, dense computing, batched computing, irregular matrix multiplication
Related items