
Design And Implementation Of BLAS On Multicore Vector Processor

Posted on: 2015-08-27  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Zhang  Full Text: PDF
GTID: 2308330479479467  Subject: Software engineering
Abstract/Summary:
The Basic Linear Algebra Subprograms (BLAS) library is one of the most widely used math libraries in scientific computing and lies at the core of the algorithms used for computer performance evaluation. X-DSP is a high-performance multi-core vector processor developed in-house at our institution, mainly used in digital communications, graphics, and radar signal processing. To exploit the full performance of X-DSP, developing a highly optimized BLAS library tailored to its specific structural features has both theoretical significance and practical value. In this thesis, we study in depth the structural characteristics of X-DSP and the algorithmic characteristics of BLAS1 and BLAS2, design an efficient algorithm mapping for the multi-level parallelism of the X-DSP architecture, and implement a high-performance BLAS1 and BLAS2 library through manual, hand-coded optimization. The main work of the thesis covers the following aspects:

(1) Design and implementation of BLAS1 on a single X-DSP core. Taking the vector norm, the matrix norm, and DDOT as examples, we analyze and design the algorithms in detail for the architectural features of this DSP, and use software pipelining and manual optimization to exploit the processor's instruction-level, vector-level, and data-level parallelism. Tests on an RTL-level platform show an average performance of 64.49 GFLOPS for the vector norm, 49.34 GFLOPS for the matrix norm, and 47.50 GFLOPS for DDOT.

(2) Design and implementation of BLAS2 on a single X-DSP core. We analyze in depth the computational characteristics of five routines: GEMV, SUM_MV, GER, TRMV, and TRSV. We improve the calculation method of GEMV; optimize the design of SUM_MV to address multi-cycle write conflicts; carry out a comparative analysis of GER, GER2, SYR, and SYR2; partition the triangular matrix in TRMV into blocks to reduce the number of zero elements that take part in the computation; and analyze the update method of TRSV. Following the X-DSP hardware mechanisms, the five routines GEMV, SUM_MV, GER, TRMV, and TRSV are mapped efficiently onto a single X-DSP core, reaching 93.67, 53.06, 63.85, 92.85, and 62.78 GFLOPS respectively.

(3) Design and implementation of parallel BLAS2 on multi-core X-DSP. We study the parallelism of the main BLAS2 subroutines in depth and, based on the structure of X-DSP, identify parallel schemes suited to this architecture. The data are partitioned into blocks so that the computations on different blocks are independent of each other, and the load across cores is balanced through optimization. Using the X-DSP multi-core communication and synchronization mechanisms, the parallel blocked algorithms are mapped efficiently onto the X-DSP cores. The average speedup with twelve cores running in parallel is 5.63.
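As a rough illustration of the block data-parallel idea described in the abstract, the following C sketch computes a matrix-vector product (the core of GEMV, here simply y = A*x) by splitting the rows of A into independent blocks, each row being a dot product like the one DDOT computes. This is not the thesis's X-DSP implementation: the function names `ref_ddot`, `gemv_row_block`, and `gemv_blocked` are hypothetical, row-major storage is assumed, and a sequential loop over `workers` merely stands in for the cores.

```c
#include <assert.h>
#include <stddef.h>

/* Reference dot product with unit stride: returns sum of x[i] * y[i]. */
double ref_ddot(size_t n, const double *x, const double *y) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* Compute y[r] = dot(row r of A, x) for rows in [row_begin, row_end).
 * A is m-by-n, stored row-major (an assumption of this sketch).
 * Each call writes only its own slice of y, so blocks are independent. */
void gemv_row_block(size_t row_begin, size_t row_end, size_t n,
                    const double *a, const double *x, double *y) {
    for (size_t r = row_begin; r < row_end; ++r) {
        y[r] = ref_ddot(n, a + r * n, x);
    }
}

/* Split the m rows into `workers` contiguous blocks whose sizes differ
 * by at most one row (a simple form of load balancing), then process
 * each block. On real hardware each block would go to a separate core;
 * here the blocks are processed one after another. */
void gemv_blocked(size_t m, size_t n, size_t workers,
                  const double *a, const double *x, double *y) {
    assert(workers > 0);
    size_t base = m / workers;   /* rows every block receives */
    size_t extra = m % workers;  /* first `extra` blocks get one more row */
    size_t begin = 0;
    for (size_t w = 0; w < workers; ++w) {
        size_t rows = base + (w < extra ? 1 : 0);
        gemv_row_block(begin, begin + rows, n, a, x, y);
        begin += rows;
    }
}
```

Because no block reads or writes another block's part of y, the calls to `gemv_row_block` could run concurrently with synchronization only at the end. Calling `gemv_blocked(m, n, 12, a, x, y)` would mirror a twelve-way split, although the real mapping onto X-DSP depends on hardware details not given in the abstract.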
Keywords/Search Tags: Multi-core vector processors, Matrix-vector multiplication, Norm optimization, Parallel, Blocked algorithms, Basic Linear Algebra Subprograms