
Design And Implementation Of High Performance Parallelization Level3 BLAS On Multicore DSP

Posted on: 2014-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2308330479979462
Subject: Software engineering
Abstract/Summary:
In high performance computing, the BLAS library plays a very important role and is an important indicator of the potential performance of novel architectures. Research on a parallel BLAS library for the C6678 multi-core DSP is therefore significant both for evaluating and applying the device in high performance computing and for parallel software development on it. This thesis studies the Level 3 BLAS (BLAS3) routines in depth. It designs and implements GEMM, SYMM, SYRK, SYR2K and TRMM on a single C6678 DSP core and then, with careful consideration of the multi-core communication and synchronization mechanisms, designs and implements parallel versions of the same five routines on the multi-core C6678 DSP. The work mainly includes the following aspects:

Design and implementation of GEMM on a single C6678 DSP core. Based on an in-depth study of the GEMM routine, the kernel loop that determines performance is isolated; the ratio of computation to data movement is compared and analyzed against the architectural features; and, according to the C6678 hardware resources and architectural features, memory access and the storage of computational data are designed and optimized, and the storage space is partitioned rationally. The resulting high-performance GEMM reaches up to 8.49 GFLOPS in testing.

Design and implementation of BLAS3 on a single C6678 DSP core. The computational characteristics of the four routines SYMM, SYRK, SYR2K and TRMM are analyzed in depth. For SYMM, access to the symmetric matrix data is designed and optimized. For SYRK, the BP kernel that updates the symmetric matrix is designed and optimized. For SYR2K, the computation is transformed so that it can call the SYRK interface routines directly. For TRMM, access to the triangular matrix data is analyzed, and the BP kernel is designed and optimized according to the features of the diagonal data. The four routines are mapped efficiently onto a single C6678 DSP core according to its hardware mechanisms, reaching 8.241, 8.102, 8.008 and 8.203 GFLOPS for SYMM, SYRK, SYR2K and TRMM, respectively.

Design and implementation of parallel BLAS3 on the multi-core C6678 DSP. Based on an in-depth study of each routine, the data are decomposed by blocks so that the resulting computations are independent of one another; the multi-core load balance is optimized; and, combined with the C6678 multi-core communication and synchronization mechanisms, the blocked algorithms are mapped efficiently onto the C6678. With eight cores running in parallel, the speedups of GEMM, SYMM, SYRK, SYR2K and TRMM are 6.21, 5.22, 4.49, 4.49 and 4.55, respectively.
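
The block data-parallel decomposition described for the multi-core GEMM can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a simple row-block partition of the output matrix, it runs the per-core work sequentially instead of on real cores, and all function names (`partition_rows`, `gemm_block`, `parallel_gemm`, `parallel_efficiency`) are hypothetical. A real C6678 version would also manage on-chip memory placement and inter-core synchronization, which are not modeled here.

```python
def partition_rows(m, workers):
    """Split m rows into `workers` contiguous ranges whose sizes differ by at most one."""
    base, extra = divmod(m, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def gemm_block(A, B, C, rows, block=4):
    """Accumulate C[r0:r1, :] += A[r0:r1, :] * B, tiling the k and j loops."""
    r0, r1 = rows
    k_dim, n = len(B), len(B[0])
    for kk in range(0, k_dim, block):
        k_end = min(kk + block, k_dim)
        for jj in range(0, n, block):
            j_end = min(jj + block, n)
            for i in range(r0, r1):
                a_row, c_row = A[i], C[i]
                for k in range(kk, k_end):
                    a_ik, b_row = a_row[k], B[k]
                    for j in range(jj, j_end):
                        c_row[j] += a_ik * b_row[j]


def parallel_gemm(A, B, workers=8, block=4):
    """Compute C = A * B; each row range is independent and could run on its own core."""
    m, n = len(A), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for rows in partition_rows(m, workers):
        # Assumption: sequential stand-in for eight cores working concurrently.
        gemm_block(A, B, C, rows, block)
    return C


def parallel_efficiency(speedup, cores=8):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores
```

Because each row range writes a disjoint part of C, the blocks need no synchronization until all of them finish. Applied to the reported figures, `parallel_efficiency(6.21)` is about 0.78 for GEMM and `parallel_efficiency(4.49)` is about 0.56 for SYRK and SYR2K.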
Keywords/Search Tags: multi-core processors, parallel, basic linear algebra subprograms, matrix multiplication, blocked algorithms