
Key Research Topics for Single GPUs and GPU Heterogeneous Clusters

Posted on: 2014-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K L Zhang
Full Text: PDF
GTID: 1228330434971187
Subject: Computer system architecture
Abstract/Summary:
Modern GPUs are increasingly adopted by cluster systems as high-performance computing units because of their powerful computational capability, high memory bandwidth, and highly data-parallel architecture. The GPU heterogeneous cluster is becoming the mainstream distributed computing platform for high-performance computing applications, and modern computing is gradually entering the era of data-level parallelism.

Whether data-level parallel computing can be widely used in real-world applications depends not only on whether the designed and implemented data-parallel algorithms achieve high performance on current hardware platforms, but also on whether they have sustainable scalability at both the system level (across nodes) and the node level (within a single node), i.e., whether their performance improves linearly with the computational power and memory bandwidth of the hardware.

Supported by several funded research projects, we comprehensively study both the system-level and node-level scalability of GPU heterogeneous clusters. For the scalability of system-level algorithms and applications, the main research contributions are the following:

(1) Designing and implementing a high-level programming framework, DISPAR, on top of underlying hybrid programming frameworks (e.g., CUDA/MPI, OpenACC/MPI). DISPAR offers good application-level abstraction, architecture independence, and good scalability, and it provides an efficient system-level solution to the critical problems of applications on GPU heterogeneous clusters.

(2) Implementing the translation from DISPAR source code to underlying hybrid-framework code (e.g., CUDA/MPI, OpenACC/MPI) via the ADTCM preprocessor. An efficient task-scheduling strategy and the corresponding algorithms are presented to realize an approximately optimal mapping between tasks and computing resources.

The most direct performance improvements ultimately come from applications at the node level. The performance of most applications in electronic design automation, scientific computing, and other general-purpose computing domains is bounded by their critical operations, such as sparse matrix operations. Designing and implementing scalable, highly efficient data-parallel algorithms for these critical operations at the node level is therefore the key to fully exploiting GPU resources. For the scalability of node-level algorithms and applications, the main research contributions are the following:

(3) To provide sustainable scalability at the hardware-architecture level, the hundreds or thousands of processing elements in a GPU are organized into multiple independent physical SIMD engines. However, there are no synchronization primitives across different SIMD engines comparable to those within a single engine. Synchronization across engines can be emulated with atomic operations, but because atomic operations are sequential by nature, this destroys sustainable scalability. Guided by the design principle of sustainable scalability, this dissertation presents general-purpose and application-specific techniques that give the designed algorithms good scalability. Examples include a data-parallel odd-even merge sort and radix sort based on bucket-partition preprocessing, and a data-parallel band matrix-vector multiplication based on an anti-diagonal processing order. These designs leave the data-parallel algorithms free of data dependences, completely avoiding synchronization and the corresponding atomic operations and giving the parallel algorithms good scalability; two illustrative sketches follow.
(4) Because modern GPUs support the concurrent execution of multiple kernels, for algorithms that do not scale well we can also use the PTA algorithm presented in this dissertation to find an efficient way of packing those kernels into a single packed kernel that fully exploits GPU resources (a sketch of the packing idiom appears below).

(5) Redesigning and improving scalable data-parallel algorithms for timing analysis, an important application in electronic design automation (EDA), accelerating the key EDA algorithms, and exploring the prospects of scalable data-parallel technology and many-core processor technology in the EDA field. This dissertation presents a new sparse format, ELLV, for statistical static timing analysis based on a sparse-matrix framework. The ELLV format not only makes the corresponding data-parallel algorithm straightforward, but also gives the parallel algorithm good scalability. Moreover, Jacobi preconditioning based on the ELLV format halves the memory accesses compared with the ELLH format, leading to a performance improvement of about 15% (the standard ELL kernel that such formats build on is sketched below).
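The PTA algorithm itself is not detailed in this abstract, but the underlying idiom of packing two kernels into one by partitioning the block index space is well established; the sketch below is a minimal, hypothetical instance of that idiom (kernel and parameter names are ours, and the choice of the split point is left to a scheduler such as PTA).

```cuda
#include <cuda_runtime.h>

// Two small workloads packed into one kernel by splitting the block
// index range, so both occupy the GPU simultaneously. Launch with
// blocksForA + ceil(nb / blockDim.x) blocks; picking blocksForA is
// the scheduler's job and is simply a parameter here.
__global__ void packedKernel(float *a, int na, float *b, int nb,
                             int blocksForA) {
    if (blockIdx.x < blocksForA) {
        // Sub-kernel A: scale its array.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < na) a[i] *= 2.0f;
    } else {
        // Sub-kernel B: offset its array.
        int i = (blockIdx.x - blocksForA) * blockDim.x + threadIdx.x;
        if (i < nb) b[i] += 1.0f;
    }
}
```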
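ELLV and ELLH are formats defined in the dissertation and are not reproduced here. For context, this is the standard ELLPACK (ELL) sparse matrix-vector multiply that such formats are variants of, with column-major storage so that consecutive threads make coalesced memory accesses.

```cuda
#include <cuda_runtime.h>

// Standard ELLPACK (ELL) SpMV: the matrix is padded so every row has
// maxnz entries; val and col are stored column-major (entry k of row
// r lives at index k * n + r), so threads in a warp touch consecutive
// addresses. Each thread computes one output row independently.
__global__ void ellSpMV(const float *val, const int *col,
                        const float *x, float *y, int n, int maxnz) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < maxnz; ++k) {
        int c = col[k * n + row];
        if (c >= 0)                   // -1 marks a padding slot
            sum += val[k * n + row] * x[c];
    }
    y[row] = sum;
}
```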
Keywords/Search Tags: Many-Core Architecture, Graphics Processing Unit, General-Purpose Computing on Graphics Processing Units (GPGPU), Parallel Processing, Data-Level Parallelism, Sorting Algorithm, Band Matrix-Vector Multiplication, Preconditioning Technology