
Key Research Topics for Single GPUs and GPU Heterogeneous Clusters

Posted on: 2014-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K L Zhang
Full Text: PDF
GTID: 1228330434971187
Subject: Computer system architecture
Abstract/Summary:
Modern GPUs are increasingly adopted by cluster systems as high-performance computing units because of their powerful computational capability, high memory bandwidth, and highly data-parallel architecture. The GPU heterogeneous cluster is becoming the mainstream distributed computing platform for high-performance computing applications, and modern computing is gradually entering the era of data-level parallelism.

Whether data-level parallel computing can be widely used in real-world applications depends not only on whether the designed and implemented data-parallel algorithms achieve high performance on current hardware platforms, but also on whether they have sustainable scalability at both the system level (across nodes) and the node level (within a single node), i.e., whether their performance improves linearly with the computational power and memory bandwidth of the hardware.

Supported by several funded research projects, we comprehensively study both the system-level and node-level scalability of GPU heterogeneous clusters. For the scalability of system-level algorithms and applications, the main research contributions are the following:

(1) Designing and implementing a high-level programming framework, DISPAR, on top of underlying hybrid programming frameworks (e.g., CUDA/MPI, OpenACC/MPI). DISPAR offers good application-level abstraction, architecture independence, and good scalability, and it provides an efficient system-level solution to the critical problems of applications on GPU heterogeneous clusters.

(2) Implementing the translation from DISPAR source code to underlying hybrid-framework code (e.g., CUDA/MPI, OpenACC/MPI) via the ADTCM preprocessor. An efficient task-scheduling strategy and the corresponding algorithms are presented to realize an approximately optimal mapping between tasks and computing resources.

The most direct performance improvements ultimately come from applications at the node level. The performance of most applications in electronic design automation, scientific computing, and other general-purpose computing domains is bounded by their critical operations, such as sparse matrix operations. Designing and implementing scalable, highly efficient data-parallel algorithms for these critical operations at the node level is therefore the key to fully exploiting GPU resources. For the scalability of node-level algorithms and applications, the main research contributions are the following:

(3) To provide sustainable scalability at the hardware-architecture level, the hundreds or thousands of processing elements in a GPU are organized into multiple independent physical SIMD engines. However, there are no synchronization primitives across different SIMD engines comparable to those within a single engine. Synchronization across engines can be emulated with atomic operations, but because atomic operations are sequential by nature, this destroys sustainable scalability. Guided by the design principle of sustainable scalability, this dissertation presents general-purpose and application-specific techniques that give the designed algorithms good scalability. Examples include a data-parallel odd-even merge sort and radix sort based on bucket-partition preprocessing, and a data-parallel band matrix-vector multiplication based on an anti-diagonal processing order. These designs leave the data-parallel algorithms free of data dependences, completely avoiding synchronization and the corresponding atomic operations and giving the parallel algorithms good scalability; two illustrative sketches follow.
(4) Because modern GPUs support the concurrent execution of multiple kernels, for algorithms that do not scale well we can also use the PTA algorithm presented in this dissertation to find an efficient way of packing those kernels into a single packed kernel that fully exploits GPU resources (a sketch of the packing idiom appears below).

(5) Redesigning and improving scalable data-parallel algorithms for timing analysis, an important application in electronic design automation (EDA), accelerating the key EDA algorithms, and exploring the prospects of scalable data-parallel technology and many-core processor technology in the EDA field. This dissertation presents a new sparse format, ELLV, for statistical static timing analysis based on a sparse-matrix framework. The ELLV format not only makes the corresponding data-parallel algorithm straightforward, but also gives the parallel algorithm good scalability. Moreover, Jacobi preconditioning based on the ELLV format halves the memory accesses compared with the ELLH format, leading to a performance improvement of about 15% (the standard ELL kernel that such formats build on is sketched below).
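The PTA algorithm itself is not detailed in this abstract, but the underlying idiom of packing two kernels into one by partitioning the block index space is well established; the sketch below is a minimal, hypothetical instance of that idiom (kernel and parameter names are ours, and the choice of the split point is left to a scheduler such as PTA).

```cuda
#include <cuda_runtime.h>

// Two small workloads packed into one kernel by splitting the block
// index range, so both occupy the GPU simultaneously. Launch with
// blocksForA + ceil(nb / blockDim.x) blocks; picking blocksForA is
// the scheduler's job and is simply a parameter here.
__global__ void packedKernel(float *a, int na, float *b, int nb,
                             int blocksForA) {
    if (blockIdx.x < blocksForA) {
        // Sub-kernel A: scale its array.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < na) a[i] *= 2.0f;
    } else {
        // Sub-kernel B: offset its array.
        int i = (blockIdx.x - blocksForA) * blockDim.x + threadIdx.x;
        if (i < nb) b[i] += 1.0f;
    }
}
```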
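ELLV and ELLH are formats defined in the dissertation and are not reproduced here. For context, this is the standard ELLPACK (ELL) sparse matrix-vector multiply that such formats are variants of, with column-major storage so that consecutive threads make coalesced memory accesses.

```cuda
#include <cuda_runtime.h>

// Standard ELLPACK (ELL) SpMV: the matrix is padded so every row has
// maxnz entries; val and col are stored column-major (entry k of row
// r lives at index k * n + r), so threads in a warp touch consecutive
// addresses. Each thread computes one output row independently.
__global__ void ellSpMV(const float *val, const int *col,
                        const float *x, float *y, int n, int maxnz) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int k = 0; k < maxnz; ++k) {
        int c = col[k * n + row];
        if (c >= 0)                   // -1 marks a padding slot
            sum += val[k * n + row] * x[c];
    }
    y[row] = sum;
}
```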
Keywords/Search Tags: Many-Core Architecture, Graphics Processing Unit, General-Purpose Computing on Graphics Processing Units (GPGPU), Parallel Processing, Data-Level Parallelism, Sorting Algorithm, Band Matrix-Vector Multiplication, Preconditioning Technology