
Research And Implementation On A Compiler Framework For Translating ANSI C Into CUDA C

Posted on: 2012-08-11    Degree: Master    Type: Thesis
Country: China    Candidate: Q Zhu    Full Text: PDF
GTID: 2218330362960097    Subject: Computer Science and Technology
Abstract/Summary:
Recently, the GPU (Graphics Processing Unit) has been widely used in high-performance computing applications such as biomedicine, financial analysis, physical simulation, and database processing because of its powerful computing capability. GPGPU (General-Purpose computation on GPU) refers to using the GPU in areas other than graphics rendering. Making the best use of GPU computing resources is challenging because of the complexity of the hardware. CUDA (Compute Unified Device Architecture), introduced by NVIDIA Corporation, provides an efficient solution for managing many-core processors such as GPUs. Compared with previous programming models, CUDA brings two improvements: a unified processing architecture and on-chip shared memory. These improvements make the GPU more suitable for general-purpose computing. However, because of the multi-level thread structure and memory hierarchy of CUDA-enabled GPUs, programmers must still be familiar with the underlying architecture in order to develop high-performance applications.

We propose a source-to-source compiler framework to reduce this burden on GPU programmers. The framework not only automatically generates applications that execute on a heterogeneous system composed of a CPU and a GPU, but also performs optimizations that improve the parallelism of the applications. The innovative work in this thesis can be summarized as follows.

1. A compiler framework, named ICuda, for translating ANSI C into CUDA code is proposed. ICuda releases the programmer from the details of the GPU architecture and CUDA, and improves the efficiency of developing high-performance parallel programs. Whereas most existing frameworks handle only determinant- or matrix-based applications, ICuda is able to deal with general applications.

2. A scheduling approach for parallelizing loop structures is proposed. When serial code is parallelized, the data and loop structures must be transformed to fit the many-core architecture of the programming model. Accordingly, we propose a scheduling approach combining subscript transformation with the distribution of accesses to shared variables. Before parallelization, nested sequential loops are flattened into a single sequential loop to simplify data indexing; during parallelization, accesses to a shared variable are distributed across separate copies to reduce memory access overhead (both steps are sketched after this abstract).

3. A novel memory access optimization for CUDA is proposed. The single most important performance consideration in programming for the CUDA architecture is coalescing global memory accesses, yet it is difficult to transform non-coalesced accesses into coalesced ones. As an alternative, we propose an efficient strategy that optimizes memory access by binding read-only data to the texture memory space (sketched after this abstract).

We implement the ICuda framework on top of the SUIF2 system, adopt Parboil as the benchmark suite, and design a detailed scheme for performance testing and analysis. Experimental results demonstrate the validity and efficiency of the optimization methods and the ICuda framework presented in this thesis.
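The loop-flattening step of contribution 2 can be pictured with a small hand-written sketch; it is not ICuda's generated code. Assuming an element-wise addition over an N x M array (the sizes, kernel name, and block size are arbitrary choices of the sketch), the nested i/j loop collapses into one flattened index k = i*M + j that maps directly onto a one-dimensional CUDA thread index.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Hypothetical problem sizes, used only for illustration. */
    #define N 1024
    #define M 512

    /* Serial original (conceptually):
     *   for (i = 0; i < N; i++)
     *     for (j = 0; j < M; j++)
     *       c[i][j] = a[i][j] + b[i][j];
     * After flattening, the single index k = i*M + j replaces the nested
     * indices and maps onto a one-dimensional thread index. */
    __global__ void add_flattened(const float *a, const float *b,
                                  float *c, int total)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;  /* flattened index */
        if (k < total)                                  /* guard the tail block */
            c[k] = a[k] + b[k];
    }

    int main(void)
    {
        int total = N * M, k;
        size_t bytes = total * sizeof(float);
        float *h_a = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
        float *d_a, *d_b, *d_c;

        for (k = 0; k < total; k++) h_a[k] = 1.0f;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice); /* reuse h_a as b */

        add_flattened<<<(total + 255) / 256, 256>>>(d_a, d_b, d_c, total);
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);                       /* expect 2.0 */

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_c);
        return 0;
    }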
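Distributing accesses to a shared variable across copies can likewise be sketched as a standard privatized reduction; again this is an illustration rather than the framework's output, and the kernel name, block size, and host-side combination are assumptions of the sketch. Each thread updates its own slot of on-chip shared memory instead of contending for a single accumulator, and the per-thread copies are then combined by a tree reduction within the block.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define THREADS 256  /* block size, chosen arbitrarily for the sketch */

    /* Each thread accumulates into its own copy of the shared variable
     * (one slot per thread in on-chip shared memory); the copies are
     * combined by a tree reduction inside the block. */
    __global__ void sum_privatized(const float *in, float *block_sums, int n)
    {
        __shared__ float partial[THREADS];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0.0f;   /* private copy per thread */
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = partial[0]; /* one value per block */
    }

    int main(void)
    {
        const int n = 1 << 20;
        const int blocks = (n + THREADS - 1) / THREADS;
        float *h_in = (float *)malloc(n * sizeof(float));
        float *h_bs = (float *)malloc(blocks * sizeof(float));
        float *d_in, *d_bs;
        double sum = 0.0;
        int i;

        for (i = 0; i < n; i++) h_in[i] = 1.0f;
        cudaMalloc((void **)&d_in, n * sizeof(float));
        cudaMalloc((void **)&d_bs, blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        sum_privatized<<<blocks, THREADS>>>(d_in, d_bs, n);
        cudaMemcpy(h_bs, d_bs, blocks * sizeof(float), cudaMemcpyDeviceToHost);

        for (i = 0; i < blocks; i++) sum += h_bs[i];  /* combine block results */
        printf("sum = %.0f (expect %d)\n", sum, n);

        cudaFree(d_in); cudaFree(d_bs);
        free(h_in); free(h_bs);
        return 0;
    }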
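The texture-binding optimization of contribution 3 can be illustrated with the legacy texture-reference API that CUDA offered at the time of the thesis (removed in CUDA 12 and later, where texture objects or __ldg() serve the same purpose). The sketch binds a read-only input buffer to a texture reference so that its reads go through the cached texture path; the kernel and buffer names are placeholders, not identifiers from the thesis.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Legacy texture reference for read-only, linearly addressed data. */
    texture<float, 1, cudaReadModeElementType> texIn;

    /* Reads of the read-only input go through the cached texture path
     * rather than through (possibly uncoalesced) global loads. */
    __global__ void scale(float *out, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = factor * tex1Dfetch(texIn, i);
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        float *d_in, *d_out;
        int i;

        for (i = 0; i < n; i++) h[i] = (float)i;
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

        /* Bind the read-only device buffer to the texture reference. */
        cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d_out, 2.0f, n);
        cudaUnbindTexture(texIn);

        cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("out[10] = %f\n", h[10]);   /* expect 20.0 */

        cudaFree(d_in); cudaFree(d_out);
        free(h);
        return 0;
    }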
Keywords/Search Tags: GPGPU, CUDA, parallelization, compilation optimization