
High Level Programming Model And Compiler Optimizations For CPU-GPU Heterogeneous Systems

Posted on: 2014-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Q Li
GTID: 1228330398964268
Subject: Computer system architecture
Abstract/Summary:
By combining architectures with different characteristics, a CPU-GPU heterogeneous system can handle general-purpose computation more efficiently than a homogeneous system. However, the GPU has a more complicated memory hierarchy and an address space separate from the host CPU, which makes heterogeneous systems hard to program: fully utilizing the computing power requires a thorough understanding of both the algorithm and the underlying hardware architecture, and computation kernels are poorly portable.

This thesis addresses these problems of the programming model for heterogeneous systems through a high-level programming model and compiler optimization techniques. We study the programmability and portability of the programming model, together with automatic performance optimization techniques in the compiler. The main contributions are as follows:

(1) We study the expressiveness required of a directive-based language and extend OpenHMPP to satisfy that requirement. We first examine the advantages and disadvantages of the different approaches to improving the programmability and portability of heterogeneous systems, and demonstrate the advantages of directive-based languages. We then add a minimal set of extensions to OpenHMPP, yielding a new directive-based language, OpenHMPP+, which provides better expressiveness and performance optimization support.

(2) We design a complete source-to-source compiler framework that compiles the OpenHMPP+ directive-based language into optimized CUDA code. By combining the programmability and portability provided by OpenHMPP+ with the performance guaranteed by the variety of optimization techniques implemented in the compiler, our solution achieves a balance of programmability, portability, and performance.
We implement a variety of optimization and supporting techniques in the compiler, with particular attention to ordering these techniques so that they collaborate effectively.

(3) We study the techniques underlying the compiler implementation. We propose an algorithm that removes false dependencies, broadening the applicability of the Cetus loop-parallelism detection algorithm; an algorithm that maps multiple parallel loops onto the CUDA threading model, maximally exploiting the detected loop parallelism; an algorithm that reduces the total reuse distance of array accesses, creating opportunities for further optimizations; an algorithm that selects an optimal kernel launch configuration, so that when the degree of parallelism is low, different SMs receive substantially the same amount of computation; and a runtime system that selects either the multi-core CPU or the GPU to run a kernel according to the execution environment, so that the program can still execute on the multi-core CPU of a system that has no CUDA-enabled GPU or whose GPU has insufficient device memory.

(4) We design experiments to evaluate the compiler framework. Fifteen applications from different domains are chosen for evaluation. The results show that the CUDA code generated by our compiler achieves, on average, 70% of the performance of hand-written CUDA code.

From the design and analysis of the high-level directive-based language and the source-to-source compiler, several conclusions can be drawn. First, when an algorithm is inherently parallel, the compiler can generate parallel kernels from little more information than the positions of the parallel regions and the loop parallelism within them. Second, converting non-coalesced memory accesses into coalesced ones and exploiting reuse information are the two key optimizations.
Finally, the L2 cache newly added in the Fermi architecture benefits many applications, but streaming applications would require a more efficient replacement policy, and some applications would benefit from a larger register address space.
Keywords/Search Tags: GPU, heterogeneous systems, programming model, programmability, scalability, performance optimization, runtime system, source-to-source compilation