
Research On Many-core GPU Architectures

Posted on: 2012-09-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Chen
GTID: 1118330371465412
Subject: Computer system architecture
Abstract/Summary:
High-performance microprocessor architectures have evolved into many-core designs, driven by the demands of large-scale data-level parallel applications for scalability, computational power and memory bandwidth. As a novel many-core architecture, modern GPUs devote a massive number of transistors to ALU resources. Meanwhile, modern GPUs have relatively simple control logic and highly efficient memory bandwidth. These characteristics make modern GPUs well suited to general-purpose data-parallel computing with a high performance/cost ratio, which has formed a new research area called general-purpose computing on GPUs (GPGPU).

Due to constraints of architecture and programmability, earlier generations of GPUs were not widely deployed in general-purpose computing domains. High-level programming models (e.g., AMD Stream, NVIDIA CUDA and OpenCL) have lowered the learning curve of GPU programming. To save design cost and maintain architectural scalability, modern GPUs typically employ a decentralized hardware organization. Compared with CPU memory systems, the GPU memory system is optimized for throughput rather than latency. GPUs can support thousands of concurrently executing threads, with zero-cost, hardware-controlled context switching between threads to tolerate memory latency. However, when a program contains many irregular memory access patterns, large numbers of threads access memory simultaneously, which can stall the ALUs and waste computational resources. Hence, manually developing high-performance parallel code for GPU architectures remains very challenging, because it is hard to fully exploit the architectural advantages of GPUs under high-level programming models. Achieving high performance requires careful consideration of how to efficiently map code onto the GPU hardware. Furthermore, given the significant differences between the sequential programming model of CPUs and the parallel programming model of GPUs, application development and optimization methods differ substantially between the two. Owing to the complexity of the underlying GPU hardware, compilers do not perform adequate optimizations. To fully exploit the computational capability of GPUs for general-purpose data-parallel computing, this paper investigates performance evaluation and optimization methods targeting GPU architectures. The contributions of this paper can be summarized as follows:

(1) When porting application programs to GPU architectures, many factors can degrade performance. A quantitative performance model helps determine the actual potential of a specific application program ported to GPU architectures. Due to the complexity of GPU architectures, traditional parallel performance models are not applicable. To evaluate the potential execution performance of an application program and identify performance bottlenecks, this paper presents a quantitative GPGPU performance model. Built on an abstract GPU architecture, the model embodies the architectural features that affect the performance of a GPGPU kernel, such as global memory access, local memory access, overlap of memory access with useful computation, conditional branch divergence and synchronization. By statically analyzing an application program while taking its specific execution configuration into account, the model can approximately estimate the program's execution time without requiring the actual GPGPU kernel to be written. Analytical and experimental results show that the model estimates the execution time of application programs relatively accurately.
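The dissertation's model itself is not reproduced in this abstract; as a minimal sketch of the style of estimate such a model produces, the following host-side code combines per-thread compute and memory cycle counts, with overlap modeled by taking the maximum of the two phases and branch divergence modeled as an inflation factor. Every field, constant and function name here is an illustrative assumption, not the dissertation's actual model.

```cuda
// Minimal host-side sketch of an analytical GPGPU performance estimate.
// All structure fields and constants are hypothetical assumptions.
#include <algorithm>
#include <cstdio>

struct KernelStats {            // gathered by static analysis (assumed)
    long long computeInsts;     // ALU instructions per thread
    long long globalAccesses;   // off-chip memory accesses per thread
    double    divergenceFactor; // >1.0 when divergence serializes warps
    int       threads;          // total threads in the launch configuration
    int       concurrent;       // threads the GPU can keep in flight at once
};

// Estimated cycles: the compute and memory phases overlap, so the
// slower of the two dominates; divergence inflates the compute portion.
double estimateCycles(const KernelStats& k,
                      double cyclesPerInst, double memLatency) {
    double waves   = (double)k.threads / k.concurrent; // thread batches
    double compute = k.computeInsts * cyclesPerInst * k.divergenceFactor;
    double memory  = k.globalAccesses * memLatency;
    return waves * std::max(compute, memory);          // overlap: max, not sum
}

int main() {
    KernelStats k = {200, 8, 1.5, 1 << 20, 1 << 14};   // hypothetical numbers
    std::printf("estimated cycles: %.0f\n", estimateCycles(k, 4.0, 400.0));
}
```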
(2) In the GPU memory hierarchy, global memory is large but has long access latency, while fast memories (e.g., local memory) offer low latency but limited capacity. Hence, improving the data layout in global memory to reduce irregular memory accesses, and using the fast memories effectively, are crucial for reducing the overall memory access overhead of a GPGPU kernel. To exploit the memory bandwidth advantages of GPU architectures, this paper proposes memory optimization methods based on the polyhedral model. A polyhedral representation of the source program is built to optimize accesses to, and allocation across, the GPU memory system. By examining the memory access patterns of the source program, access instances that can be grouped together are discovered by means of graph coloring. Subsequently, data-space transformation is applied to regularize irregular memory access patterns and improve off-chip memory bandwidth. Data reuse information is then used to allocate data into distinct fast memory regions according to both the properties of the data accesses and the characteristics of the GPU memory model. Finally, coordinate conversion and offset insertion techniques are proposed to optimize IMAGE memory object accesses and avoid local memory bank conflicts, making the best use of fast on-chip memory. Experimental results on a set of benchmarks show that the optimized programs achieve speedups of 1.2x-8.4x over the unoptimized versions. A sketch of the kind of data-layout transformation involved appears after contribution (3) below.

(3) Loop and array constructs exhibit inherent computational intensity and data parallelism, and are naturally parallelized in a GPU kernel. However, some application programs exhibit complex dependence relationships and control flow, which hinder their efficient execution on GPU architectures. Given that GPU architectures emphasize both computational intensity and data parallelism, combining computation restructuring with data restructuring can better improve the performance of GPGPU kernels. To map data-parallel applications onto GPU architectures with high performance, this paper proposes program restructuring methods targeting GPU architectures. Computation restructuring, such as loop merging and splitting, enhances the parallelism and compute intensity of application programs, improving GPU ALU utilization and hence memory latency hiding. Inter-thread and intra-thread data restructuring eliminates irregular data access patterns, reducing the number of memory accesses and improving the achieved memory bandwidth. Branch restructuring, such as conditional execution, branch simplification and indirect indexing, decreases the negative impact of branch divergence. Experimental results show that the proposed restructuring methods achieve speedups of 1.18x-2.56x over the unrestructured versions.
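As a concrete illustration of the data-space transformation described in contribution (2), the following CUDA sketch contrasts a strided array-of-structures access, which cannot be coalesced, with the structure-of-arrays layout such a transformation would produce. The kernels and the Particle type are hypothetical examples, not the dissertation's code.

```cuda
// Hypothetical array-of-structures layout: consecutive threads touch
// addresses sizeof(Particle) bytes apart, so accesses are not coalesced.
struct Particle { float x, y, z, w; };

__global__ void scaleAoS(Particle* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;        // stride = 16 bytes between threads
}

// Structure-of-arrays layout after the data-space transformation:
// consecutive threads read consecutive floats in one coalesced transaction.
__global__ void scaleSoA(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;          // stride = 4 bytes between threads
}
```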
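Similarly, conditional execution, one of the branch restructuring techniques listed in contribution (3), can be illustrated by rewriting a divergent branch as a predicated select. The following hypothetical kernels sketch this idea; they are not taken from the dissertation.

```cuda
// Divergent form: threads of the same warp may take different paths,
// so the warp executes both paths serially.
__global__ void clampDivergent(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (a[i] < 0.0f) a[i] = 0.0f;
        else             a[i] = a[i] * 2.0f;
    }
}

// Restructured form: the branch becomes a predicated select, so all
// threads in the warp execute the same instruction stream.
__global__ void clampPredicated(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = a[i];
        a[i] = (v < 0.0f) ? 0.0f : v * 2.0f;  // compiler emits select/predication
    }
}
```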
(4) When parallelizing non-compute-intensive applications on GPU architectures, the memory wall is a prominent problem. By designing and implementing a novel parallel Smith-Waterman algorithm for biological sequence alignment, this paper proposes a general optimization strategy for memory-bound applications targeting GPU architectures. The approach enhances parallelism by altering the computation flow and data dependencies of the original Smith-Waterman algorithm and applies several architecture-aware optimizations, greatly improving the performance and efficiency of the sequence alignment algorithm. Experimental results show that the proposed algorithm achieves a speedup of almost 115x, effectively alleviating the memory wall problem in data-parallel applications.
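A standard way to expose parallelism in Smith-Waterman dynamic programming, consistent with the dependency restructuring described above, is anti-diagonal ("wavefront") computation: cell (i, j) depends only on its top, left and top-left neighbors, so all cells on one anti-diagonal are independent. The following minimal sketch uses simplified linear gap scoring and hypothetical names; the dissertation's actual restructuring and architecture-aware optimizations are not shown.

```cuda
// Minimal sketch of anti-diagonal (wavefront) Smith-Waterman.
// Scoring constants are illustrative assumptions.
#define MATCH     2
#define MISMATCH (-1)
#define GAP      (-1)

__device__ int max4(int a, int b, int c, int d) {
    return max(max(a, b), max(c, d));
}

// H is an (m+1) x (n+1) score matrix in row-major global memory.
// 'diag' selects the anti-diagonal i + j == diag being computed;
// all cells on it are independent and computed by separate threads.
__global__ void swDiagonal(const char* s1, const char* s2,
                           int* H, int m, int n, int diag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // row index
    int j = diag - i;                                   // column index
    if (i >= 1 && i <= m && j >= 1 && j <= n) {
        int sub = (s1[i - 1] == s2[j - 1]) ? MATCH : MISMATCH;
        int w = n + 1;
        H[i * w + j] = max4(0,
                            H[(i - 1) * w + (j - 1)] + sub,  // diagonal
                            H[(i - 1) * w + j] + GAP,        // up
                            H[i * w + (j - 1)] + GAP);       // left
    }
}

// Host side would launch one kernel per anti-diagonal, diag = 2 .. m + n.
```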
The performance evaluation model and optimization methods presented in this paper provide insight for developing high-performance GPGPU applications and optimizing compilers, and are also applicable to other many-core architectures.

Keywords/Search Tags: Many-core, GPU, GPGPU, data-parallel, performance model, memory optimization, program restructuring