Research Of Parallel Optimization Techniques On GPU Computing Platforms

Posted on:2013-08-06

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H P Jia

Full Text:PDF

GTID:1268330401974105

Subject:Computer application technology

Abstract/Summary:

More and more application developers are adopting GPUs as standard computing accelerators because of their increasing computing power and programmability. However, it is hard to reach the required performance without careful optimization, because the burden of achieving performance has shifted from hardware designers to application developers. Unfortunately, optimizing GPU programs is very difficult. The essence of the process is to achieve the best match between the features of an algorithm and the characteristics of the underlying hardware. On the one hand, optimization requires deep technical knowledge of the underlying hardware, and modern GPU architectures are becoming more and more diverse, which further exacerbates an already difficult problem. On the other hand, the applications ported to GPUs are also becoming increasingly diverse. Overall, these applications can be divided into two categories: regular applications and irregular applications. Optimization methods and strategies differ considerably across programs and hardware platforms.

To simplify the optimization of GPU programs and enable application developers to write high-performance GPU programs more easily, we divide our work into two parts according to the different characteristics of GPU applications.

For regular applications, we propose the concept of a performance optimization chain and divide it into two categories, the threshold optimization chain and the tradeoff optimization chain, according to the differences between GPU computing and memory access. We also make the optimization chain insightful by introducing the Roofline model, and establish GPURoofline, an insightful performance model for guiding optimizations on GPUs. This model provides performance information that identifies the bottlenecks of a GPU program and indicates which optimization methods should be adopted.
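As an illustration of the kind of reasoning a Roofline-style model supports, the following is a minimal sketch that assumes the standard Roofline formulation (attainable performance is the smaller of peak compute and bandwidth times compute intensity). It is not the GPURoofline model itself, whose exact definitions are not given in this abstract. The function names and the device numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Compute intensity (FLOP/byte) at which a device stops being
    limited by memory bandwidth and becomes limited by compute."""
    return peak_gflops / peak_bandwidth_gbs


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Standard Roofline bound: min(peak compute, bandwidth * intensity)."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def classify(intensity, peak_gflops, peak_bandwidth_gbs):
    """Label a kernel by comparing its intensity with the ridge point."""
    if intensity < ridge_point(peak_gflops, peak_bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


# Hypothetical device, NOT a real GPU specification.
PEAK_GFLOPS = 1000.0        # assumed peak compute throughput
PEAK_BANDWIDTH_GBS = 100.0  # assumed peak off-chip memory bandwidth

# Hypothetical kernels with assumed compute intensities in FLOP/byte.
for name, intensity in [("streaming kernel", 0.25), ("dense kernel", 40.0)]:
    bound = attainable_gflops(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    label = classify(intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    print(f"{name}: at most {bound} GFLOP/s ({label})")
```

With these made-up numbers the ridge point is 10 FLOP/byte, so the kernel at 0.25 FLOP/byte is capped at 25 GFLOP/s by memory bandwidth, while the kernel at 40 FLOP/byte can in principle reach the 1000 GFLOP/s compute peak. A memory-bound result suggests starting with memory-access optimizations, and a compute-bound result suggests starting with compute optimizations.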
GPURoofline is especially useful for programmers, particularly non-expert programmers with limited knowledge of GPU architectures, who want to implement high-performance GPU kernels directly. We also demonstrate the use of GPURoofline by optimizing three representative GPU kernels with different compute intensities and program characteristics.

For irregular applications, we take the Viola-Jones face detection algorithm as an example to introduce five key techniques for optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queue, and global queue. We also propose a tunable GPU kernel, built by defining and extracting performance parameters, that achieves performance portability across different GPU platforms for the Viola-Jones face detection algorithm. We demonstrate the high performance of our implementation by comparing it with a well-optimized CPU version from the OpenCV library. Experimental results show speedups ranging from 5.19 to 27.724, from 6.468 to 35.080, and from 5.850 to 28.768 on the AMD HD5850, AMD HD7970, and NVIDIA C2050 GPUs, respectively.

In summary, our key contributions are as follows:

1. Comparison and analysis of the differences and similarities among current mainstream GPU architectures. We propose three effective ways to improve the performance of programs on GPUs: improving the utilization of off-chip memory bandwidth, improving the utilization of computing resources, and improving data locality.
2. Definitions of hardware compute intensity and algorithm compute intensity. Starting from these definitions, we classify algorithms as memory-bound or computation-bound by measuring these quantities. Furthermore, we build the performance optimization chain and divide it into two categories, the threshold optimization chain and the tradeoff optimization chain, according to the differences between GPU computing and memory access.
3. GPURoofline: an empirical and insightful performance model for guiding performance optimizations. We make the optimization chain insightful by introducing the Roofline model, so that optimizations can be guided in a more intuitive way.
4. We introduce five key techniques for optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queue, and global queue. We demonstrate the use of these five methods by implementing and optimizing the Viola-Jones face detection algorithm on GPUs. Finally, we complete a tunable GPU kernel by defining and extracting performance parameters, in order to verify the possibility of performance portability across different GPU platforms.
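The persistent-thread and global-queue ideas named above can be illustrated with a small CPU-side analogy. This is only a sketch under stated assumptions, not the dissertation's GPU implementation: Python threads stand in for GPU threads, a lock-protected counter stands in for an atomic increment on device memory, and `cascade_depth` is a hypothetical task whose cost varies per item, loosely mimicking detection windows that leave a classifier cascade at different stages.

```python
import threading


def run_persistent_workers(tasks, process, num_workers=4):
    """Process every task with a fixed pool of long-lived workers.

    Each worker loops, claiming the next unclaimed index from a shared
    counter (the "global queue") until no work is left.
    """
    results = [None] * len(tasks)
    next_index = 0           # head of the global queue
    lock = threading.Lock()  # stands in for a GPU atomic increment

    def worker():
        nonlocal next_index
        while True:
            with lock:       # claim exactly one task index
                i = next_index
                next_index += 1
            if i >= len(tasks):
                return       # queue drained, so this worker retires
            results[i] = process(tasks[i])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def cascade_depth(window):
    """Hypothetical irregular task: count the stages an item survives."""
    stages = 0
    while stages < 10 and window % (stages + 2) != 0:
        stages += 1
    return stages


windows = list(range(1, 50))
depths = run_persistent_workers(windows, cascade_depth, num_workers=4)
```

Because workers pull tasks on demand rather than receiving a fixed share up front, a worker that draws cheap items simply claims more of them, which is the load-balancing effect that motivates this pattern for irregular workloads.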

Keywords/Search Tags:

GPU, Performance Optimization Chain, GPURoofline, Coarse-grained Parallelism, Local and Global Queue

Related items

1	Parallel Optimization Technology For Satisfiability Problem Based On Multi-core Platform
2	Coarse-grained Speculative Parallelism and Optimization
3	General-purpose Algorithms Implementation And Optimization For Coarse-grained Dynamically Reconfigurable Processor
4	Configuration Optimization Research And Implementation For Coarse-grained Reconfigurable Processor
5	Java EE Gateway Website Performance Optimization Technique Research And Implementation
6	Research On Performance Optimization Of Coarse-grain Reconfigurable Array Processor
7	Design And Optimization Research Of Compiler Backend For Large-scale Coarse Grained Reconfigurable Architecture
8	Fine-grained and coarse-grained architectures for two-dimensional discrete wavelet transform
9	Research On Coarse-grained High-performance Reconfigurable Architecture And Automatic Design Methodology
10	Study On The Computation Performance And Application Of The Coarse-grained Parallel Genetic Algorithms