
Research on Optimization Methods for Branch-Intensive Applications on GPUs

Posted on: 2015-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: C. Qian
Full Text: PDF
GTID: 2348330509960586
Subject: Computer Science and Technology

Abstract/Summary:
Nowadays, the CPU+GPU combination is widely used as a heterogeneous computing model. Since NVIDIA created the CUDA programming framework in 2006, more and more applications in many specialized fields have been accelerated with CUDA. CUDA is built on the SIMT (Single Instruction, Multiple Thread) execution model. While SIMT improves the utilization of hardware resources, it also introduces problems. The warp is a key concept in SIMT, as it is the minimum unit of thread scheduling and management: in every cycle, all threads in a warp must execute the same instruction. Consequently, when a conditional branch (or a similar situation) causes threads in a warp to follow different paths, control divergence occurs, and it can seriously reduce program efficiency.

The main goal of our work is to reduce the influence of control divergence on the fly and thereby improve program performance. In this thesis, we introduce a software optimization algorithm based on thread swapping. It can be applied on real hardware, effectively reduces control divergence, and correspondingly shortens program running time. Our main work can be summarized as follows:

1) We analyze the causes of control divergence in branch-intensive applications and classify applications along two axes: (1) Easy-Handled vs. Hard-Handled, and (2) Thread-Index Dependent vs. Thread-Data Dependent. This divides CUDA programs into four classes, and a specialized algorithm can be designed for each class.

2) We introduce and design an optimization framework based on thread swapping. Mirroring the application classification, the framework has two variants: DIMA and PIMA. After optimization, the control divergence in a program can be almost entirely eliminated; the kernel speedup reaches about 1.5x, and the speedup of the whole execution time reaches about 1.2x.

3) We study the influence of the algorithm's parameters. The thread-swapping range is an important factor for Thread-Index Dependent applications, while the pre-processing time is an important factor for PIMA-suitable applications. Through experiments, we find that the best thread-swapping range and pre-processing time differ from application to application.

4) We analyze the algorithm's overhead and introduce a pipeline-based latency-hiding algorithm. This algorithm uses the asynchronous function cudaMemcpyAsync to overlap data transfer with kernel execution, hiding the overhead and further improving program performance. The optimized program achieves an average speedup of about 1.1x over the original.
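The control-divergence mechanism described above can be illustrated with a small host-side simulation (written in Python rather than CUDA so it runs without a GPU; the 32-thread warp size matches NVIDIA hardware). A warp is divergent when its threads disagree on a branch predicate, forcing the hardware to execute both paths serially:

```python
WARP_SIZE = 32

def divergent_warps(predicates, warp_size=WARP_SIZE):
    """Count warps whose threads disagree on a branch predicate.

    predicates[i] is the branch outcome taken by thread i; a warp that
    contains both True and False threads must execute both paths serially.
    """
    count = 0
    for w in range(0, len(predicates), warp_size):
        warp = predicates[w:w + warp_size]
        if any(warp) and not all(warp):
            count += 1
    return count

n = 256  # 8 warps of 32 threads
# A per-thread branch such as `if (threadIdx.x % 2)` splits every warp...
odd_even = [tid % 2 == 1 for tid in range(n)]
# ...while a warp-aligned branch such as `if ((threadIdx.x / 32) % 2)`
# keeps every warp uniform, so no divergence occurs.
warp_aligned = [(tid // WARP_SIZE) % 2 == 1 for tid in range(n)]

print(divergent_warps(odd_even))      # all 8 warps diverge
print(divergent_warps(warp_aligned))  # 0 warps diverge
```

Both branches do the same logical amount of work; only the mapping of branch outcomes onto warps differs, which is exactly the degree of freedom the thread-swapping approach exploits.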
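The abstract does not spell out the details of DIMA or PIMA, so the following is only a sketch of the general thread-swapping idea for a Thread-Data Dependent branch: in a pre-processing step, remap thread indices so that threads taking the same branch outcome land in the same warp (all function and variable names here are illustrative, not from the thesis):

```python
import random

WARP_SIZE = 32

def swap_threads(data, predicate):
    """Return a permutation of thread ids grouped by branch outcome.

    Models the pre-processing step of a thread-swapping scheme as a
    stable sort of thread ids by the branch predicate, so that warps
    become (almost) branch-uniform.
    """
    return sorted(range(len(data)), key=lambda tid: predicate(data[tid]))

def divergent_warps(predicates, warp_size=WARP_SIZE):
    """Count warps containing both branch outcomes."""
    return sum(
        1
        for w in range(0, len(predicates), warp_size)
        if any(predicates[w:w + warp_size])
        and not all(predicates[w:w + warp_size])
    )

# Data-dependent branch: roughly half the threads take each path at random.
random.seed(0)
data = [random.randint(0, 99) for _ in range(256)]
pred = lambda x: x < 50

before = divergent_warps([pred(x) for x in data])
order = swap_threads(data, pred)
after = divergent_warps([pred(data[tid]) for tid in order])

# After remapping, only the single warp straddling the False/True
# boundary of the sorted order can still diverge.
print(before, after)
```

The remaining design questions — how far threads may be swapped (the swapping range) and how much pre-processing time the sort is allowed to cost — are exactly the parameters studied in point 3.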
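The benefit of the pipeline-based latency hiding in point 4 — overlapping host-to-device copies with kernel execution, as cudaMemcpyAsync on separate CUDA streams permits — can be estimated with simple two-stage pipeline arithmetic. The per-chunk times below are made up for illustration, not measurements from the thesis:

```python
def sequential_time(n_chunks, copy_t, kernel_t):
    """All chunks are copied first, then all kernels run: no overlap."""
    return n_chunks * copy_t + n_chunks * kernel_t

def pipelined_time(n_chunks, copy_t, kernel_t):
    """Chunk i's kernel runs while chunk i+1 is being copied.

    Classic two-stage pipeline formula: one copy to fill the pipe,
    then the slower stage dominates each of the remaining steps,
    plus the final kernel to drain the pipe.
    """
    return copy_t + (n_chunks - 1) * max(copy_t, kernel_t) + kernel_t

# Example: 8 chunks, 2 ms copy and 3 ms kernel per chunk.
seq = sequential_time(8, 2.0, 3.0)   # 8*2 + 8*3 = 40 ms
pipe = pipelined_time(8, 2.0, 3.0)   # 2 + 7*3 + 3 = 26 ms
print(seq, pipe, seq / pipe)
```

In this toy setting the copies are almost fully hidden behind the kernels; the ~1.1x average speedup reported in the abstract reflects real workloads, where the overlap window is smaller.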
Keywords/Search Tags: CUDA, Control Divergence, Thread-Swapping-Based Optimization, Latency Hiding