
Research on Optimization Methods for Branch-Intensive Applications on GPUs

Posted on: 2015-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: C. Qian
Full Text: PDF
GTID: 2348330509960586
Subject: Computer Science and Technology

Abstract/Summary:
Nowadays, the CPU+GPU combination is widely used as a heterogeneous computing model. Since NVIDIA created the CUDA programming framework in 2006, more and more applications in many specialized fields have been accelerated with CUDA. CUDA is built on the SIMT (Single Instruction, Multiple Thread) execution model. While SIMT improves the utilization of hardware resources, it also introduces problems. The warp is a key concept in SIMT, as it is the minimum unit of thread scheduling and management: in every cycle, all threads in a warp must execute the same instruction. Consequently, when a conditional branch (or a similar situation) causes threads in a warp to follow different paths, control divergence occurs, and it can seriously reduce program efficiency.

The main goal of our work is to reduce the influence of control divergence on the fly and thereby improve program performance. In this thesis, we introduce a software optimization algorithm based on thread swapping. It can be applied on real hardware, effectively reduces control divergence, and correspondingly shortens program running time. Our main work can be summarized as follows:

1) We analyze the causes of control divergence in branch-intensive applications and classify applications along two axes: (1) Easy-Handled vs. Hard-Handled, and (2) Thread-Index Dependent vs. Thread-Data Dependent. This divides CUDA programs into four classes, and a specialized algorithm can be designed for each class.

2) We introduce and design an optimization framework based on thread swapping. Mirroring the application classification, the framework has two variants: DIMA and PIMA. After optimization, the control divergence in a program can be almost entirely eliminated; the kernel speedup reaches about 1.5x, and the speedup of the whole execution time reaches about 1.2x.

3) We study the influence of the algorithm's parameters. The thread-swapping range is an important factor for Thread-Index Dependent applications, while the pre-processing time is an important factor for PIMA-suitable applications. Through experiments, we find that the best thread-swapping range and pre-processing time differ from application to application.

4) We analyze the algorithm's overhead and introduce a pipeline-based latency-hiding algorithm. This algorithm uses the asynchronous function cudaMemcpyAsync to overlap data transfer with kernel execution, hiding the overhead and further improving program performance. The optimized program achieves an average speedup of about 1.1x over the original.
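The control-divergence mechanism described above can be illustrated with a small host-side simulation (written in Python rather than CUDA so it runs without a GPU; the 32-thread warp size matches NVIDIA hardware). A warp is divergent when its threads disagree on a branch predicate, forcing the hardware to execute both paths serially:

```python
WARP_SIZE = 32

def divergent_warps(predicates, warp_size=WARP_SIZE):
    """Count warps whose threads disagree on a branch predicate.

    predicates[i] is the branch outcome taken by thread i; a warp that
    contains both True and False threads must execute both paths serially.
    """
    count = 0
    for w in range(0, len(predicates), warp_size):
        warp = predicates[w:w + warp_size]
        if any(warp) and not all(warp):
            count += 1
    return count

n = 256  # 8 warps of 32 threads
# A per-thread branch such as `if (threadIdx.x % 2)` splits every warp...
odd_even = [tid % 2 == 1 for tid in range(n)]
# ...while a warp-aligned branch such as `if ((threadIdx.x / 32) % 2)`
# keeps every warp uniform, so no divergence occurs.
warp_aligned = [(tid // WARP_SIZE) % 2 == 1 for tid in range(n)]

print(divergent_warps(odd_even))      # all 8 warps diverge
print(divergent_warps(warp_aligned))  # 0 warps diverge
```

Both branches do the same logical amount of work; only the mapping of branch outcomes onto warps differs, which is exactly the degree of freedom the thread-swapping approach exploits.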
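The abstract does not spell out the details of DIMA or PIMA, so the following is only a sketch of the general thread-swapping idea for a Thread-Data Dependent branch: in a pre-processing step, remap thread indices so that threads taking the same branch outcome land in the same warp (all function and variable names here are illustrative, not from the thesis):

```python
import random

WARP_SIZE = 32

def swap_threads(data, predicate):
    """Return a permutation of thread ids grouped by branch outcome.

    Models the pre-processing step of a thread-swapping scheme as a
    stable sort of thread ids by the branch predicate, so that warps
    become (almost) branch-uniform.
    """
    return sorted(range(len(data)), key=lambda tid: predicate(data[tid]))

def divergent_warps(predicates, warp_size=WARP_SIZE):
    """Count warps containing both branch outcomes."""
    return sum(
        1
        for w in range(0, len(predicates), warp_size)
        if any(predicates[w:w + warp_size])
        and not all(predicates[w:w + warp_size])
    )

# Data-dependent branch: roughly half the threads take each path at random.
random.seed(0)
data = [random.randint(0, 99) for _ in range(256)]
pred = lambda x: x < 50

before = divergent_warps([pred(x) for x in data])
order = swap_threads(data, pred)
after = divergent_warps([pred(data[tid]) for tid in order])

# After remapping, only the single warp straddling the False/True
# boundary of the sorted order can still diverge.
print(before, after)
```

The remaining design questions — how far threads may be swapped (the swapping range) and how much pre-processing time the sort is allowed to cost — are exactly the parameters studied in point 3.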
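The benefit of the pipeline-based latency hiding in point 4 — overlapping host-to-device copies with kernel execution, as cudaMemcpyAsync on separate CUDA streams permits — can be estimated with simple two-stage pipeline arithmetic. The per-chunk times below are made up for illustration, not measurements from the thesis:

```python
def sequential_time(n_chunks, copy_t, kernel_t):
    """All chunks are copied first, then all kernels run: no overlap."""
    return n_chunks * copy_t + n_chunks * kernel_t

def pipelined_time(n_chunks, copy_t, kernel_t):
    """Chunk i's kernel runs while chunk i+1 is being copied.

    Classic two-stage pipeline formula: one copy to fill the pipe,
    then the slower stage dominates each of the remaining steps,
    plus the final kernel to drain the pipe.
    """
    return copy_t + (n_chunks - 1) * max(copy_t, kernel_t) + kernel_t

# Example: 8 chunks, 2 ms copy and 3 ms kernel per chunk.
seq = sequential_time(8, 2.0, 3.0)   # 8*2 + 8*3 = 40 ms
pipe = pipelined_time(8, 2.0, 3.0)   # 2 + 7*3 + 3 = 26 ms
print(seq, pipe, seq / pipe)
```

In this toy setting the copies are almost fully hidden behind the kernels; the ~1.1x average speedup reported in the abstract reflects real workloads, where the overlap window is smaller.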
Keywords/Search Tags: CUDA, Control Divergence, Thread-Swapping-Based Optimization, Latency Hiding