Research Of Optimization Method About Branch Divergence And Irregular Memory Access On GPU

Posted on:2016-06-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q Yu

Full Text:PDF

GTID:2348330536967304

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, GPUs have been widely adopted for general-purpose computing and have become an important part of high-performance computing systems. However, because GPUs use the SIMT execution model, their efficiency suffers when a GPU application exhibits branch divergence. To save memory bandwidth and reduce memory access latency, GPUs also provide memory coalescing. Although this mechanism improves memory access efficiency, irregular memory accesses still degrade program performance considerably. To address these two problems, this thesis analyses their causes, proposes corresponding optimization methods, and evaluates them on several programs using the GPGPU-Sim simulator. It also discusses the impact of the proposed methods on the performance, power consumption, and energy consumption of the optimized programs. The main work can be summarized as follows:

1) We analysed the cause of branch divergence and the effect of the existing thread swapping algorithm on power consumption, and then further optimized Reduction and Bitonic Sort. For Reduction, the existing thread swapping algorithm increases power consumption; to address this, we modified the algorithm to reduce bank conflicts in shared memory, which lowered power consumption. Experiments show that this method reduces power consumption by 5% at the cost of a 5% performance loss on average. For Bitonic Sort, we eliminated some unnecessary swaps based on an analysis of the original thread swapping algorithm. Experiments show that this method improves performance by 6% with no gain in power consumption.

2) We discussed the impact of the thread swapping range on program performance. The thread swapping range is an important parameter when the thread swapping algorithm is used to optimize programs. We found that the larger the range, the more thoroughly threads are swapped and the more likely branch divergence is to be reduced, but also the more extra overhead is incurred. Consequently, the appropriate thread swapping range depends on the specific program.

3) We analysed the cause of irregular memory access and proposed an optimization method based on matrix operations. Several programs in the PolyBench/GPU benchmark suite were then optimized and tested. Experimental results show that the proposed method effectively reduces irregular memory accesses and yields significant performance improvements for the tested programs. Under certain conditions, the highest speedup reaches 78.9x, while the average kernel speedup is 35.9x; power consumption increases by 106.2% on average, while energy consumption decreases by 86.2% on average.

4) We identified the reasons for the change in power consumption: total power is the sum of the power drawn by each GPU component, and the changes in the performance counters associated with each component, together with program execution time, mainly account for the change in power consumption before and after optimization. In addition, we studied the effect of shared memory size on the optimization results for programs in PolyBench/GPU. Results show that a larger shared memory yields a higher speedup with little change in power consumption, so the optimization is more effective.
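
The two effects the abstract discusses, branch divergence under SIMT and irregular (uncoalesced) memory access, can be illustrated with a small counting model. The Python sketch below is not the thesis's algorithm and does not run on a GPU: it assumes a 32-thread warp and a 128-byte memory segment per transaction, models thread swapping as regrouping threads by branch outcome inside a window of `swap_range` threads, and uses hypothetical function names throughout.

```python
WARP_SIZE = 32        # assumption: threads per warp
SEGMENT_BYTES = 128   # assumption: bytes served by one memory transaction


def divergent_warps(takes_branch):
    """Count warps whose threads disagree on a branch predicate."""
    count = 0
    for start in range(0, len(takes_branch), WARP_SIZE):
        warp = takes_branch[start:start + WARP_SIZE]
        if any(warp) and not all(warp):
            count += 1
    return count


def swap_threads(takes_branch, swap_range):
    """Simplified stand-in for thread swapping: inside each window of
    swap_range threads, place the branch-taking threads first."""
    regrouped = []
    for start in range(0, len(takes_branch), swap_range):
        window = takes_branch[start:start + swap_range]
        regrouped.extend(sorted(window, reverse=True))  # True before False
    return regrouped


def memory_transactions(addresses):
    """Count the distinct aligned segments touched by one warp's loads."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


# 128 threads where every even-numbered thread takes the branch.
predicate = [tid % 2 == 0 for tid in range(128)]
print(divergent_warps(predicate))                    # 4: every warp diverges
print(divergent_warps(swap_threads(predicate, 32)))  # 4: range too small to help
print(divergent_warps(swap_threads(predicate, 64)))  # 0: each warp is uniform

# One warp reading 4-byte elements of a matrix with 1024 columns.
row_access = [tid * 4 for tid in range(WARP_SIZE)]            # contiguous
column_access = [tid * 1024 * 4 for tid in range(WARP_SIZE)]  # strided
print(memory_transactions(row_access))     # 1 segment
print(memory_transactions(column_access))  # 32 segments
```

In this toy model a swapping range of 32 leaves all four warps divergent, while a range of 64 removes the divergence, which mirrors the observation that a larger range swaps more thoroughly. The model ignores the cost of the swap itself, which is the overhead the abstract says grows with the range.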

Keywords/Search Tags:

Branch Divergence, Irregular Memory Access, Thread Swapping, Matrix Operation, Performance, Power Consumption

Related items

1	Research Of Optimize Method About Branch-intensive Application On GPU
2	Research On Performance Optimization Of General Purpose Graphics Processing Unit Based On Thread Scheduling
3	Research On The Main Factors Influencing Power Consumption Of CUDA Program
4	Research And Implementation Of CNN-Oriented Low Power Consumption SRAM Array
5	The Research Of On-chip Memory System For Low Power Consumption In Universal High-performance Microprocessors
6	Exploring the memory hierarchy design with emerging memory technologies
7	Design Of Low Power Direct Memory Access Module In Microcontroller Unit
8	Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms
9	Design Of Large Capacity STT MRAM Controller
10	Automatic Generation And Performance Optimization Of Code In Stencil Computation