
Research On Performance Optimization Of General Purpose Graphics Processing Unit Based On Thread Scheduling

Posted on: 2019-02-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 1368330545499886
Subject: Computer Science and Technology
Abstract/Summary:
The GPGPU (General Purpose Graphics Processing Unit) has become one of the main accelerators for throughput-oriented and high-performance computing. Because a GPGPU can execute thousands of threads concurrently, its performance can exceed that of a CPU by several times, especially on regular computation patterns, and it can hide the long latency of off-chip memory accesses through efficient thread switching. With the development of GPGPU architectures, they are now widely used for general-purpose computing. However, general-purpose workloads contain many irregular computation and irregular memory-access patterns. Moreover, contention for on-chip resources, especially the cache, arises easily when so many threads execute concurrently. All of these factors degrade GPGPU performance, and many scholars at home and abroad have studied them. Building on an analysis of this prior work, this thesis studies GPGPU performance optimization, focusing on branch divergence (the main source of irregular computation), memory divergence (the main source of irregular memory access), and on-chip resource contention, especially cache contention.

1. A two-stage, synchronization-based thread block compaction and scheduling method, targeting branch divergence. Branch divergence not only reduces the thread-level parallelism of the running task but also lowers the utilization of on-chip resources. Compacting and reorganizing thread blocks is a common remedy, but existing methods suffer from shortcomings such as low compaction efficiency and large synchronization overhead. The proposed method compacts and reorganizes thread blocks in two stages, analyzing the effectiveness of compaction at each stage and weighing the benefit of reorganization against its overhead; together these measures improve compaction effectiveness considerably. The time overhead of thread synchronization is also reduced, because only a fraction of the threads must synchronize, and only in the first stage.

2. A memory-aware TLP throttling and cache bypassing method, targeting on-chip resource contention. With thousands of concurrent threads, on-chip resources are scarce, so contention, especially cache contention, arises easily and can degrade GPGPU performance. TLP throttling and cache bypassing are the main countermeasures. After analyzing the shortcomings of existing methods, the thesis proposes a memory-aware method that throttles TLP at the granularity of a thread group (warp) rather than a thread block. Cache contention is detected with a sliding-window mechanism; when contention occurs, requests with poor data locality bypass the cache, so that requests with better locality can use the cache more effectively and on-chip resource utilization improves. The on-chip network is then checked for congestion; if it is congested, memory tasks are suspended while compute tasks continue to be scheduled, which maintains high TLP.

3. A thread scheduling method based on memory-access priority, targeting memory divergence. Memory divergence arises when the threads of a thread group issue different memory requests, so that a single group can generate many requests at once; irregular memory workloads increase both the amount of divergence and the number of requests, degrading the memory performance of the system. Reordering memory requests, which exploits data locality further and coalesces more requests, is currently one of the main remedies. After analyzing the shortcomings of existing methods, the thesis proposes scheduling based on memory-access priority. The main idea is to let thread groups with better data locality issue their memory requests earlier, so that they run faster; groups that finish earlier release their occupied resources, which helps reduce on-chip resource contention. The method quantifies the priority of each memory request explicitly for the first time, and the request with the highest priority is scheduled first. In addition, the waiting time of each request is considered, so that requests from thread groups with poor data locality and a high degree of memory divergence do not starve, preserving a degree of scheduling fairness.

Experimental results show that the three proposed thread scheduling methods improve GPGPU performance.
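The thread block compaction in contribution 1 can be illustrated with a minimal sketch. The dissertation's actual two-stage mechanism operates in hardware; the function below only shows the core regrouping idea, with the thread representation, field name `branch_taken`, and warp size chosen for illustration:

```python
def compact_warps(threads, warp_size=32):
    """Regroup threads by branch outcome so each reformed warp executes a
    single control-flow path, reducing idle lanes under branch divergence.

    `threads` is a list of dicts with an illustrative boolean field
    'branch_taken'; real hardware would track active masks per warp.
    """
    taken = [t for t in threads if t["branch_taken"]]
    not_taken = [t for t in threads if not t["branch_taken"]]
    warps = []
    # Pack each homogeneous group into full warps of `warp_size` threads.
    for group in (taken, not_taken):
        for i in range(0, len(group), warp_size):
            warps.append(group[i:i + warp_size])
    return warps
```

Without compaction, a warp mixing both outcomes serializes the two paths with half its lanes idle; after regrouping, each warp is homogeneous and runs at full occupancy, which is the utilization gain the method targets.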
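The sliding-window contention detection and locality-based bypassing of contribution 2 can be sketched as follows. The window size, miss-rate threshold, and reuse-count locality test are all illustrative assumptions, not parameters taken from the dissertation:

```python
from collections import deque

class SlidingWindowBypass:
    """Detect cache contention over a sliding window of recent accesses and
    bypass requests with poor data locality while contention persists."""

    def __init__(self, window=64, miss_threshold=0.7, reuse_limit=2):
        self.window = deque(maxlen=window)    # 1 = miss, 0 = hit
        self.miss_threshold = miss_threshold  # contention if miss rate exceeds this
        self.reuse_limit = reuse_limit        # min reuses counting as "good locality"
        self.reuse_count = {}                 # address -> observed accesses

    def record_access(self, addr, hit):
        self.window.append(0 if hit else 1)
        self.reuse_count[addr] = self.reuse_count.get(addr, 0) + 1

    def contended(self):
        # Miss rate over the most recent `window` accesses.
        if not self.window:
            return False
        return sum(self.window) / len(self.window) > self.miss_threshold

    def should_bypass(self, addr):
        # Bypass only when the window signals contention AND the address
        # has shown little reuse (poor data locality); well-reused data
        # keeps its cache space.
        return self.contended() and self.reuse_count.get(addr, 0) < self.reuse_limit
```

The design point, as in the thesis, is that bypassing is conditional: under no contention everything may use the cache, and under contention only low-locality requests are diverted, so high-locality data retains its cache capacity.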
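The priority scheduling with starvation avoidance in contribution 3 reduces to a simple rule: favor requests from thread groups with good data locality, but let priority grow with waiting time. The locality score, the linear aging term, and `age_weight` below are illustrative assumptions rather than the dissertation's actual priority formula:

```python
def schedule_next(requests, age_weight=0.1):
    """Pick the next memory request to issue.

    Each request is a dict with a 'locality' score (higher = better data
    locality, an illustrative metric) and a 'wait' time in cycles.  The
    effective priority favors good locality but grows with waiting time,
    so requests from thread groups with poor locality and high memory
    divergence are eventually served rather than starved.
    """
    def priority(req):
        return req["locality"] + age_weight * req["wait"]

    return max(requests, key=priority)
```

With pure locality ordering a low-locality request could wait forever behind a stream of high-locality ones; the aging term bounds its wait, which is the fairness property the abstract describes.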
Keywords/Search Tags:General Purpose Graphics Processing Unit, Performance Optimization, Thread Scheduling, Branch Divergence, Memory Divergence