
Research On Performance Model And Optimization Of Parallel Applications On Heterogeneous Platforms

Posted on: 2022-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Z Lu
Full Text: PDF
GTID: 1528306839477674
Subject: Cyberspace security
Abstract/Summary:
Machine learning and big data analysis are applied in many areas, but their enormous demand for computing power cannot be satisfied by CPU-only compute nodes. Heterogeneous platforms that pair CPUs with many-core accelerators have therefore become the main computing platforms for machine learning and big data analysis, and among many-core accelerators the GPU (Graphics Processing Unit) is the most prominent. However, heterogeneous platforms are complex, especially the memory and computing architecture of the GPU, which makes parallel applications run inefficiently on it. To overcome this problem, we first build performance models for parallel applications on heterogeneous platforms and analyze their performance. We then apply multiple optimization techniques to improve the memory access and computing resource utilization of regular and irregular parallel applications on the GPU. Regular applications are data-independent, such as general matrix multiplication and convolution; irregular applications are data-dependent, such as graph mining and graph computing. Specifically, the main research contents of this dissertation cover the following four aspects.

First, as the main performance analysis tool, performance modeling can characterize the execution behavior of parallel applications and support performance analysis. This thesis proposes two performance models, for the CPU and the GPU respectively. When running parallel applications on the CPU, developers focus on the scalability of the application. However, existing performance models work at the loop and function level and cannot predict the scalability of parallel applications at large scale. To overcome this problem, this thesis proposes a block-level performance model that accurately predicts the performance and bottlenecks of parallel applications at large scale. When running parallel applications on the GPU, developers focus on how to utilize the GPU efficiently. However, existing GPU performance models require high overhead to collect performance data. To overcome this problem, this thesis proposes a GPU performance model with low overhead.

Second, to utilize the GPU's complex memory system efficiently, this thesis takes 2D convolution as a research object to study optimization methods for regular parallel applications on the GPU. Existing methods convert 2D convolution into matrix multiplication, which involves a large number of duplicate memory accesses. To overcome this problem, this thesis designs two optimization techniques that reduce redundant memory accesses. To reduce duplicate column accesses, we use shuffle instructions to exchange data between threads; to reduce duplicate row accesses, once a row is loaded we use it to compute as many output elements as possible. Compared with mainstream 2D convolution libraries, our approach achieves a 2x speedup and reduces the number of memory accesses by 12.3%.
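To make the shuffle-based exchange concrete, the following is a minimal CUDA sketch of a 3-tap row convolution in which each thread issues a single global load and obtains its neighbours' values from adjacent lanes via warp shuffles. The kernel name, tap count, and halo handling are illustrative assumptions, not the dissertation's actual implementation.

```cuda
#include <cuda_runtime.h>

// Illustrative 3-tap row convolution: each thread loads ONE input element
// and obtains its left/right neighbours from adjacent lanes via warp
// shuffles, so each input value is read from global memory once per warp
// instead of three times. Assumes blockDim.x is a multiple of 32 so all
// lanes reach the shuffles together.
__global__ void row_conv3_shfl(const float* in, float* out, int n,
                               float w0, float w1, float w2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float c = (i < n) ? in[i] : 0.0f;        // single global load per thread

    unsigned mask = 0xffffffffu;             // full warp participates
    float l = __shfl_up_sync(mask, c, 1);    // value held by lane - 1
    float r = __shfl_down_sync(mask, c, 1);  // value held by lane + 1

    // Lanes at warp boundaries have no neighbour inside the warp; a real
    // kernel would handle the halo (e.g. via shared memory). Here we
    // simply zero it for brevity.
    int lane = threadIdx.x & 31;
    if (lane == 0 || i == 0)      l = 0.0f;
    if (lane == 31 || i >= n - 1) r = 0.0f;

    if (i < n) out[i] = w0 * l + w1 * c + w2 * r;
}
```

Extended to 2D, the same idea pairs with the row-reuse technique described above: once a row sits in registers, it is reused across several output rows before the next load is issued.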
Third, to utilize GPU computing resources efficiently, this thesis takes Depthwise Separable (DS) convolution as a research object to study methods for improving GPU utilization in regular parallel applications. Existing methods use a fixed block size for DS convolution, which cannot saturate the GPU when the amount of computation is low. To overcome this problem, we design a dynamic blocking scheme that partitions the output according to the input data, the available computing resources, and the computational intensity. Meanwhile, we design a channel distribution method to improve the computational intensity of each thread. Compared with mainstream DS convolution libraries, our approach improves GPU utilization by 50% and achieves a 2x speedup.

Fourth, building on the optimization methods for regular parallel applications, this thesis takes subgraph matching as a research object to study memory and computing resource utilization optimizations for irregular parallel applications. Existing subgraph matching algorithms write intermediate data multiple times and do not use registers efficiently. To overcome this problem, we design a new graph storage format and a parallel vertex matching method to reduce memory accesses, together with an efficient intermediate-data generation algorithm that utilizes GPU resources efficiently. Compared with state-of-the-art subgraph matching algorithms, our approach achieves an average speedup of 5x.
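For intuition about the matching step, here is a minimal CUDA sketch of one breadth-first extension over a CSR-stored graph: each thread extends one partial match (an edge u->v) by scanning v's adjacency list and appending candidate matches to an intermediate buffer. The CSR layout, kernel signature, and atomic append are illustrative assumptions; the dissertation's storage format, pruning, and intermediate-data generation algorithm are more elaborate than this sketch.

```cuda
#include <cuda_runtime.h>

// CSR adjacency: neighbours of vertex v are adj[row[v] .. row[v+1]).
// One thread per partial match (an edge u->v): extend it by one vertex,
// emitting candidate wedges u->v->w with w != u into a global buffer.
__global__ void extend_edges(const int* __restrict__ row,
                             const int* __restrict__ adj,
                             const int* __restrict__ eu,   // edge sources
                             const int* __restrict__ ev,   // edge targets
                             int num_edges,
                             int* out_u, int* out_v, int* out_w,
                             int* out_count, int out_cap) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;
    int u = eu[e], v = ev[e];
    for (int p = row[v]; p < row[v + 1]; ++p) {
        int w = adj[p];
        if (w == u) continue;                 // avoid walking back
        int slot = atomicAdd(out_count, 1);   // append intermediate match
        if (slot < out_cap) {
            out_u[slot] = u; out_v[slot] = v; out_w[slot] = w;
        }
    }
}
```

In a full pipeline, the emitted (u, v, w) tuples would be filtered against the query pattern and fed into the next extension step; the optimization target described above is precisely to reduce how often such intermediate tuples travel through global memory.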
Keywords/Search Tags:heterogeneous platform, GPU, parallel applications, performance model, performance optimization, convolution, subgraph matching