
Research On Performance Model And Optimization Of Parallel Applications On Heterogeneous Platforms

Posted on: 2022-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Z Lu
Full Text: PDF
GTID: 1528306839477674
Subject: Cyberspace security
Abstract/Summary:
Machine learning and big data analysis are applied in many areas, but their enormous demand for computing power cannot be satisfied by CPU-only compute nodes. Heterogeneous platforms that pair CPUs with many-core accelerators have therefore become the main computing platforms for machine learning and big data analysis, and among many-core accelerators the GPU (Graphics Processing Unit) is the most prominent. However, heterogeneous platforms are complex, especially the memory and computing architecture of the GPU, which makes parallel applications run inefficiently on it. To overcome this problem, we first build performance models for parallel applications on heterogeneous platforms and analyze their performance. We then apply multiple optimization techniques to improve the memory access and computing resource utilization of regular and irregular parallel applications on the GPU. Regular applications are data-independent, such as general matrix multiplication and convolution; irregular applications are data-dependent, such as graph mining and graph computing. Specifically, the main research contents of this dissertation cover the following four aspects.

First, as the main performance analysis tool, performance modeling can characterize the execution behavior of parallel applications and support performance analysis. This thesis proposes two performance models, for the CPU and the GPU respectively. When running parallel applications on the CPU, developers focus on the scalability of the application. However, existing performance models work at the loop and function level and cannot predict the scalability of parallel applications at large scale. To overcome this problem, this thesis proposes a block-level performance model that accurately predicts the performance and bottlenecks of parallel applications at large scale. When running parallel applications on the GPU, developers focus on how to utilize the GPU efficiently. However, existing GPU performance models require high overhead to collect performance data. To overcome this problem, this thesis proposes a GPU performance model with low overhead.

Second, to utilize the GPU's complex memory system efficiently, this thesis takes 2D convolution as a research object to study optimization methods for regular parallel applications on the GPU. Existing methods convert 2D convolution into matrix multiplication, which involves a large number of duplicate memory accesses. To overcome this problem, this thesis designs two optimization techniques that reduce redundant memory accesses. To reduce duplicate column accesses, we use shuffle instructions to exchange data between threads; to reduce duplicate row accesses, once a row is loaded we use it to compute as many output elements as possible. Compared with mainstream 2D convolution libraries, our approach achieves a 2x speedup and reduces the number of memory accesses by 12.3%.
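To make the shuffle-based exchange concrete, the following is a minimal CUDA sketch of a 3-tap row convolution in which each thread issues a single global load and obtains its neighbours' values from adjacent lanes via warp shuffles. The kernel name, tap count, and halo handling are illustrative assumptions, not the dissertation's actual implementation.

```cuda
#include <cuda_runtime.h>

// Illustrative 3-tap row convolution: each thread loads ONE input element
// and obtains its left/right neighbours from adjacent lanes via warp
// shuffles, so each input value is read from global memory once per warp
// instead of three times. Assumes blockDim.x is a multiple of 32 so all
// lanes reach the shuffles together.
__global__ void row_conv3_shfl(const float* in, float* out, int n,
                               float w0, float w1, float w2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float c = (i < n) ? in[i] : 0.0f;        // single global load per thread

    unsigned mask = 0xffffffffu;             // full warp participates
    float l = __shfl_up_sync(mask, c, 1);    // value held by lane - 1
    float r = __shfl_down_sync(mask, c, 1);  // value held by lane + 1

    // Lanes at warp boundaries have no neighbour inside the warp; a real
    // kernel would handle the halo (e.g. via shared memory). Here we
    // simply zero it for brevity.
    int lane = threadIdx.x & 31;
    if (lane == 0 || i == 0)      l = 0.0f;
    if (lane == 31 || i >= n - 1) r = 0.0f;

    if (i < n) out[i] = w0 * l + w1 * c + w2 * r;
}
```

Extended to 2D, the same idea pairs with the row-reuse technique described above: once a row sits in registers, it is reused across several output rows before the next load is issued.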
Third, to utilize GPU computing resources efficiently, this thesis takes Depthwise Separable (DS) convolution as a research object to study methods for improving GPU utilization in regular parallel applications. Existing methods use a fixed block size for DS convolution, which cannot saturate the GPU when the amount of computation is low. To overcome this problem, we design a dynamic blocking scheme that partitions the output according to the input data, the available computing resources, and the computational intensity. Meanwhile, we design a channel distribution method to improve the computational intensity of each thread. Compared with mainstream DS convolution libraries, our approach improves GPU utilization by 50% and achieves a 2x speedup.

Fourth, building on the optimization methods for regular parallel applications, this thesis takes subgraph matching as a research object to study memory and computing resource utilization optimizations for irregular parallel applications. Existing subgraph matching algorithms write intermediate data multiple times and do not use registers efficiently. To overcome this problem, we design a new graph storage format and a parallel vertex matching method to reduce memory accesses, together with an efficient intermediate-data generation algorithm that utilizes GPU resources efficiently. Compared with state-of-the-art subgraph matching algorithms, our approach achieves an average speedup of 5x.
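For intuition about the matching step, here is a minimal CUDA sketch of one breadth-first extension over a CSR-stored graph: each thread extends one partial match (an edge u->v) by scanning v's adjacency list and appending candidate matches to an intermediate buffer. The CSR layout, kernel signature, and atomic append are illustrative assumptions; the dissertation's storage format, pruning, and intermediate-data generation algorithm are more elaborate than this sketch.

```cuda
#include <cuda_runtime.h>

// CSR adjacency: neighbours of vertex v are adj[row[v] .. row[v+1]).
// One thread per partial match (an edge u->v): extend it by one vertex,
// emitting candidate wedges u->v->w with w != u into a global buffer.
__global__ void extend_edges(const int* __restrict__ row,
                             const int* __restrict__ adj,
                             const int* __restrict__ eu,   // edge sources
                             const int* __restrict__ ev,   // edge targets
                             int num_edges,
                             int* out_u, int* out_v, int* out_w,
                             int* out_count, int out_cap) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;
    int u = eu[e], v = ev[e];
    for (int p = row[v]; p < row[v + 1]; ++p) {
        int w = adj[p];
        if (w == u) continue;                 // avoid walking back
        int slot = atomicAdd(out_count, 1);   // append intermediate match
        if (slot < out_cap) {
            out_u[slot] = u; out_v[slot] = v; out_w[slot] = w;
        }
    }
}
```

In a full pipeline, the emitted (u, v, w) tuples would be filtered against the query pattern and fed into the next extension step; the optimization target described above is precisely to reduce how often such intermediate tuples travel through global memory.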
Keywords/Search Tags:heterogeneous platform, GPU, parallel applications, performance model, performance optimization, convolution, subgraph matching