
Automatic Analysis And Optimization Of Performance Bottleneck Of GPU Programs Based On Profiling Information

Posted on: 2022-04-15    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Qin    Full Text: PDF
GTID: 2518306314974119    Subject: Software engineering
Abstract/Summary:
In recent years, GPUs have attracted extensive attention in academia and industry thanks to their extremely high data-processing speed. To help applications make better use of GPU resources, NVIDIA provides highly encapsulated libraries that cover common functions in many fields. However, because these libraries are closed source, users cannot modify them to meet the requirements of more complex tasks. Consequently, when developers write GPU programs themselves, how to make full use of the GPU hardware resources remains a key research topic. Developers who want to write high-quality CUDA kernels therefore need not only an understanding of the GPU architecture, instruction set, and other hardware fundamentals, but also an analysis of kernel performance based on the program's profiling information, in order to locate potential performance bottlenecks and the remaining optimization space of the kernel.

This thesis uses Nsight Compute to collect the profiling information of CUDA kernels. Performance metrics such as pipe utilization and warp state are combined with the source code and the SASS assembly code to analyze the possible performance bottlenecks of a CUDA kernel and to guide its optimization. On the Turing and Ampere architectures, the thesis examines how CUDA kernels from different domains occupy the various resources of an SM, designs automated algorithms that identify different types of performance bottlenecks from the profiling information, and exposes wasted computing resources as well as overly conservative compilation in CUDA kernels. For these bottlenecks, an optimization scheme is proposed that replaces computation instructions of a single type with mixed-precision computation instructions and manually eliminates the compilation problems. By deleting unnecessary instructions, increasing latency hiding between instructions, and offloading part of the work from one instruction pipe to others, stalls in the bottleneck pipe are reduced, and pipe utilization is balanced or maximized as far as possible, thereby improving the performance of the GPU program.

The experiments mainly use programs from the Rodinia benchmark suite for verification. The results show that the optimization achieves maximum speedups of 5.06x on Turing and 5.79x on Ampere, with average speedups of 1.78x and 1.96x respectively, which verifies the effectiveness and feasibility of the automatic performance bottleneck detection and optimization scheme proposed in this thesis.
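As a minimal illustration of the mixed-precision rebalancing described above (the kernel names, argument layout, and the split of work between the two arrays are assumptions made for this sketch, not code from the thesis), the following CUDA fragment moves part of an element-wise multiply-add from the FP32 FMA pipe to the packed FP16 pipe via __hfma2:

    #include <cuda_fp16.h>

    // Baseline: all of the multiply-add work is issued on the FP32 (FMA) pipe.
    __global__ void axpy_fp32(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Mixed precision: part of the data is kept as packed half2 and processed with
    // __hfma2, so the FP16 pipe shares load that would otherwise all go to the
    // FP32 pipe. How the data is split between the two arrays is an illustrative
    // assumption; in practice it would be chosen from the profiling results.
    __global__ void axpy_mixed(const float *x, float *y, int n_fp32,
                               const __half2 *xh, __half2 *yh, int n_half2,
                               float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        __half2 a2 = __float2half2_rn(a);
        if (i < n_fp32)  y[i]  = a * x[i] + y[i];            // FP32 FMA pipe
        if (i < n_half2) yh[i] = __hfma2(a2, xh[i], yh[i]);  // packed FP16 pipe
    }

In a real kernel, the decision of which operations can tolerate FP16 precision, and how much work to move, would be driven by the pipe-utilization and warp-stall metrics reported by Nsight Compute (for example, from a report collected with ncu --set full).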
Keywords/Search Tags:GPU, performance analysis, pipe utilization, bottleneck