
Automatic Analysis And Optimization Of Performance Bottleneck Of GPU Programs Based On Profiling Information

Posted on: 2022-04-15    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Qin    Full Text: PDF
GTID: 2518306314974119    Subject: Software engineering
Abstract/Summary:
In recent years, GPUs have attracted extensive attention in academia and industry thanks to their extremely high data-processing speed. To help applications make better use of GPU resources, NVIDIA provides highly encapsulated libraries that cover common functions in many fields. However, because these libraries are closed source, users cannot modify them to meet the requirements of more complex tasks. Consequently, when developers write GPU programs themselves, how to make full use of the GPU hardware resources remains a key research topic. Developers who want to write high-quality CUDA kernels therefore need not only an understanding of the GPU architecture, instruction set, and other hardware fundamentals, but also an analysis of kernel performance based on the program's profiling information, in order to locate potential performance bottlenecks and the remaining optimization space of the kernel.

This thesis uses Nsight Compute to collect the profiling information of CUDA kernels. Performance metrics such as pipe utilization and warp state are combined with the source code and the SASS assembly code to analyze the possible performance bottlenecks of a CUDA kernel and to guide its optimization. On the Turing and Ampere architectures, the thesis examines how CUDA kernels from different domains occupy the various resources of an SM, designs automated algorithms that identify different types of performance bottlenecks from the profiling information, and exposes wasted computing resources as well as overly conservative compilation in CUDA kernels. For these bottlenecks, an optimization scheme is proposed that replaces computation instructions of a single type with mixed-precision computation instructions and manually eliminates the compilation problems. By deleting unnecessary instructions, increasing latency hiding between instructions, and offloading part of the work from one instruction pipe to others, stalls in the bottleneck pipe are reduced, and pipe utilization is balanced or maximized as far as possible, thereby improving the performance of the GPU program.

The experiments mainly use programs from the Rodinia benchmark suite for verification. The results show that the optimization achieves maximum speedups of 5.06x on Turing and 5.79x on Ampere, with average speedups of 1.78x and 1.96x respectively, which verifies the effectiveness and feasibility of the automatic performance bottleneck detection and optimization scheme proposed in this thesis.
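As a minimal illustration of the mixed-precision rebalancing described above (the kernel names, argument layout, and the split of work between the two arrays are assumptions made for this sketch, not code from the thesis), the following CUDA fragment moves part of an element-wise multiply-add from the FP32 FMA pipe to the packed FP16 pipe via __hfma2:

    #include <cuda_fp16.h>

    // Baseline: all of the multiply-add work is issued on the FP32 (FMA) pipe.
    __global__ void axpy_fp32(const float *x, float *y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Mixed precision: part of the data is kept as packed half2 and processed with
    // __hfma2, so the FP16 pipe shares load that would otherwise all go to the
    // FP32 pipe. How the data is split between the two arrays is an illustrative
    // assumption; in practice it would be chosen from the profiling results.
    __global__ void axpy_mixed(const float *x, float *y, int n_fp32,
                               const __half2 *xh, __half2 *yh, int n_half2,
                               float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        __half2 a2 = __float2half2_rn(a);
        if (i < n_fp32)  y[i]  = a * x[i] + y[i];            // FP32 FMA pipe
        if (i < n_half2) yh[i] = __hfma2(a2, xh[i], yh[i]);  // packed FP16 pipe
    }

In a real kernel, the decision of which operations can tolerate FP16 precision, and how much work to move, would be driven by the pipe-utilization and warp-stall metrics reported by Nsight Compute (for example, from a report collected with ncu --set full).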
Keywords/Search Tags:GPU, performance analysis, pipe utilization, bottleneck