Study On Acceleration Technology Of Image Processing Targeting Embedded GPU Platform

Posted on:2021-09-22

Degree:Master

Type:Thesis

Country:China

Candidate:B L Chen

Full Text:PDF

GTID:2518306503474224

Subject:IC Engineering

Abstract/Summary:

With the development of information technology, the computational complexity of image processing algorithms continues to increase. At the same time, the massive volume of information to be processed places growing demands on the computing capability of real-time processing systems. The graphics processing unit (GPU) offers strong parallel processing capability and high throughput, and it has received widespread attention in image processing systems. This thesis focuses on two representative image processing algorithms, two-dimensional convolution and the two-dimensional fast Fourier transform (FFT). It studies parallel acceleration techniques for embedded GPUs and develops related image processing application systems.

First, based on the two-dimensional convolution and two-dimensional FFT algorithms, the performance bottlenecks of programs on the GPU are analyzed and corresponding performance models are established. The models are used to analyze the causes of performance bottlenecks in instruction pipeline execution, global memory access, and shared memory access, to predict the current performance of an algorithm, and to estimate how much room for optimization remains. Experiments show that the error between the model's predictions and the actual execution results is between 5% and 18%.

Second, based on this model, the memory access bottleneck of the two-dimensional convolution algorithm on the GPU is analyzed, and a global memory access technique based on a rotating strategy is proposed, which achieves a global memory bandwidth utilization close to 100%. On this basis, a parallel implementation of the two-dimensional convolution algorithm is developed. Experimental results show that the performance of our convolution program is 9 to 14 times higher than that of library functions such as NPP and CUFFT when the convolution kernel size is 7×7 to 11×11.

Next, the parallel execution characteristics of the choice of decomposition method and of reverse ordering in the one-dimensional FFT are studied, and a shared memory access mechanism based on the access span of the butterfly operator is proposed, achieving 100% bandwidth utilization. On this basis, matrix transposition and batch column processing mechanisms are proposed to solve the problem of non-contiguous memory access during the column transform of the two-dimensional FFT. Experimental results show that the performance of our FFT program is 5% to 13% higher than that of the CUFFT library function for image sizes from 1024×1024 to 4096×4096 pixels.

Finally, based on the above results, we have developed two practical applications on the NVIDIA Jetson TX2 embedded platform: image front-end processing and Fourier ptychographic microscopy (FPM). Their processing speeds reach 4K@60FPS and 4MP@34FPS, respectively, which meets the real-time requirements of the systems, and the work has important reference value.
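The row-column structure of the two-dimensional FFT described above can be illustrated with a small CPU-side sketch. This is not the thesis's GPU implementation, whose kernels are not given in the abstract. It is a minimal pure-Python illustration, assuming that "reverse ordering" refers to the bit-reversal permutation of a radix-2 FFT and that the column transform is turned into a row transform by transposing the intermediate matrix. All function names (`bit_reverse_permute`, `fft_1d`, `transpose`, `fft_2d`) are hypothetical.

```python
import cmath


def bit_reverse_permute(values):
    """Return a copy of values reordered by bit-reversed index.

    Assumes len(values) is a power of two.
    """
    n = len(values)
    bits = n.bit_length() - 1
    out = list(values)
    for i in range(n):
        # Reverse the lowest `bits` bits of i.
        j, tmp = 0, i
        for _ in range(bits):
            j = (j << 1) | (tmp & 1)
            tmp >>= 1
        if i < j:  # swap each pair only once
            out[i], out[j] = out[j], out[i]
    return out


def fft_1d(values):
    """Iterative radix-2 decimation-in-time FFT of a power-of-two sequence."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(v) for v in values])
    span = 1  # distance between the two inputs of each butterfly
    while span < n:
        step = cmath.exp(-1j * cmath.pi / span)  # twiddle increment for this stage
        for start in range(0, n, 2 * span):
            w = 1 + 0j
            for k in range(span):
                a = data[start + k]
                b = data[start + k + span] * w
                data[start + k] = a + b
                data[start + k + span] = a - b
                w *= step
        span *= 2  # the access span doubles at every stage
    return data


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def fft_2d(image):
    """2D FFT as row FFTs, transpose, row FFTs again, transpose back."""
    rows = [fft_1d(row) for row in image]
    # After the transpose, each original column is a contiguous row.
    cols = [fft_1d(col) for col in transpose(rows)]
    return transpose(cols)
```

For example, `fft_2d([[1, 0], [0, 0]])` transforms a 2×2 unit impulse. The `span` variable makes visible why the memory access pattern changes at every butterfly stage, and the two `transpose` calls show how a strided column pass can be replaced by contiguous row passes, at the cost of the transposes themselves.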

Keywords/Search Tags:

image processing, GPU, parallel program development, performance model, memory access mechanism
