Performance Evaluations And Applications On GPU Systems

Posted on: 2018-11-22  Degree: Doctor  Type: Dissertation
Country: China  Candidate: M Q Fang  Full Text: PDF
GTID: 1368330569498493  Subject: Computer Science and Technology
Abstract/Summary:
Due to limits on power consumption and heat dissipation, it is no longer practical to further improve chip performance by raising the clock frequency, and multi-core and many-core designs have become the mainstream approach to building new processors. Thanks to their high performance, low power consumption and favorable price-performance ratio, many-core processors play an outstanding role in high-performance computing. At the same time, it is difficult to port heterogeneous programs, to optimize them for the architecture, to use both the host processors and the co-processors, and to understand the underlying mathematical principles and mechanisms.

This dissertation focuses on performance optimization for many-core architectures. Taking the GPU as an example, we address the challenges of performance optimization on many-core architectures and in heterogeneous systems. We design microbenchmarks to evaluate the components of the GPU memory system and the interconnect, derive optimization strategies from the measurements, and apply them to two real-world applications, hyperspectral image dimensionality reduction and sonar signal beamforming, to get the maximum performance. Our major contributions are as follows:

(1) To address the challenge of optimizing GPU memory, we propose a microbenchmark-based GPU memory optimization framework and apply it to two real-world applications, obtaining excellent performance. In contrast to benchmarking thread-level latencies on the different memory hierarchies, we propose a warp-level benchmarking method and design two experiments on parallel memory access. These warp-level measurements are performed on shared memory, constant memory, global memory and texture memory. We further explore the strategies of replacing local memory with registers, avoiding shared memory bank conflicts, and improving global memory bandwidth with different data types. By summarizing the optimization guidelines for the different memory hierarchies, we construct a GPU memory optimization framework and apply it to hyperspectral image dimensionality reduction and sonar signal beamforming. The results show that the framework is practical and effective.

(2) To address the challenge of optimizing CPU/GPU heterogeneous systems, we study host memory, zero-copy, the overlap of computation with communication, and the overlap of computation with computation, and verify them with case studies. By measuring the host memory bandwidth, the PCI-E bandwidth, and the register and unregister overheads, we propose a piecewise model for host memory allocation, and demonstrate its use and performance effects with a case study of PCA dimensionality reduction. Moreover, we propose and verify two zero-copy optimizations, reducing global memory accesses and overlapping computation with communication, and discuss, through sample studies, the coordinated optimization techniques of overlapping computation with communication and computation with computation.

(3) Building on the optimization studies of GPU memory and heterogeneous systems, we parallelize and optimize hyperspectral image dimensionality reduction. Focusing on three typical algorithms, principal component analysis (PCA), fast independent component analysis (FastICA) and maximum noise fraction (MNF) rotation, we identify the hotspots and design parallel schemes on distributed storage, shared memory and GPU for them: the covariance matrix calculation, the PCA transformation, the ICA iteration and the filtering. We then investigate the optimizations of the different hotspots on the GPU system and their effects. Finally, we propose an efficient and portable parallel framework for hyperspectral image dimensionality reduction on multi-/many-core platforms, and implement it on CPUs, GPUs and Xeon Phis. The experimental results show that the parallel framework achieves excellent performance: Gs-PCA, Gs-FastICA and G-MNF obtain speedups of up to 119.7X, 106.6X and 86.9X, respectively. We also discuss the scalability of the parallel dimensionality reduction algorithms.

(4) Using the preceding optimization studies on GPU memory and heterogeneous systems, we accelerate the broadband sonar signal beamforming algorithms DFT-CBF and MVDR. For DFT-CBF, we focus on the hotspots of the DFT, the CBF/Lofar calculation and the band energy statistics; for MVDR, we focus on the hotspots of the DFT, the bilateral Jacobi iteration (Hermitian matrix decomposition) and the azimuth spectrum statistics. We design GPU mapping schemes, develop optimizations on the GPU system, evaluate their effects, and implement the parallel beamforming algorithms on the GPU. We discuss the speedups and real-time behavior of the parallel beamforming algorithms through experiments. Our parallel DFT-CBF algorithm on the GPU can perform beamforming for more than ten thousand elements in real time and obtains a maximum speedup of 125.3X, while the Gs-MVDR algorithm on multiple GPUs reaches a best speedup of 30.7X.
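
Contribution (1) mentions avoiding shared memory bank conflicts. As a rough illustration only, and not the dissertation's benchmark or code, the following Python sketch models how many serialized accesses one warp needs for a strided shared memory read. It assumes 32 banks of 4-byte words and 32 threads per warp (a common NVIDIA configuration); the function names `conflict_degree` and `padded_index` are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption: 32 shared memory banks
WARP_SIZE = 32   # assumption: 32 threads per warp


def conflict_degree(stride_words, warp_size=WARP_SIZE, num_banks=NUM_BANKS):
    """Return the largest number of distinct 4-byte words that any single
    bank must serve when thread t reads word index t * stride_words.
    1 means conflict-free; k means a k-way bank conflict."""
    words_per_bank = defaultdict(set)
    for t in range(warp_size):
        word = t * stride_words
        # Consecutive words map to consecutive banks, wrapping around.
        words_per_bank[word % num_banks].add(word)
    # Threads reading the same word are served by one broadcast,
    # so only distinct words per bank are counted.
    return max(len(words) for words in words_per_bank.values())


def padded_index(row, col, row_width, pad=1):
    """Linear word index into a 2D tile whose rows are padded by `pad`
    words, a common way to spread a column access across banks."""
    return row * (row_width + pad) + col


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(f"stride {stride:2d}: {conflict_degree(stride)}-way")
```

In this model, reading a column of an unpadded 32x32 tile corresponds to a stride of 32 words, so all threads hit the same bank; padding each row by one word gives a stride of 33 and removes the conflict.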
Keywords/Search Tags: GPU, performance optimization, warp-level benchmarking, host memory selection, hyperspectral image dimensionality reduction, beamforming