Research On GPU Parallel Techniques Based On Application

Posted on: 2018-04-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y G Xue
GTID: 1368330623450472
Subject: Computer Science and Technology
Abstract/Summary:
With advances in many research fields, applications in multimedia, scientific computing, and engineering simulation keep growing in scale, and their performance demands on the computing platform grow accordingly. In the past, a processor's computational power depended mainly on its clock speed, but that approach has been abandoned because of its energy cost. Multi-core and many-core technology is the current solution and has broad consensus in academia and industry. The GPU, as a representative of this new type of processor, offers high efficiency and high peak performance and has promoted the development of parallel computing. Meanwhile, large-scale heterogeneous systems such as CPU-GPU clusters are one of the development directions of high-performance computing. GPU hardware and architecture continuously introduce innovations and yield remarkable results. However, programming techniques develop relatively slowly, and parallelizing application programs remains an outstanding problem. In this thesis, we study the parallelization of specific applications on GPU accelerators and CPU-GPU clusters, including improving the parallelism of algorithms, the parallel implementation and optimization of algorithms, simple performance prediction for large-scale CPU-GPU heterogeneous systems, static task assignment, and communication optimization. The research work and innovations of this thesis are as follows:

1. Based on the image dehazing method that uses the dark channel prior, we propose a series of improvements to address its defects: poor results in some specific regions, excessive memory overhead, and high computational complexity. We also implement and optimize a parallel version of the improved dehazing algorithm on the GPU. We propose a new method to recognize regions of the picture that resemble the atmospheric light, and improve the result there by preserving their original pixel values. We introduce the guided filter to reduce the memory overhead. We also propose a
multi-level block method and a new method of selecting the atmospheric light to reduce the computational complexity. Based on the characteristics of the dehazing algorithm and the GPU platform, we optimize the basic parallel implementation. We propose a new parallel method for computing the integral image to increase the parallelism of that module, and a new method of selecting the atmospheric light to increase the parallelism of the corresponding module. We optimize the kernel organization by merging kernels to reduce kernel launch overhead, and make good use of shared memory to reduce accesses to global memory.

2. We propose a highly parallel and scalable motion estimation algorithm, named multi-level resolution motion estimation (MLRME), which combines the advantages of local full search and down-sampling. Sub-sampling a video frame saves a large amount of computation, while the local full search exposes massive parallelism and makes full use of powerful modern many-core accelerators such as GPUs and the Intel Xeon Phi. We integrated the proposed MLRME into HM12.0, and the experimental results show that the encoding quality of MLRME is close to that of the fast motion estimation in HEVC, declining by less than 1.5%. We also implemented MLRME with CUDA, obtaining a 30-60× speed-up over the serial algorithm on a single CPU. Specifically, the parallel implementation of MLRME on a GTX460 GPU meets the real-time coding requirement, at about 25 fps for the 2560×1600 video format; for 832×480, the performance exceeds 100 fps.

3. Multicore CPUs can be combined with GPUs to perform computations over 3D unstructured meshes on heterogeneous CPU-GPU clusters. We explain how to unlock the CPUs' computing power without slowing down other tasks related to data movement. By solving the representative diffusion equation using the cell-centred finite volume method, we demonstrate that combining the computing capacity of CPUs and GPUs
delivers a performance advantage over the GPU-only approach. We also propose several strategies to overcome the bottleneck caused by processing the MPI separator.
(1) Involve the GPU in computing the MPI separator. This reduces the overhead of computing the separator, so MPI communication can start as early as possible.
(2) Use multiple MPI separators to further shorten the time spent computing the separator and copying it from GPU to CPU. We divide the whole MPI separator into multiple parts, each controlled by a different CUDA stream. In this way, the computation of one part overlaps with the copying of another, which further shortens the total processing time of the MPI separator.
(3) Use multiple threads to control the MPI communication. Because the performance of unstructured-mesh computation is limited by memory bandwidth, we find that not all computation threads are needed to reach the best performance. Therefore, we can use more than one thread to handle MPI communication while still achieving the best computing performance.
(4) Copy remote data back to the GPU in a pipeline. The total time of copying remote data back to the GPU is affected by two factors: how the message is received and when copying it to the GPU begins.
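
Contribution 1 mentions a parallel integral-image method without describing it. As a rough illustration only, the sketch below uses the standard two-pass formulation (prefix sums along rows, then along columns), which is not necessarily the method of the thesis. It is written in plain Python for readability; the function names `integral_image` and `box_sum` are hypothetical. Every row in the first pass and every column in the second pass is independent, which is the property a GPU implementation would exploit.

```python
from itertools import accumulate


def integral_image(img):
    """Summed-area table: out[y][x] = sum of img[0..y][0..x] (inclusive)."""
    # Pass 1: prefix sum along each row. Rows are independent of each other.
    rows = [list(accumulate(row)) for row in img]
    # Pass 2: prefix sum along each column. Columns are independent too.
    cols = [list(accumulate(col)) for col in zip(*rows)]
    # Transpose back to row-major order.
    return [list(r) for r in zip(*cols)]


def box_sum(sat, y0, x0, y1, x1):
    """Sum of the original image over the inclusive rectangle
    (y0, x0)-(y1, x1), using at most four table lookups."""
    total = sat[y1][x1]
    if y0 > 0:
        total -= sat[y0 - 1][x1]
    if x0 > 0:
        total -= sat[y1][x0 - 1]
    if y0 > 0 and x0 > 0:
        total += sat[y0 - 1][x0 - 1]
    return total
```

Once the table is built, any window sum costs a constant number of lookups, which is why integral images are commonly paired with box-filter-based steps such as the guided filter.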
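
Contribution 2 combines down-sampling with local full search. The abstract does not give the details of MLRME, so the following is only a generic two-level sketch of that idea, under stated assumptions: one 2× down-sampling level, a sum-of-absolute-differences (SAD) cost, and a ±1 refinement at full resolution. All function names and default parameters are hypothetical.

```python
def downsample(frame):
    """Halve the resolution by averaging each 2x2 block (integer mean)."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def sad(cur, ref, by, bx, dy, dx, n):
    """SAD between the n x n block at (by, bx) in cur and the block
    displaced by (dy, dx) in ref."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, by, bx, n, center, radius):
    """Exhaustive search over a (2*radius+1)^2 window around `center`.
    Each candidate is evaluated independently of the others, which is
    what makes full search easy to spread over many GPU threads."""
    best = None
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            y, x = by + dy, bx + dx
            if not (0 <= y <= len(ref) - n and 0 <= x <= len(ref[0]) - n):
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, by, bx, dy, dx, n)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best  # (cost, dy, dx)


def two_level_me(cur, ref, by, bx, n=4, radius=2):
    """Coarse full search at half resolution, then a +/-1 refinement at
    full resolution. Assumes by, bx and n are even."""
    coarse = full_search(downsample(cur), downsample(ref),
                         by // 2, bx // 2, n // 2, (0, 0), radius)
    center = (2 * coarse[1], 2 * coarse[2])
    return full_search(cur, ref, by, bx, n, center, 1)
```

The coarse search covers an effective window of ±2·`radius` full-resolution pixels while evaluating only (2·`radius`+1)² candidates on blocks with a quarter of the pixels, which is where the computational saving of sub-sampling comes from.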
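
Strategy (2) relies on overlapping the computation of one separator part with the GPU-to-CPU copy of another. A back-of-the-envelope model makes the benefit concrete. The sketch below assumes equal-sized parts, a device that can run one kernel and one copy concurrently, and an optional fixed per-part overhead; the function name `pipelined_time` and all numbers are illustrative assumptions, not measurements from the thesis.

```python
def pipelined_time(compute, copy, parts, overhead=0.0):
    """Estimated time to compute and copy out a separator split into
    `parts` equal pieces, in arbitrary time units.

    compute, copy: total compute and copy time for the whole separator.
    overhead: assumed fixed cost added to each per-part stage.
    """
    c = compute / parts + overhead   # compute time per part
    d = copy / parts + overhead      # copy time per part
    compute_done = 0.0
    copy_done = 0.0
    for _ in range(parts):
        # Kernels for successive parts run back to back.
        compute_done += c
        # A copy starts once its part is computed and the previous copy ended.
        copy_done = max(copy_done, compute_done) + d
    return copy_done
```

With `compute=8` and `copy=4`, one part takes 12 units, while four parts take 9 units, because only the last copy is left exposed. A nonzero `overhead` shows why splitting into too many parts eventually stops paying off.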
Keywords/Search Tags: GPU, Parallel Programming Optimization, High-Performance Implementation