
Research On High Performance Parallel Algorithms Based On GPU

Posted on: 2011-12-29    Degree: Doctor    Type: Dissertation
Country: China    Candidate: H T Bai    Full Text: PDF
GTID: 1118360305953674    Subject: Computer software and theory
Abstract/Summary:
The rapid advance of the Graphics Processing Unit (GPU) has not only driven related applied technologies such as image processing, virtual reality, and computer simulation, but has also provided an operating platform for general-purpose computing beyond graphics.

In the early stage, general-purpose computing on GPU depended on graphics interfaces such as OpenGL and Direct3D, whose underlying architecture and functions gave only weak support to general-purpose computing; this restricted the application of GPU in that field. At the end of 2006, however, NVIDIA, a well-known graphics and visual computing company, released CUDA (Compute Unified Device Architecture), which enabled the GPU to exploit its advantage in parallel computing. CUDA unifies the traditionally separate vertex and pixel rendering pipelines, avoiding processor load imbalance and wasted resources, and the GPU's processors switch from a vector to a scalar pattern. Moreover, because CUDA provides C-like language interfaces that need not be mapped onto a graphics API, computing work can be mapped efficiently to the hardware architecture. In addition, CUDA is specially designed with a multi-level hierarchical memory model to support general-purpose computing. General-purpose computing on GPU has therefore become a very active research topic in high-performance computing. However, which problems are suitable for GPU computation? How should algorithms be designed, implemented, and optimized for the GPU's special architecture? How much performance improvement can GPU-based parallel algorithms achieve over serial algorithms? All these questions urgently need answers from science and engineering research.

There are three main directions in research on general-purpose computing based on GPU.
They are: (1) GPU architecture; (2) general-purpose computing algorithms based on GPU; and (3) software and development environments for general-purpose computing based on GPU. In this paper we select a typical non-compute-intensive numerical algorithm, a data-intensive algorithm, and a meta-heuristic optimization algorithm to study GPU-based large-scale parallel algorithms.

Non-compute-intensive numerical algorithms usually suffer from the "memory wall" problem, which becomes more serious on the GPU because the GPU is inefficient at memory-bound algorithms. Sparse matrix-vector multiplication is representative of this class. We therefore design a GPU-based sparse matrix-vector multiplication algorithm, CUDA-SPMV, which represents the sparse matrix in compressed sparse row (CSR) storage. The optimization strategies adopted in this paper are as follows: (1) each element of the result vector is computed by a half-warp, exploiting the implicit synchronization of threads within a warp; (2) data are aligned and padded to realize coalesced access; (3) the input vector is placed in texture memory to achieve data reuse; (4) page-locked memory is used to accelerate data transfer; (5) shared memory is employed to speed up data access. Experiments show that (1) the algorithm is more than 3 times faster than its serial CPU version, and (2) its GPU resource utilization reaches the theoretical maximum; it delivers higher sustained floating-point performance and larger memory bandwidth than the GPU-based CUDPP and CSR-vector SpMV kernels.

Data-intensive algorithms are well suited to GPU-based parallel computing, owing to how naturally their data can be partitioned.
This paper studies parallel methods and performance-optimization strategies for data-intensive algorithms on GPU, taking frequent pattern mining and k-nearest neighbor search as examples. (1) First, a GPU-based parallel frequent pattern mining algorithm, GFPMA, is proposed. It runs the breadth-first search on the CPU and partitions both candidate frequent itemsets and transaction itemsets on the GPU so as to perform parallel support counting. Experiments on the standard FIMI benchmark indicate that GFPMA performs prominently on dense, long-pattern itemset mining: it achieves up to two orders of magnitude improvement over the serial CPU version, up to one order of magnitude over a trie-tree version, and runs more than ten times faster than the state-of-the-art FIMI01. (2) Second, a GPU-based parallel spatial k-nearest neighbor search (KNN) algorithm, GKDAKNN, is studied, addressing the inefficiency of traditional KNN. The new algorithm partitions space with a KD-tree on the CPU and derives a pruning threshold, while ABT pruning and the distance computations between spatial data objects are carried out in parallel on the GPU. Experiments on synthetic data show that GKDAKNN performs prominently on spatial data of dimension under 64: it achieves up to one order of magnitude improvement over the serial CPU "brute force" version and, depending on the data, runs nearly twice as fast as the newest state-of-the-art GPU "brute force" parallel version.

Meta-heuristic optimization algorithms are among the most efficient approaches to large-scale optimization, and their search of the solution space is intrinsically parallel. Taking the ant colony algorithm as a representative, this paper develops GPU-based parallel methods for the ant colony optimization (ACO) algorithm and several of its improved versions.
(1) A tentative ACO algorithm, GFLACO, is proposed to address the insufficient intelligence of individual ants in the colony. In this algorithm an individual ant can foresee its future routes, which prevents it from entering low-quality routes; by parallelizing the ants' tour construction with a fine-grained GPU strategy, the iterative process is accelerated. (2) Since ant colony parameters are sensitive, the coevolution of several colonies is studied, and a GPU-based multi-colony algorithm integrating coarse and fine granularity, GMMAS, is put forward. The algorithm simulates many sub-colonies with different parameters within one GPU execution model; the sub-colonies, and the ants within them, evolve independently, enhancing the algorithm's ability to find better solutions. (3) A new parallel multi-ant-colony algorithm based on a shared pheromone matrix, GMACs, is proposed. Taking an integration of MMAS and ACS as its example, it provides a new scheme for pheromone initialization and dynamic bounding, and it is proved theoretically that the algorithm converges both in value and in solution. Results on TSPLIB show that the algorithm runs several to dozens of times faster than the original algorithm and, under full convergence, achieves higher-quality solutions.

The heterogeneous multi-core architecture consisting of CPU and GPU is currently the platform for GPU-based large-scale general-purpose parallel computing. The ultimate goal of parallelizing algorithms on GPU is to accomplish the given computing tasks in the shortest time within the allowable error range. GPU-based general-purpose computing algorithms need to be pre-evaluated in terms of data accuracy, latency, and computational volume.
In addition, in their design and optimization, computing methods such as task and data partitioning and thread mapping, together with optimization strategies such as memory-access optimization, communication optimization, and instruction optimization, should be adopted; various factors such as the GPU architecture must be taken into consideration, and with trial and error the desired performance can be achieved.

Currently, general-purpose computing based on GPU is still not mature. By studying GPU-based parallel methods for several traditional algorithms, this paper summarizes a common model for general-purpose computing on the heterogeneous multi-core CPU-GPU architecture and proposes a series of optimization strategies, so as to provide a reference for porting other algorithms to the GPU architecture and to give theoretical and methodological support to the popularization of GPU-based high-performance computing (HPC).
Keywords/Search Tags: Graphics Processing Unit (GPU), Compute Unified Device Architecture (CUDA), sparse matrix-vector multiplication, frequent pattern mining, k-nearest neighbor search, ant colony optimization