Research On Efficient And Large-scale CPU And GPU Heterogeneous Parallel Computing For CFD Applications

Posted on: 2015-08-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Cao
GTID: 1228330479979520
Subject: Computer Science and Technology
Abstract/Summary:
Computational Fluid Dynamics (CFD), widely applied in aerospace and other fields, seeks to discover flow phenomena and laws by numerically solving the governing equations of fluid flow. As the geometries and physical models used in numerical simulation become increasingly complex and the flow mechanisms more sophisticated, the computational complexity and scale of CFD applications are growing at an unprecedented rate. There is therefore an urgent need to parallelize CFD applications and run them efficiently on supercomputers.

Owing to rapid increases in performance and significant improvements in programmability, Graphics Processing Units (GPUs) are used more and more in supercomputers (i.e., CPU+GPU systems). As a result, supercomputer nodes have become heterogeneous, which requires programmers to combine multiple programming models and makes parallelizing and optimizing CFD and similar applications a great challenge.

In this thesis, we investigate the key techniques for large-scale CFD applications on CPU/GPU heterogeneous architectures. In particular, we focus on new cooperative programming frameworks, improved parallel algorithms, performance optimization, and load balancing for CFD applications. Our work and main contributions are as follows:

(1) Based on the characteristics of CFD applications, we propose a three-level heterogeneous collaborative programming framework (TLCF) designed for large CPU/GPU heterogeneous systems. By combining the MPI, OpenMP and CUDA programming models, we give three instances of the TLCF framework: NOMP-TLCF, OMPAE-TLCF and MPIAE-TLCF. We find that the NOMP-TLCF framework is more suitable for CFD application development on large-scale heterogeneous parallel systems.

(2) We investigate Lattice Boltzmann equation solvers (LBM) on heterogeneous parallel systems and propose a hybrid solution. In the Lattice Boltzmann method, grid cells are constructed for collision, migration and boundary conditions. Starting from the conventional AS algorithm, we first propose an AD algorithm that uses a GPU-friendly memory access pattern. We then compare three solutions: the basic LBM-base parallel solution (using CPUs only), the LBM-overlap parallel solution (overlapping communication with computation), and the LBM-hybrid parallel solution (using CPUs and GPUs simultaneously). Our theoretical analysis and experimental results show that, compared with the AS algorithm, the AD algorithm can use more thread configurations and thus achieves better performance. The application runs 17× faster on a single GPU than on a 6-core CPU. Our multi-node experiments show that, relative to a single compute node, the LBM-hybrid parallel solution attains 82.0% parallel efficiency on 128 compute nodes.

(3) As another case study, we investigate CPU/GPU parallel algorithms for Navier-Stokes equation solvers on heterogeneous parallel systems. We first propose a fine-grained GPU parallel algorithm based on grid cells, using redundant computation and kernel decomposition to eliminate the data dependencies of the inviscid solution. We then present a coarse-grained grid-block algorithm based on the NOMP-TLCF programming framework, in which streams and the asynchronous execution model are used to overlap data transfers with GPU computation. Taking into account the computing power and storage capacity of the different processors, we increase the scale of simulation possible on a single node by means of out-of-core computation. Further, the TCBO and TCBL transport policies are applied to reduce the overhead of data transfers between nodes. Numerical experiments verify the correctness of the algorithm on the heterogeneous system. Compared with dual 6-core CPUs, the GPU achieves about a 1.85× performance/price advantage. Strong and weak scalability tests demonstrate the power and efficiency of our parallel solution on multiple nodes.

(4) Taking the solution of the Navier-Stokes equations on multi-block structured grids and sparse matrix-vector multiplication as case studies, we investigate CPU/GPU load balancing strategies on heterogeneous parallel systems at both the coarse-grained and the fine-grained level. For coarse-grained load balancing, we take into account the difference in computing performance among the processing units and the effect of communication on performance. We propose a static load balancing strategy based on a performance model. Then, to eliminate several assumptions made in the performance model, we propose a dynamic load balancing strategy based on work-stealing and task-prefetching. Experimental results on Navier-Stokes solvers show that both strategies achieve better balance among the processing units. For fine-grained load balancing, we study GPU performance with different sparse matrix storage formats when computing sparse matrix-vector multiplication. We observe that when the number of non-zero elements differs greatly from row to row, the threads within a warp exhibit load imbalance. We build the ELLPACK-RP mixed storage format, based on JAD and ELLPACK-R, to balance the load among GPU threads. Experimental results on an NVIDIA GTX 280 show that, when the number of non-zero elements differs greatly from row to row, ELLPACK-RP improves performance by 40% compared with ELLPACK-R.
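
The warp-level imbalance described in contribution (4) can be illustrated with a small CPU-side sketch. It is only a model under stated assumptions, not the thesis's implementation: the function names (`to_ellpack_r`, `spmv_ellpack_r`, `warp_efficiency`), the one-thread-per-row mapping, and the cost model are all hypothetical choices made for this example. ELLPACK-R is taken here in its usual form, padded ELLPACK arrays plus a per-row length array. The abstract does not describe the layout of ELLPACK-RP, so the sketch does not attempt it.

```python
# Illustrative sketch only; names and the cost model are assumptions for this
# example and are not taken from the thesis.

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def to_ellpack_r(dense):
    """Convert a dense matrix (list of rows) to ELLPACK-R arrays.

    Returns (values, col_idx, row_len). values and col_idx are padded to the
    longest row; row_len[i] is the number of non-zeros stored in row i.
    """
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    row_len = [len(r) for r in rows]
    width = max(row_len, default=0)
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    col_idx = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return values, col_idx, row_len


def spmv_ellpack_r(values, col_idx, row_len, x):
    """Compute y = A x; row i (one 'thread') loops over row_len[i] entries only."""
    y = []
    for i, n in enumerate(row_len):
        y.append(sum(values[i][k] * x[col_idx[i][k]] for k in range(n)))
    return y


def warp_efficiency(row_len, warp_size=WARP_SIZE):
    """Useful work divided by modeled cost when one thread handles one row.

    Assumed cost model: a warp finishes only when its longest row finishes,
    so each warp costs (threads in warp) * (longest row in warp).
    """
    useful = sum(row_len)
    paid = 0
    for start in range(0, len(row_len), warp_size):
        chunk = row_len[start:start + warp_size]
        paid += len(chunk) * max(chunk)
    return useful / paid if paid else 1.0
```

Under this model, row lengths `[8, 1, 1, 8]` with a warp size of 2 give an efficiency of 18/32 ≈ 0.56, while the same rows sorted by decreasing length (the row ordering used by JAD) give 1.0. This suggests why combining a JAD-style ordering with ELLPACK-R could reduce imbalance within a warp, though how ELLPACK-RP actually achieves its balance is not specified in the abstract.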
Keywords/Search Tags: CPU/GPU heterogeneous system, CFD, Lattice Boltzmann equation, Navier-Stokes equation, Parallel cooperative programming framework, Load balancing