The Finite-Difference Time-Domain (FDTD) method, first proposed by K. S. Yee in 1966, is a simple and reliable method that is widely used for computing electromagnetic fields. In recent decades the method has developed continually and attracted more and more attention because of this simplicity and reliability. However, to obtain a correct solution, FDTD must comply with its stability conditions, so a large number of mesh cells is required to simulate electrically large objects or objects with complicated structures. As the frequency and the dimensions of the object increase, a single PC can hardly satisfy the resulting requirements for larger memory and longer run time. For these reasons, by dividing the whole FDTD computational space into many subspaces and running them on a parallel computer system, the FDTD method can be executed in parallel. In this way, its huge CPU-time and memory requirements can be sharply decreased, which makes parallel FDTD very effective for simulating electrically large objects.

The Graphics Processing Unit (GPU) offers high processing power, parallelism, and programmability, and as a result various applications associated with computer graphics have advanced greatly. The Compute Unified Device Architecture (CUDA) is a fairly recent technology from NVIDIA for programming inexpensive multi-threaded GPUs. At the heart of CUDA is the ability for most programmers to keep thousands of threads busy. The current generation of NVIDIA GPUs can effectively support a very large number of threads, and as a result they can deliver a one-to-two-orders-of-magnitude increase in application performance. These graphics processors are widely available at almost any price point. The FDTD algorithm is well suited to parallel processing across the spatial domain, so it is a good candidate for execution on a GPU (a minimal kernel sketch illustrating this is given at the end of this section). We examine how to improve the FDTD method by using the massively parallel architecture of GPGPU cards.

In this paper, aiming at these problems in the FDTD algorithm, we examine how to improve the FDTD method on a GPU through theoretical analysis and numerical simulation. In addition, optimizing the performance of CUDA applications often involves optimizing data accesses, which includes the appropriate use of the various CUDA memory spaces, such as shared memory, constant memory, and registers. Each of these memory spaces has certain performance characteristics and restrictions. Local and global memories are not cached, and their access latencies are high. Data written to shared memory within a block is accessible to all other threads within that block, but it is not accessible to threads from any other block. Shared memory with these characteristics can be implemented very efficiently in hardware, which translates into fast memory accesses.

Consider a typical CUDA template, also sketched at the end of this section:
1) Split the task into subtasks.
2) Divide the input data into chunks that fit into registers and shared memory.
3) Load each data chunk from global memory into registers and shared memory.
4) Process each data chunk with a thread block.
5) Copy the results back to global memory.
Therefore, one of the most important performance challenges facing CUDA developers is the best use of local multiprocessor memory resources; appropriate use of these memory spaces can have significant performance implications for CUDA applications.

The algorithm to be accelerated has been designed and implemented for the CUDA-compatible NVIDIA GeForce 9800 GT GPGPU, featuring 112 streaming processors and 512 MB of total device memory.
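To make the spatial-domain parallelism concrete, the following is a minimal sketch of one 1D FDTD time step in CUDA, with one thread per grid cell. The field arrays ez and hy and the lumped update coefficients are illustrative assumptions, not names taken from any particular implementation.

// Hypothetical 1D FDTD update kernels: one thread per spatial cell.
// ez, hy, and the lumped coefficients are illustrative names only.
__global__ void updateH(float *hy, const float *ez, float coeff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1)
        hy[i] += coeff * (ez[i + 1] - ez[i]);   // curl of E updates H
}

__global__ void updateE(float *ez, const float *hy, float coeff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n)
        ez[i] += coeff * (hy[i] - hy[i - 1]);   // curl of H updates E
}

Each time step launches updateH and then updateE; because every cell depends only on its immediate neighbors from the previous half-step, all cells can be updated independently, which is exactly the fine-grained parallelism a GPU exploits.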
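The second sketch applies the five-step template above to the same hypothetical ez update: each thread block stages its chunk of hy, plus a one-cell halo, in shared memory so that each value is read from high-latency global memory only once per block. BLOCK is an assumed block size, and the kernel assumes it is launched with blockDim.x equal to BLOCK.

#define BLOCK 256
// launch: updateE_shared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(ez, hy, coeff, n);

__global__ void updateE_shared(float *ez, const float *hy, float coeff, int n)
{
    __shared__ float s_hy[BLOCK + 1];            // one chunk plus a left halo cell
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Steps 2-3: the grid decomposition divides hy into chunks; each
    // thread copies one element of its block's chunk into shared memory.
    if (i < n)
        s_hy[threadIdx.x + 1] = hy[i];
    if (threadIdx.x == 0)
        s_hy[0] = (i > 0) ? hy[i - 1] : 0.0f;    // halo cell from the neighboring chunk
    __syncthreads();                             // chunk now visible to the whole block

    // Step 4: each thread processes one cell of the chunk.
    if (i > 0 && i < n)
        ez[i] += coeff * (s_hy[threadIdx.x + 1] - s_hy[threadIdx.x]);
    // Step 5: the write to ez[i] above copies the result back to global memory.
}

On hardware of the GeForce 9800 GT generation, where global memory is uncached, this staging step is what turns the slow global accesses of Step 3 into fast shared-memory reads in Step 4.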
We examine how the computation time of FDTD can be reduced by using the parallel architecture of GPU cards. The paper is organized as follows. In Section II, the theoretical background of the FDTD algorithm is given. In Section III, the current state of the art in GPGPU is presented. In Section IV, an overview of the CUDA architectural model for GPGPUs is provided. In Section V, the architectural decisions made during the design and optimization of the FDTD algorithm for CUDA are discussed. In Section VI, numerical experiments are carried out, and the results demonstrate that performing high-speed FDTD simulation on a GPU with CUDA decreases computation time significantly. It will be shown that a high level of GPU FDTD parallelization can be achieved if a fine-grained parallel computing approach is applied. Indeed, GPU-based parallel FDTD has already been widely studied.