
Research Of FIR Filtering Parallel Algorithm Implemented In Frequency Domain Based On CUDA

Posted on: 2013-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: Z Chen
GTID: 2298330467964844
Subject: Computer system architecture
Abstract/Summary:
The rapid evolution of the graphics processing unit (GPU) has not only advanced related applications such as virtual reality, computational simulation, and image processing, but has also extended the GPU beyond graphics into general-purpose computing. A current trend is to use the Compute Unified Device Architecture (CUDA) platform from NVIDIA to implement high-performance parallel computing applications on the GPU. More and more computation-intensive applications improve their performance dramatically through efficient parallel implementations on the GPU.
The finite impulse response (FIR) filter is a fundamental building block that is widely used in digital signal processing (DSP). Improving the performance of an FIR filter requires increasing its tap length, which makes it a typical computation-intensive application. Although implementing the FIR filter in the frequency domain significantly reduces computational complexity compared with a time-domain implementation, high-order FIR filtering of streaming data in high-sample-rate systems remains a big challenge.
Based on the overlap-save method, a traditional frequency-domain FIR filtering technique, this thesis presents a highly efficient parallelized overlap-save method that exploits the new generation of GPU architecture. The method has been implemented on the NVIDIA GTX 465 using CUDA. To maximize use of the GPU global memory bandwidth, the parallelized overlap-save method adopts a new data partitioning scheme that splits the input data into chunks whose length equals twice the FFT size. This partitioning simplifies the data arrangement for both the input data and the output results, and it eliminates the kernel performance degradation caused by conditional divergence.
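For orientation, the textbook overlap-save method that the thesis builds on can be sketched in plain Python. This is a minimal illustration, not the thesis's CUDA implementation or its partitioning into chunks of twice the FFT size: the function names `fft`, `overlap_save`, and `direct_fir`, the sample signal, and the block size are all assumptions made for this example. Each block is transformed, multiplied by the filter spectrum, and inverse-transformed independently of the others, which is the property a GPU implementation can exploit.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def overlap_save(x, h, nfft):
    """Filter x with FIR taps h using overlap-save blocks of length nfft."""
    m = len(h)
    step = nfft - (m - 1)              # valid output samples per block
    assert step > 0, "nfft must be at least len(h)"
    H = fft(list(h) + [0.0] * (nfft - m))
    n_blocks = -(-len(x) // step)      # ceiling division
    # m-1 leading zeros act as filter history; trailing zeros fill the last block.
    padded = [0.0] * (m - 1) + list(x) + [0.0] * (n_blocks * step - len(x))
    y = []
    for b in range(n_blocks):          # blocks are independent of each other
        block = padded[b * step : b * step + nfft]
        X = fft(block)
        Y = [a * c for a, c in zip(X, H)]
        yb = fft(Y, inverse=True)
        # The first m-1 samples are corrupted by circular wrap-around: discard them.
        y.extend(v.real / nfft for v in yb[m - 1:])
    return y[:len(x)]

def direct_fir(x, h):
    """Reference time-domain FIR filter: y[n] = sum_k h[k] * x[n-k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

# Hypothetical sample data for illustration only.
signal = [float(i) for i in range(1, 11)]
taps = [0.5, 0.25, 0.25]
filtered = overlap_save(signal, taps, nfft=8)
reference = direct_fir(signal, taps)
```

With a 3-tap filter and an 8-point FFT, each block contributes 6 valid output samples, and `filtered` agrees with `reference` up to floating-point rounding.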
Meanwhile, to take advantage of memory coalescing in GPU global memory, the parallelized overlap-save method optimizes memory access within a warp so that adjacent threads access adjacent data. With these optimizations, the method is well suited to the new-generation architecture. In addition, the algorithm uses the asynchronous data transfer mechanism provided by CUDA to overlap data transfers between host memory and GPU global memory with kernel execution, so that kernel computation and data transfer run concurrently.
The experimental results show that the parallelized overlap-save method greatly improves on the performance of the overlap-save method accelerated with the open-source FFTW library on an Intel Core i7, with a speedup ratio of up to 15.4.
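The benefit of overlapping transfers with kernel execution can be illustrated with a small timing model. This is only a sketch under stated assumptions, not a measurement from the thesis: it assumes three independent engines (host-to-device copy, kernel, device-to-host copy), each processing one chunk at a time in order, and the per-chunk durations are hypothetical numbers in arbitrary time units. Real hardware may have fewer copy engines, which would reduce the overlap.

```python
def serial_time(n_chunks, h2d, kernel, d2h):
    """Total time when every copy and kernel runs strictly one after another."""
    return n_chunks * (h2d + kernel + d2h)

def overlapped_time(n_chunks, h2d, kernel, d2h):
    """Total time when copy-in, compute, and copy-out engines work concurrently.

    Assumption: each engine handles chunks in order, and a stage for a chunk
    starts only when the previous stage of that chunk has finished and the
    engine is free.
    """
    in_free = kernel_free = out_free = 0.0
    for _ in range(n_chunks):
        in_done = in_free + h2d
        in_free = in_done
        kernel_done = max(in_done, kernel_free) + kernel
        kernel_free = kernel_done
        out_done = max(kernel_done, out_free) + d2h
        out_free = out_done
    return out_free

# Hypothetical per-chunk durations (arbitrary units), for illustration only.
t_serial = serial_time(8, 1.0, 2.0, 1.0)       # 32.0
t_overlap = overlapped_time(8, 1.0, 2.0, 1.0)  # 18.0
```

In this model the kernel stage is the bottleneck, so the overlapped total approaches the number of chunks times the kernel time plus the cost of filling and draining the pipeline.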
Keywords/Search Tags:frequency-domain FIR filtering, GPU, parallel algorithm, CUDA