Font Size: a A A

Efficient mapping of fast Fourier transform on the Cyclops-64 multithreaded architecture

Posted on:2008-10-04Degree:M.SType:Thesis
University:University of DelawareCandidate:Xue, LipingFull Text:PDF
GTID:2448390005974183Subject:Engineering
Abstract/Summary:
The emerging multi-core architectures unveil opportunities of massive on-chip parallelism through hardware support, but also present great challenges to application developers and system software designers. In this paper, we report our experience of optimizing the Fast Fourier Transform (FFT) on IBM Cyclops-64(C64) architecture, a novel multi-core architecture consisting of 160 threads, explicit memory-hierarchy and interconnection network to provide massive on-chip parallelism. C64 does not have data cache and thus cannot take advantage of cache-oblivious algorithm. In addition, current implementation of cache-oblivious method is entirely based on cache memory hierarchy that does not lend itself to the construction of an accurate performance model - which is critical in the performance optimization of data movement through a Cyclops-64 style explicit memory hierarchy. Therefore, to make a cache-oblivious FFT working efficiently on C64 is probably non-trivial.; This thesis takes a different path. The main contribution of this thesis includes: (1) an iterative search approach has been proposed and implemented for the C64 architecture taking advantage of the explicit memory hierarchy. Our approach fully exploits the opportunity provided by explicitly-addressable on-chip memory hierarchy (without caches) and constructs an accurate/deterministic performance model analytically. This model is used to rapidly calculate the performance of different "FFT computation plans" iteratively. Such performance numbers will be productively used by our search based optimization procedure; (2) a new technique for optimizing the scratch-pad memory space utilization has been proposed that can judiciously explore live-range splitting methods and a significant performance gain has been achieved as evidenced by our experiments; (3) an implementation of the proposed methods has been implemented on the C64 software and simulation toolchain, and a detailed scalability study of FFT on C64 architecture has been conducted. Furthermore, an in-depth analysis has been provided to illustrate the choice of optimal (vs. maximum) number of threads under each situation based on tradeoffs between the computation power and the synchronization overhead. The experimental results have demonstrated up to 25.5% speedup over a best known non-search based method.
Keywords/Search Tags:Architecture, C64, Memory hierarchy, Cyclops-64, FFT
Related items