Exploring novel many-core architectures for scientific computing

Posted on:2011-02-23

Degree:Ph.D

Type:Dissertation

University:University of Delaware

Candidate:Chen, Long

GTID:1448390002467443

Subject:Engineering

Abstract/Summary:

The rapid revolution in microprocessor chip architecture brought about by many-core technology presents unprecedented challenges to application developers and system software designers alike: how can the computational potential of such many-core architectures best be exploited?

This dissertation studies programming issues for many-core architectures and makes contributions in two main areas.

Optimizing the Fast Fourier Transform for IBM Cyclops-64. To understand the issues in designing and developing high-performance algorithms for many-core architectures, we use the fast Fourier transform (FFT) as a case study on the IBM Cyclops-64 many-core chip architecture. We analyze the optimization challenges and opportunities for FFT problems, identify domain-specific features of the target problems, and match them to key many-core architecture features. We quantify the impact of various optimization techniques and the effectiveness of the target architecture. The resulting FFT implementations achieve excellent results in terms of both speedup and absolute performance. To assist algorithm design and performance analysis, we present a model that estimates the performance of parallel FFT algorithms on an abstract many-core architecture. This abstract architecture captures generic features and parameters of several real many-core architectures; therefore, the performance model is applicable to any architecture with similar features. We derive the performance model from cost functions for the three main components of an execution: memory accesses, computation, and synchronization.
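As a rough illustration of how the three cost components named above might be combined, the following Python sketch estimates the run time of a radix-2 FFT. It is not the dissertation's model: the additive form (no overlap between components), the per-stage memory traffic, the one-barrier-per-stage rule, and every name and parameter (`Machine`, `fft_time_estimate`, the bandwidth and barrier figures) are assumptions made for this example. The only standard fact used is that an N-point radix-2 FFT performs (N/2)·log2(N) butterflies of about 10 real floating-point operations each.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical abstract many-core machine; all fields are illustrative."""
    threads: int                     # number of hardware threads used
    flops_per_sec_per_thread: float  # sustained floating-point rate per thread
    mem_bytes_per_sec: float         # aggregate memory bandwidth
    barrier_sec: float               # cost of one global barrier


def fft_time_estimate(n: int, m: Machine) -> dict:
    """Estimate radix-2 FFT time as memory + computation + synchronization."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = n.bit_length() - 1          # log2(n) butterfly stages
    butterflies = (n // 2) * stages

    # Computation: about 10 real flops per butterfly, spread evenly over threads.
    t_comp = butterflies * 10 / (m.threads * m.flops_per_sec_per_thread)

    # Memory (assumption): every stage reads and writes all n complex doubles
    # (16 bytes each), with no reuse in registers or scratchpad.
    t_mem = stages * 2 * n * 16 / m.mem_bytes_per_sec

    # Synchronization (assumption): one barrier after each stage.
    t_sync = stages * m.barrier_sec

    return {
        "memory": t_mem,
        "computation": t_comp,
        "synchronization": t_sync,
        "total": t_mem + t_comp + t_sync,
    }


if __name__ == "__main__":
    machine = Machine(threads=4, flops_per_sec_per_thread=1e9,
                      mem_bytes_per_sec=1e10, barrier_sec=1e-6)
    print(fft_time_estimate(1024, machine))
```

Separating the three terms makes it easy to see which component dominates for a given problem size and machine, which is the kind of trend such a model is meant to expose.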
The experimental results demonstrate that our model predicts the performance trend accurately and can therefore provide valuable insights for designing and tuning FFT algorithms on many-core architectures.

Exploring Fine-grained Task-based Execution on Graphics Processing Unit-enabled Systems. The use of many-core graphics processing units (GPUs) is gaining popularity in scientific computing. However, conventional data-parallel GPU programming paradigms, e.g., NVIDIA CUDA, cannot satisfactorily address certain issues, such as load balancing, GPU resource utilization, and overlapping fine-grained computation with communication. The problem is exacerbated when trying to exploit multiple GPUs concurrently, a configuration commonly available in modern systems. Our solution is a fine-grained task-based execution framework that allows concurrent execution of fine-grained tasks on GPU-enabled systems. The granularity of task execution is finer than what is currently supported in CUDA; executing a task requires only a subset of the GPU hardware resources. Our framework provides means for solving the above issues and for efficiently utilizing the computational power of the GPUs. We evaluate our approach using both micro-benchmarks and a molecular dynamics (MD) application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task-based solution utilizes the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, good dynamic load balance, and significant performance improvement over other techniques based on standard CUDA APIs.

Keywords/Search Tags:

Many-core, Architecture, Performance, CUDA, FFT, GPU

Related items

1	Research Of Several Technologies Under The Many-core System For Algorithm Optimization
2	Research On Optimized Programming For Heterogeneous Multi-core Platform
3	Optimizing for a many-core architecture without compromising ease-of-programming
4	CUDA Architecture-based High-performance Image Processing Program Design
5	Research And Implementation Of Transplant CUDA Program Based On Android
6	Research On Architecture Of Multi-core Processor For High-Density Computing
7	Research And Implementation Of The Smoothed Particle Hydrodynamics Algorithm Based On Multi-core Architecture
8	CNA: A Performance Optimization System For Multi-core NUMA Architecture In Virtualized Environment
9	The Research On Linux Scheduling Mechanism Under The Multi-Core Architecture
10	Debugging Of Multi-core Architecture Performance