
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments

Posted on: 2015-08-17
Degree: Ph.D. (Dissertation)
University: The Ohio State University
Candidate: Huo, Xin
GTID: 1478390017993487
Subject: Computer Science
Abstract/Summary:
Parallel computing architectures have become ubiquitous. In particular, Graphics Processing Units (GPUs), the Intel Xeon Phi co-processor (Many Integrated Core architecture), and heterogeneous coupled and decoupled CPU-GPU architectures have emerged as significant players in high-performance computing because of their performance, cost, and energy efficiency. Our overall goal is to provide parallelization strategies and runtime support for modern GPUs, the Intel Xeon Phi, and heterogeneous architectures, in order to accelerate applications with different communication patterns.

We first study parallelization strategies for generalized reductions on modern GPUs. The traditional mechanism for parallelizing generalized reductions on GPUs, referred to as the full replication method, assigns each thread an independent copy of the reduction data to avoid data races and maximize parallelism. However, it introduces significant memory and result-combination overhead, and becomes inapplicable for a large number of threads. To reduce memory overhead, we propose a locking method, which allows all threads in a block to share the same data copy and supports both fine-grained and coarse-grained locking. The locking method succeeds in reducing memory overhead, but introduces contention among threads.

Next, we investigate strategies for irregular or unstructured applications. A key challenge is how to utilize the limited-size shared memory on GPUs. We present a partitioning-based locking scheme that chooses an appropriate partitioning of the data space so that each partition is guaranteed to fit into shared memory. Moreover, we propose an efficient partitioning method and a data reordering mechanism.

Further, to better utilize the computing capability of both the host and the accelerator, we extend our irregular parallelization scheme to heterogeneous CPU-GPU environments by introducing a multi-level partitioning scheme and a dynamic task scheduling framework, which supports pipelining between partitioning and computation, as well as work stealing between the CPU and the GPU. Then, motivated by the emergence of integrated CPU-GPU architectures with physically shared memory, we develop a thread-block-level scheduling framework for three communication patterns: generalized reductions, irregular reductions, and stencil computations. This novel scheduling framework provides lock-free access shared between the CPU and GPU, reduces command-launch and synchronization overhead by removing extra synchronizations, and improves load balance.

Although support for unstructured control flow is available on GPUs, the performance of recursive applications on modern GPUs is limited by the current thread reconvergence method in the SIMT execution model. To port recursive applications to GPUs efficiently, we present a novel dynamic thread reconvergence method, which allows threads to reconverge either before or after the static reconvergence point and significantly improves instructions per cycle (IPC).

The Intel Xeon Phi is an emerging x86-based many-core coprocessor architecture with wide SIMD vectors. To fully utilize its potential, however, applications must be vectorized to exploit the wide SIMD lanes, in addition to achieving effective large-scale shared-memory parallelism. Compared with the SIMT execution model on GPUs under CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions. We consider the problem of accelerating applications with different communication patterns on the Xeon Phi, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared-memory and SIMD parallelization and demonstrate its implementation. We use overloaded functions as the mechanism for providing SIMD code, assisted by runtime data reorganization and by our methods for effectively managing control flow. (Abstract shortened by UMI.)
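The trade-off between full replication and locking described above can be illustrated with a minimal CPU-side sketch. This is not the dissertation's GPU implementation: the histogram-style reduction, thread counts, and function names are illustrative, and fine-grained atomics stand in for the locking scheme.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Full replication: each thread reduces into a private copy (no races),
// and the copies are combined afterwards. Memory grows with thread count.
std::vector<long> reduce_replicated(const std::vector<int>& data, int bins, int nthreads) {
    std::vector<std::vector<long>> copies(nthreads, std::vector<long>(bins, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += nthreads)
                ++copies[t][data[i] % bins];          // private copy: no synchronization
        });
    for (auto& w : workers) w.join();
    std::vector<long> result(bins, 0);                // combination step: extra overhead
    for (const auto& c : copies)
        for (int b = 0; b < bins; ++b) result[b] += c[b];
    return result;
}

// Shared copy with fine-grained synchronization: memory overhead and the
// combination step disappear, but threads now contend on popular bins.
std::vector<long> reduce_shared(const std::vector<int>& data, int bins, int nthreads) {
    std::vector<std::atomic<long>> shared(bins);      // zero-initialized
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += nthreads)
                shared[data[i] % bins].fetch_add(1);  // contended update
        });
    for (auto& w : workers) w.join();
    return std::vector<long>(shared.begin(), shared.end());
}
```

Both versions produce the same reduction result; the choice is purely a memory-versus-contention trade-off, which is exactly what breaks down when the replicated copies no longer fit for large thread counts.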
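The dynamic task scheduling idea can likewise be sketched with two host threads standing in for the CPU and GPU workers: a shared atomic counter hands out partitions on demand, so a faster device naturally claims more of the work. The chunking scheme and names are illustrative, not the dissertation's scheduling framework.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each "device" repeatedly claims the next unprocessed chunk. Because no
// static split is made up front, load stays balanced even when the two
// workers run at very different speeds.
long sum_dynamic(const std::vector<int>& data, std::size_t chunk) {
    std::atomic<std::size_t> next{0};
    std::atomic<long> total{0};
    auto worker = [&] {
        for (;;) {
            std::size_t begin = next.fetch_add(chunk);   // claim a chunk
            if (begin >= data.size()) break;
            std::size_t end = std::min(begin + chunk, data.size());
            long local = std::accumulate(data.begin() + begin, data.begin() + end, 0L);
            total.fetch_add(local);                      // one combine per chunk
        }
    };
    std::thread cpu(worker), gpu(worker);                // stand-ins for CPU and GPU
    cpu.join();
    gpu.join();
    return total;
}
```

A real heterogeneous scheduler adds the pieces this sketch omits: partitions are prepared while earlier ones compute (pipelining), and an idle device can steal partitions already assigned to the other.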
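The overloaded-function mechanism for SIMD code can be sketched in plain C++: the user supplies a scalar version of an operation and an overload that processes a whole lane of reorganized data at once, and the runtime applies the widest overload that fits. The lane width, names, and dispatch loop below are illustrative assumptions, not the dissertation's API; on a real Xeon Phi the lane overload would compile to wide vector instructions.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t LANES = 4;   // illustrative SIMD width
using Lane = std::array<float, LANES>;

// Scalar overload: the fallback for leftover elements (and for control
// flow that the vectorized path cannot handle).
float scale(float x) { return 2.0f * x; }

// Lane overload: the same operation over a full group of elements.
Lane scale(const Lane& x) {
    Lane r;
    for (std::size_t i = 0; i < LANES; ++i) r[i] = 2.0f * x[i];
    return r;
}

// Dispatch: full lanes go through the wide overload, the remainder
// through the scalar one.
std::vector<float> apply_all(std::vector<float> v) {
    std::size_t i = 0;
    for (; i + LANES <= v.size(); i += LANES) {
        Lane in;
        for (std::size_t j = 0; j < LANES; ++j) in[j] = v[i + j];
        Lane out = scale(in);
        for (std::size_t j = 0; j < LANES; ++j) v[i + j] = out[j];
    }
    for (; i < v.size(); ++i) v[i] = scale(v[i]);
    return v;
}
```

The runtime data reorganization mentioned in the abstract serves exactly the step this sketch glosses over: gathering irregularly accessed elements into contiguous lanes so the wide overload can be applied at all.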
Keywords/Search Tags: Applications, Parallel, Intel Xeon Phi, Irregular, Method, SIMD, Data, Emerging