Micro-architectural support for improving synchronization and efficiency of SIMD execution on GPUs

Posted on:2014-10-01

Degree:Ph.D

Type:Dissertation

University:Northeastern University

Candidate:Yilmazer, Ayse

GTID:1458390008456049

Subject:Engineering

Abstract/Summary:

GPUs dedicate a majority of their transistor budgets to compute units rather than control logic. As a result, they can achieve excellent data-parallel power/performance. Given the continual demands for performance and power efficiency, GPUs have become today's compute accelerators for many application domains. The general-purpose community has been focusing on developing strategies to move a broader class of applications to these powerful devices. The underlying GPU architecture has been adapted to run a limited class of general-purpose computations present across a range of applications. Many applications have already been ported to GPU platforms to take advantage of the potential data-parallel performance that GPUs afford. But there still remain barriers to migrating a broader class of applications onto GPUs. Being originally designed to run 3-D graphics, GPUs are highly optimized for graphics workloads. Graphics workloads possess a high degree of uniformity in their execution. Therefore, GPU architectures are optimized for efficient uniform execution. GPUs achieve high performance with data-parallel applications possessing regular control flow (i.e., predictable loops) and data access patterns that can effectively exploit high off-chip memory bandwidth. However, many general-purpose real-world applications differ from graphics workloads: they come with large input sets exhibiting irregular access and synchronization patterns, and they possess varying computational granularity and irregular control flow. The current requirements for uniformity and predictability present barriers to moving a broader range of applications to GPUs. We believe that if GPUs are to become mainstream computing devices, it is necessary to relax some of these constraints. Only then can a wider variety of applications exploit the computational power of GPUs. One critical barrier present in non-uniform data-parallel applications is the need to synchronize between threads.
Fine-grained synchronization is needed to support shared data access, especially when faced with irregular access and communication patterns. This dissertation presents a new approach to enhance the efficiency and scalability of GPU synchronization. The proposed scheme can enable applications that work on shared data to effectively communicate at finer levels of granularity. To achieve this ambitious goal, we propose a new synchronization approach called Hierarchical Queuing Locks (HQL). HQL is a novel hardware-based synchronization mechanism which provides efficient use of resources through execution blocking and hierarchical queuing. To provide a queue-based locking mechanism, HQL extends current GPU L1 and L2 cache management protocols by adding a synchronization protocol. Integration of HQL's synchronization protocol simplifies synchronization, but adds a level of complexity to the cache management protocol. Given this added complexity in the cache management scheme, as part of this dissertation we provide a formal verification of the proposed HQL synchronization protocol. To evaluate the benefits of HQL, we start by studying a set of micro-benchmarks that represent highly irregular applications that require frequent synchronization. We additionally evaluate macro-benchmarks that utilize synchronization. We report on both the performance benefits and the savings in terms of instructions executed. Building upon the efficient fine-grained synchronization support provided by HQL, we explore Scalar Waving (SW) and Simultaneous Scalar and SIMD group Waving (SSSW) architectures to further improve the efficiency of SIMD execution on GPUs. These two mechanisms attempt to reduce the amount of redundant computation performed by the threads in a SIMD group. SW and SSSW improve SIMD efficiency for both irregular and regular applications. We motivate this work by reporting on the percentage of redundant computations present in a range of workloads.
We then quantitatively evaluate the benefits of SW and SSSW architectures using programs taken from four different benchmark suites. The impact of this dissertation is the design of architectural features that can make the benefits of GPU computing available to a much wider range of applications. These kinds of enhancements can only further accelerate the adoption of GPUs as first-class computing devices.

Keywords/Search Tags:

GPU, GPUs, Synchronization, SIMD, Applications, Execution, HQL, Efficiency

Related items

1	Exploiting Parallelism in GPUs
2	Optimizing Throughput and Power Consumption of Graphics Processing Units (GPUs)
3	Automatic transformation and optimization of applications on GPUs and GPU clusters
4	Research On Auto-Vectorization Compiling Techniques Oriented To Irregular Applications On SIMD Extension
5	Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs
6	Research On Program Execution Model Based On Runtime
7	Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularit
8	Research On SIMD Vectorization And Optimization Of Non-Multimedia Applications
9	The Research Of Improving The Execution Efficiency Of Java Code In Embedded Device
10	ILP-SIMD: An instruction parallel SIMD architecture with short-wire interconnects