
Toward Efficient SIMT Execution---A Microarchitecture Perspective

Posted on: 2015-05-21 | Degree: Ph.D | Type: Dissertation
University: North Carolina State University | Candidate: Xiang, Ping
GTID: 1478390020952320 | Subject: Computer Engineering
Abstract/Summary:
The design philosophy of many-core architectures such as graphics processing units (GPUs) is to exploit thread-level parallelism (TLP) to achieve high throughput. Compared to central processing unit (CPU) designs, GPU-like many-core architectures spend their on-die area mainly on computation rather than on complex instruction processing, and are therefore more energy efficient.

In this dissertation, we identify several inefficiencies of current GPU designs and propose architectural solutions for higher performance and better energy efficiency. First, I present our study on eliminating computational redundancy in GPGPUs. According to our study, SIMD execution exhibits significant computational redundancy: different execution lanes often operate on the same operand values. Besides such redundancy within a uniform vector, different vectors can also hold identical values. We therefore propose detailed architectural designs that exploit both types of redundancy for performance improvement and energy savings. For redundancy within a uniform vector, we propose either extending the vector register file with token bits or adding a small separate scalar register file, eliminating both redundant computation and redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, originally proposed for CPU architectures, to detect and eliminate the redundancy. Eliminating redundant computation and data storage yields significant energy savings as well as performance improvement. Furthermore, we propose leveraging such redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors.

Second, I present a novel resource management scheme for GPGPUs. In this study, we observe that the thread-block (TB)-level resource management currently used in GPGPUs can severely limit the TLP achievable in hardware, since different warps in a TB may finish at different times.
Under TB-level resource management, the resources allocated to early-finishing warps are essentially wasted, as they remain held until the longest-running warp in the same TB finishes. Moreover, TB-level management can lead to resource fragmentation. To overcome these inefficiencies, we propose allocating and releasing resources at the warp level: warps are dispatched to a streaming multiprocessor (SM) as long as it has sufficient resources for one warp rather than for a whole TB, and whenever a warp completes, its resources are released immediately and can accommodate a new warp. In this way, we effectively increase the number of active warps without enlarging the critical resources.

Finally, I present our study on the impact of instruction-level parallelism (ILP) enhancing techniques on GPGPUs. In this study, we show that these ILP techniques can greatly reduce the dependence of performance on TLP. This is especially useful for applications whose resource usage prevents the hardware from running a sufficient number of threads concurrently; in such cases, the ILP techniques deliver significant performance gains at modest hardware cost. Based on this workload-dependent behavior, we then propose a heterogeneous architecture for GPU computing. The proposed heterogeneous GPU architecture contains two types of in-order shader cores: one customized for applications whose resource usage limits their TLP, and the other for applications with ample TLP. Applications can then be scheduled to either core type based on their resource requirements and characteristics, for better performance and energy efficiency.
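The uniform-vector redundancy described above can be illustrated with a minimal sketch. This is not the dissertation's hardware design; the warp size, function names, and operation counter are illustrative assumptions, showing only the core idea that when every SIMD lane holds the same operand value, one scalar operation can stand in for all lane operations.

```python
# Hypothetical sketch of uniform-vector redundancy elimination.
# All names and the warp size are illustrative, not from the dissertation.

WARP_SIZE = 32

def is_uniform(vector):
    """A vector register is 'uniform' when every lane holds the same value."""
    return all(v == vector[0] for v in vector)

def simd_add(a, b):
    """Per-lane add that skips redundant lane computations: when both input
    vectors are uniform, a single scalar add is computed and broadcast,
    instead of one add per lane. Returns (result, lane_ops_executed)."""
    if is_uniform(a) and is_uniform(b):
        return [a[0] + b[0]] * WARP_SIZE, 1       # compute once, broadcast
    return [x + y for x, y in zip(a, b)], WARP_SIZE

uniform = [5] * WARP_SIZE
result, ops = simd_add(uniform, uniform)
print(ops)  # 1 instead of 32
```

A hardware realization would track uniformity with the token bits or scalar register file mentioned in the abstract rather than comparing lane values at execution time.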
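The benefit of warp-level over TB-level resource release can also be sketched numerically. The register counts and function names below are illustrative assumptions, not figures from the dissertation; the sketch only shows why freeing a finished warp's registers immediately lets new warps start sooner than waiting for its whole thread block.

```python
# Hypothetical sketch comparing TB-level and warp-level register release on
# one SM. Numbers are illustrative assumptions, not from the dissertation.

REGS_PER_WARP = 64
SM_REGISTERS = 256          # enough registers for 4 resident warps
TB_SIZE_WARPS = 4           # one resident thread block of 4 warps

def free_regs_tb_level(finished_warps, tb_done):
    """TB-level management: registers come back only when the entire TB is
    done, no matter how many of its warps finished early."""
    return SM_REGISTERS - (0 if tb_done else TB_SIZE_WARPS * REGS_PER_WARP)

def free_regs_warp_level(finished_warps, tb_done):
    """Warp-level management: each finished warp releases its registers
    immediately, making room for new warps."""
    busy_warps = TB_SIZE_WARPS - finished_warps
    return SM_REGISTERS - busy_warps * REGS_PER_WARP

# 3 of the 4 warps in the resident TB have finished; the TB as a whole has not.
print(free_regs_tb_level(3, False))    # 0   -> no new warp can be dispatched
print(free_regs_warp_level(3, False))  # 192 -> room for three new warps
```

In the TB-level case the three early finishers' registers sit idle until the last warp completes, which is exactly the waste the warp-level scheme eliminates.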
Keywords/Search Tags:TLP, Architecture, Resource, GPU, Performance, Energy, Applications