
Toward Efficient SIMT Execution---A Microarchitecture Perspective

Posted on: 2015-05-21 | Degree: Ph.D | Type: Dissertation
University: North Carolina State University | Candidate: Xiang, Ping
GTID: 1478390020952320 | Subject: Computer Engineering
Abstract/Summary:
The design philosophy of many-core architectures such as graphics processing units (GPUs) is to exploit thread-level parallelism (TLP) to achieve high throughput. Compared to central processing unit (CPU) designs, GPU-like many-core architectures spend their on-die area mainly on computation rather than on complex instruction processing, and are therefore more energy efficient.

In this dissertation, we identify several inefficiencies of current GPU designs and propose architectural solutions for higher performance and better energy efficiency. First, I present our study on eliminating computational redundancy in GPGPUs. According to our study, SIMD execution exhibits significant computational redundancy: different execution lanes often operate on the same operand values. Besides such redundancy within a uniform vector, different vectors can also hold identical values. We therefore propose detailed architectural designs that exploit both types of redundancy for performance improvement and energy savings. For redundancy within a uniform vector, we propose either extending the vector register file with token bits or adding a small separate scalar register file, eliminating both redundant computation and redundant data storage. For redundancy across different uniform vectors, we adopt instruction reuse, originally proposed for CPU architectures, to detect and eliminate the redundancy. Eliminating redundant computation and data storage yields significant energy savings as well as performance improvement. Furthermore, we propose leveraging such redundancy to protect arithmetic-logic units (ALUs) and register files against hardware errors.

Second, I present a novel resource management scheme for GPGPUs. In this study, we observe that the thread-block (TB)-level resource management currently used in GPGPUs can severely limit the TLP achievable in hardware, since different warps in a TB may finish at different times.
Under TB-level resource management, the resources allocated to early-finishing warps are essentially wasted, as they remain held until the longest-running warp in the same TB finishes. Moreover, TB-level management can lead to resource fragmentation. To overcome these inefficiencies, we propose allocating and releasing resources at the warp level: warps are dispatched to a streaming multiprocessor (SM) as long as it has sufficient resources for one warp rather than for a whole TB, and whenever a warp completes, its resources are released immediately and can accommodate a new warp. In this way, we effectively increase the number of active warps without enlarging the critical resources.

Finally, I present our study on the impact of instruction-level parallelism (ILP) enhancing techniques on GPGPUs. In this study, we show that these ILP techniques can greatly reduce the dependence of performance on TLP. This is especially useful for applications whose resource usage prevents the hardware from running a sufficient number of threads concurrently; in such cases, the ILP techniques deliver significant performance gains at modest hardware cost. Based on this workload-dependent behavior, we then propose a heterogeneous architecture for GPU computing. The proposed heterogeneous GPU architecture contains two types of in-order shader cores: one customized for applications whose resource usage limits their TLP, and the other for applications with ample TLP. Applications can then be scheduled to either core type based on their resource requirements and characteristics, for better performance and energy efficiency.
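The uniform-vector redundancy described above can be illustrated with a minimal sketch. This is not the dissertation's hardware design; the warp size, function names, and operation counter are illustrative assumptions, showing only the core idea that when every SIMD lane holds the same operand value, one scalar operation can stand in for all lane operations.

```python
# Hypothetical sketch of uniform-vector redundancy elimination.
# All names and the warp size are illustrative, not from the dissertation.

WARP_SIZE = 32

def is_uniform(vector):
    """A vector register is 'uniform' when every lane holds the same value."""
    return all(v == vector[0] for v in vector)

def simd_add(a, b):
    """Per-lane add that skips redundant lane computations: when both input
    vectors are uniform, a single scalar add is computed and broadcast,
    instead of one add per lane. Returns (result, lane_ops_executed)."""
    if is_uniform(a) and is_uniform(b):
        return [a[0] + b[0]] * WARP_SIZE, 1       # compute once, broadcast
    return [x + y for x, y in zip(a, b)], WARP_SIZE

uniform = [5] * WARP_SIZE
result, ops = simd_add(uniform, uniform)
print(ops)  # 1 instead of 32
```

A hardware realization would track uniformity with the token bits or scalar register file mentioned in the abstract rather than comparing lane values at execution time.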
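The benefit of warp-level over TB-level resource release can also be sketched numerically. The register counts and function names below are illustrative assumptions, not figures from the dissertation; the sketch only shows why freeing a finished warp's registers immediately lets new warps start sooner than waiting for its whole thread block.

```python
# Hypothetical sketch comparing TB-level and warp-level register release on
# one SM. Numbers are illustrative assumptions, not from the dissertation.

REGS_PER_WARP = 64
SM_REGISTERS = 256          # enough registers for 4 resident warps
TB_SIZE_WARPS = 4           # one resident thread block of 4 warps

def free_regs_tb_level(finished_warps, tb_done):
    """TB-level management: registers come back only when the entire TB is
    done, no matter how many of its warps finished early."""
    return SM_REGISTERS - (0 if tb_done else TB_SIZE_WARPS * REGS_PER_WARP)

def free_regs_warp_level(finished_warps, tb_done):
    """Warp-level management: each finished warp releases its registers
    immediately, making room for new warps."""
    busy_warps = TB_SIZE_WARPS - finished_warps
    return SM_REGISTERS - busy_warps * REGS_PER_WARP

# 3 of the 4 warps in the resident TB have finished; the TB as a whole has not.
print(free_regs_tb_level(3, False))    # 0   -> no new warp can be dispatched
print(free_regs_warp_level(3, False))  # 192 -> room for three new warps
```

In the TB-level case the three early finishers' registers sit idle until the last warp completes, which is exactly the waste the warp-level scheme eliminates.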
Keywords/Search Tags:TLP, Architecture, Resource, GPU, Performance, Energy, Applications