
Resource management techniques for performance and energy efficiency in multithreaded processors

Posted on: 2007-11-23
Degree: Ph.D
Type: Dissertation
University: State University of New York at Binghamton
Candidate: Sharkey, Joseph James
GTID: 1458390005983357
Subject: Computer Science
Abstract/Summary:
Microprocessor designers who once favored the aggressive extraction of deeper Instruction-Level Parallelism (ILP) from a single application have recently diverted their attention to architectures that harvest parallelism across multiple threads of control, or Thread-Level Parallelism (TLP). This paradigm shift has come in light of new design challenges such as larger wire delays, escalating complexity, increasing power dissipation, and higher operating temperatures. One design paradigm that exploits TLP is Simultaneous Multithreading (SMT), where multiple threads of control execute together on a slightly enhanced superscalar processor core and share its key datapath resources. As the number of transistors available on a chip continues to increase in future technologies, it is likely that a higher degree of multithreading will be supported within each processor core. It is therefore important to consider techniques for increasing the efficiency of SMT-enabled cores, and proposing and investigating such solutions is precisely the goal of this dissertation.

We begin by examining the key shared datapath resources, namely the issue queue (IQ) and the register files (RF). For the IQ, we first propose instruction packing, a technique that opportunistically places two instructions into the same IQ entry provided that each of these instructions has at most one non-ready source operand at the time of dispatch. Instruction packing yields a 40% reduction in IQ power and a 26% reduction in wakeup delay at a performance cost of only 0.6% for a 4-threaded SMT machine. We then take the ideas behind instruction packing one step further and propose the 2OP_BLOCK scheduler, a scheduling technique that completely disallows the dispatch of instructions with two non-ready sources, thus significantly simplifying the IQ logic.
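The two dispatch-time tests just described can be sketched in a few lines. This is an illustrative sketch only, not code from the dissertation; the instruction representation and function names are hypothetical, and a real implementation would be wakeup logic in hardware rather than software.

```python
# Hypothetical software sketch of the dispatch-time eligibility tests behind
# instruction packing and the 2OP_BLOCK scheduler. Instructions are modeled
# as dicts listing their source registers; "ready_regs" is the set of
# registers whose values are already available.

def nonready_sources(instr, ready_regs):
    """Count source operands whose values are not yet available."""
    return sum(1 for src in instr["sources"] if src not in ready_regs)

def can_pack(i1, i2, ready_regs):
    # Instruction packing: two instructions may share one IQ entry only if
    # each has at most one non-ready source operand at dispatch time.
    return (nonready_sources(i1, ready_regs) <= 1 and
            nonready_sources(i2, ready_regs) <= 1)

def may_dispatch_2op_block(instr, ready_regs):
    # 2OP_BLOCK: stall dispatch of any instruction with two non-ready
    # sources, keeping IQ entries free for instructions that will issue soon.
    return nonready_sources(instr, ready_regs) < 2

ready = {"r1", "r2"}
add = {"sources": ["r1", "r2"]}   # both sources ready
mul = {"sources": ["r3", "r4"]}   # both sources non-ready
print(can_pack(add, mul, ready))            # False: mul has 2 non-ready sources
print(may_dispatch_2op_block(add, ready))   # True
print(may_dispatch_2op_block(mul, ready))   # False: blocked from dispatch
```

The 2OP_BLOCK test is simply the packing-eligibility test applied to every instruction at dispatch, which is why it removes the second wakeup comparator from every IQ entry.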
This mechanism works well for SMTs because it often allows the same IQ entry to be reused multiple times by instructions with at most one non-ready source, rather than tying the entry up with an instruction with two non-ready sources, which typically spends a longer time in the queue. Applied to a 4-threaded SMT with a 32-entry scheduler, the 2OP_BLOCK design provides a 33% increase in throughput and a 27% improvement in fairness.

Our next technique addresses the bottleneck associated with another key shared resource of the SMT datapath: the physical register file (RF). We propose a novel mechanism for early deallocation of physical registers that increases register file efficiency and provides higher performance for the same number of registers by exploiting two fundamental trends in multithreaded processor design: (a) increasing memory access latencies and (b) a relatively higher number of L2 cache misses due to cache-sharing effects. Applied to a 4-threaded SMT machine with 256 integer and 256 floating-point registers (512 registers combined), our technique provides additional gains, in terms of throughput IPC (fairness metric), of 33% (25%) on top of the DCRA mechanism, 38% (26%) on top of the Hill-Climbing technique, and 51% (48%) on top of the ICOUNT fetch policy. Our technique is unique in that it incurs no tag re-broadcasts, register re-mappings, associative searches, rename-table modifications, or register file checkpoints; it requires neither per-register consumer counters nor any additional storage within the datapath.
Instead, it relies on simple off-the-critical-path logic at the back end of the pipeline to identify early-deallocation opportunities and to save the values of the early-deallocated registers for precise state reconstruction.

Finally, we show that there are complex interactions between the shared and the private per-thread resources in an SMT processor, and that these interactions need to be fully considered to understand the nuances of SMT architectures and to realize the full performance potential of multithreading. We show that without such an understanding, unexpected phenomena may occur. For example, an across-the-board increase in the size of the per-thread reorder buffers (ROBs) often decreases instruction throughput on an SMT machine due to the excessive pressure it places on shared SMT resources such as the issue queue and the register file. We propose mechanisms, and the underlying ROB organization, to dynamically adapt the number of ROB entries allocated to each thread only when such adaptations do not increase pressure on the shared datapath resources. Our studies show that such dynamic adaptation of the ROBs yields significant gains on top of the DCRA resource allocation policy in terms of both throughput (54% over similarly-sized static ROBs and 21% over the best-performing static configuration) and fairness (29% and 10%, respectively). We also demonstrate that the performance of adaptive ROBs approaches that of a datapath with an infinite issue queue, thus completely eliminating the side effects of ROB scaling on the shared issue queue and obviating the need for more complex ROB management mechanisms.
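The adaptation policy described above can be sketched as a simple feedback loop. This is a minimal illustrative sketch, not the dissertation's mechanism: the occupancy thresholds, step size, and function names are all assumptions introduced here, and the real design operates on hardware occupancy counters rather than floats.

```python
# Hypothetical sketch of dynamic per-thread ROB allocation in the spirit of
# the adaptive-ROB mechanism: grow a thread's ROB partition only while the
# shared resources (issue queue, register file) have slack, and shrink it
# when they are under pressure. Thresholds and step size are assumed values.

IQ_PRESSURE_LIMIT = 0.9   # assumed occupancy thresholds (fraction of entries in use)
RF_PRESSURE_LIMIT = 0.9
ROB_STEP = 16             # assumed adaptation granularity, in ROB entries

def adapt_rob(rob_sizes, thread, iq_occupancy, rf_occupancy, rob_max):
    """Update one thread's ROB allocation for a single adaptation interval."""
    if iq_occupancy < IQ_PRESSURE_LIMIT and rf_occupancy < RF_PRESSURE_LIMIT:
        # Shared resources have slack: allow the thread more in-flight instructions.
        rob_sizes[thread] = min(rob_sizes[thread] + ROB_STEP, rob_max)
    else:
        # A larger ROB would only add pressure on the shared IQ/RF: back off.
        rob_sizes[thread] = max(rob_sizes[thread] - ROB_STEP, ROB_STEP)
    return rob_sizes

sizes = {0: 64, 1: 64}
sizes = adapt_rob(sizes, 0, iq_occupancy=0.5, rf_occupancy=0.6, rob_max=256)
print(sizes[0])   # 80: grown while the shared resources are under-utilized
sizes = adapt_rob(sizes, 0, iq_occupancy=0.95, rf_occupancy=0.6, rob_max=256)
print(sizes[0])   # 64: shrunk back under issue-queue pressure
```

The key design point the sketch captures is that ROB growth is conditional on shared-resource occupancy, which is what prevents the pathological case of larger per-thread ROBs lowering overall SMT throughput.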
Keywords/Search Tags:Processor, SMT, Issue queue, Technique, ROB, Performance, Instruction, Shared