A scalable instruction queue design for exploiting parallelism

Posted on:2005-05-29

Degree:Ph.D.

Type:Dissertation

University:University of Michigan

Candidate:Raasch, Steven Earl


GTID:1458390008491942

Subject:Computer Science

Abstract/Summary:

To maximize the performance of wide-issue superscalar out-of-order microprocessors, the issue stage must be able to extract as much instruction-level parallelism (ILP) as possible from the dynamic instruction stream. This dissertation examines several approaches to increasing available ILP while minimizing the impact on cycle time.

First, I describe and evaluate a novel instruction queue design (the Segmented Instruction Queue) that eliminates the correspondence between IQ size and cycle time. The 512-entry Segmented IQ achieves between 58% and 98% of the performance of a similarly sized, idealized instruction queue of conventional design, even though the latency of the latter is approximately 256 times larger. The Segmented IQ can be used as a component of a clustered architecture, another approach to reducing cycle-time penalties in wide-issue machines. The dependence tracking mechanism used by the Segmented IQ can also be applied to the problem of instruction placement in clustered architectures.

By changing the mix of instructions present in the IQ, simultaneous multithreading (SMT) can also be used to increase the amount of available ILP. Under SMT, partitioning schemes are needed to distribute resources among threads; however, some of these schemes, clustered architectures in particular, can significantly reduce SMT workload performance. If an SMT machine is to use a clustered microarchitecture, the choice of instruction placement policy must be carefully evaluated to avoid performance degradation. Experiments show that naively allocating clusters to individual threads, which eliminates the dynamic sharing that is the core of SMT, can reduce workload performance on a four-cluster architecture by as much as 26% versus a simple load-balancing scheme. This dissertation presents data that characterizes the performance of SMT workloads in clustered architectures using both conventional instruction queues and segmented instruction queues.

Individually, these mechanisms represent viable approaches to increasing available ILP. When the Segmented IQ is used in an SMT processor design, workload performance achieves an average of 80% and 86% of the idealized performance for two- and four-thread workloads, respectively, indicating that these mechanisms can be combined effectively to increase processor utilization and performance.

Keywords/Search Tags:

Instruction, Performance, Segmented IQ, Available ILP, SMT

Related items

1	Instruction-flow Scheduling Mechanism For High-performance SIMD DSP
2	Optimization And Design Of Instruction Pipeline Of YHFT-DX High Performance DSP
3	Design And Verification Of Instruction Fetch And Dispatch Unit In High Performance DSP
4	Research On The Key Techniques Of Application-Specific Instruction-Set Processors
5	The Research And Design Of High Performance BWDSP Processor Instruction Cache
6	Large-scale in-situ topological analysis using segmented merge trees: Performance, scalability, and power efficienc
7	The Design And Realization Of The Front-End Instruction Fetching Component In 64 Bits High-performance Microprocessor
8	Analysis of segmented reflector antenna for a large millimeter wave radio telescope
9	Design And Implementation Of Multi-thread Processor’s Instruction Dual-issue Structure
10	Research And Implementation Of Instruction Decoder Verification And Vectorization Compiling For BWDSP