
Microarchitecture for billion-transistor VLSI superscalar processors

Posted on: 2003-11-01
Degree: Ph.D.
Type: Dissertation
University: Yale University
Candidate: Loh, Gabriel Hsiuwei
Full Text: PDF
GTID: 1468390011984822
Subject: Computer Science
Abstract/Summary:
The vast computational resources in billion-transistor VLSI microchips can continue to be used to build aggressively clocked uniprocessors for extracting large amounts of instruction-level parallelism. This dissertation addresses the problems of implementing wide-issue, out-of-order, superscalar processors capable of handling hundreds of in-flight instructions. The specific issues covered are the critical circuits that comprise the superscalar core, the increasing level-one data cache latency, the need for more accurate branch prediction to keep such a large processor busy, and the difficulty of quickly evaluating such complex processor designs.

Using scalable circuit designs, large instruction windows may be implemented with fast clock speeds. We design and optimize the critical circuits in a superscalar execution core. At comparable clock speeds, an instruction window implemented with our circuits can simultaneously wake up and schedule 128 instructions, compared to only twenty instructions in the Alpha 21264.

Augmenting our processor with clustered, speculative Level Zero (L0) data caches provides fast accesses to the data cache despite the increasing distance across the core to the Level One cache. Large superscalar execution cores of future processors may take up so much area that a load from memory requires multiple cycles to propagate across the core, access the cache, and propagate the result back. Multiple L0 caches provide fast, one-cycle cache accesses, at the cost that the value read from an L0 cache may occasionally be incorrect. An eight-cluster superscalar processor augmented with our L0 caches achieves an overall performance within 2% of an unimplementable processor that does not account for the additional wire delay of propagating signals across the large execution core.
We show how the L0 caches can boost the performance of large superscalar processors as well as a range of other possible design points.

Highly accurate prediction of conditional branches is necessary to maintain a steady flow of instructions to the execution core. We explore how to take advantage of the large transistor budget of future processors to build more accurate hardware branch prediction algorithms. In particular, we make use of results from the machine learning field in combining results from multiple predictions. At a 32KB hardware budget, our predictor outperforms the best previously published branch predictor with a 200KB budget. We also take an information-theoretic approach to the analysis of existing branch prediction structures. Our results show that the average information content conveyed by the hysteresis bit of a saturating two-bit counter in an 8192-entry gshare predictor is only 1.11 bits. This motivates our shared split counter, which shares some state between multiple counters, achieving an effective cost of less than 1.5 bits per counter. Using shared split counters instead of saturating two-bit counters enables the implementation of smaller, and therefore faster, branch prediction structures.

As the size and complexity of processors increase, so does the difficulty of the computational task of evaluating potential processor designs. The final contribution of this dissertation is a critical-path-based approach to estimating the performance of superscalar processors. Our technique uses a fast in-order functional processor simulator to provide a program trace. By applying a set of efficient time-stamping rules to the trace, we obtain an accurate estimate of the critical path of the program in less than half of the simulation time of a cycle-accurate simulator.
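
The abstract does not spell out the time-stamping rules, so the following is only a minimal sketch of the general idea, assuming the simplest possible rule: an instruction can start once all of its source registers are ready, and it finishes after a fixed latency. The `TraceEntry` type, the `critical_path_length` function, and the example latencies are hypothetical names and values chosen for illustration; they are not taken from the dissertation, which would also need to model effects such as window size, issue width, and branch mispredictions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One instruction from an in-order functional trace (hypothetical format)."""
    dest: str                 # register written; "" if the instruction writes none
    srcs: Tuple[str, ...]     # registers read
    latency: int              # assumed execution latency in cycles


def critical_path_length(trace: Iterable[TraceEntry]) -> int:
    """Estimate the data-dependence critical path with one pass over the trace.

    Assumed rule: finish time = max(ready time of each source) + latency.
    Registers never written in the trace are treated as ready at time 0.
    """
    ready: Dict[str, int] = {}   # time stamp at which each register value is available
    longest = 0
    for entry in trace:
        start = max((ready.get(reg, 0) for reg in entry.srcs), default=0)
        finish = start + entry.latency
        if entry.dest:
            ready[entry.dest] = finish
        longest = max(longest, finish)
    return longest


# A tiny hand-written trace (illustrative only).
example_trace = [
    TraceEntry("r1", (), 3),            # load r1
    TraceEntry("r2", (), 3),            # load r2
    TraceEntry("r3", ("r1", "r2"), 1),  # add r3 = r1 + r2
    TraceEntry("r4", ("r3",), 4),       # mul r4 = r3 * r3
    TraceEntry("r5", ("r1",), 1),       # add r5 = r1 + 1 (off the critical path)
]

print(critical_path_length(example_trace))  # 8: load (3) -> add (1) -> mul (4)
```

Because each trace entry is visited once and only a table of register time stamps is kept, the cost of such a pass grows linearly with trace length, which is the kind of property that makes a trace-based estimate cheaper than cycle-accurate simulation.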
Keywords/Search Tags: Processor, Superscalar, L0 caches, Branch prediction, Accurate