Hardware techniques to improve cache efficiency

Posted on:2010-07-02

Degree:Ph.D

Type:Dissertation

University:The University of Texas at Austin

Candidate:Liu, Haiming

GTID:1448390002972303

Subject:Computer Science

Abstract/Summary:

Modern microprocessors devote a large portion of their chip area to caches in order to bridge the speed and bandwidth gap between the core and main memory. One known problem with caches is that they are usually used with low efficiency; only a small fraction of the cache stores data that will be used before getting evicted. As the focus of microprocessor design shifts towards achieving higher performance per watt, cache efficiency is becoming increasingly important. This dissertation proposes techniques to improve both data cache efficiency in general and instruction cache efficiency for Explicit Data Graph Execution (EDGE) architectures.

To improve the efficiency of data caches and L2 caches, dead blocks (blocks that will not be referenced again before their eviction from the cache) should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed, based on the individual reference history of the block. Such schemes result in lower prediction accuracy and coverage. We delay the prediction to achieve better prediction accuracy and coverage. For the L1 cache, we propose a new class of dead-block prediction schemes that predict dead blocks based on cache bursts. A cache burst begins when a block moves into the MRU position and ends when it moves out of the MRU position. Cache burst history is more predictable than individual reference history and results in better dead-block prediction accuracy and coverage. Experimental results show that predicting the death of a block at the end of a burst gives the best tradeoff between timeliness and prediction accuracy/coverage. We also propose mechanisms to improve counting-based dead-block predictors, which work best at the L2 cache. These mechanisms handle reference-count variations, which cause problems for existing counting-based dead-block predictors. The new schemes can identify the majority of the dead blocks with approximately 90% or higher accuracy. The ability to accurately identify the majority of the dead blocks in the cache long before their eviction can lead not only to higher cache efficiency, but also to reduced power consumption or higher reliability.

In this dissertation, we use the dead-block information to improve cache efficiency and performance through three techniques: replacement optimization, cache bypassing, and prefetching into dead blocks. Replacement optimization evicts blocks that become dead after several reuses, before they reach the LRU position. Cache bypassing identifies blocks that cause cache misses but will not be reused if they are written into the cache, and does not store these blocks in the cache. Prefetching into dead blocks replaces dead blocks with prefetched blocks that are likely to be referenced in the future. Simulation results show that replacement optimization or bypassing improves performance by 5% and prefetching into dead blocks improves performance by 12% over the baseline prefetching scheme for the L1 cache and by 13% over the baseline prefetching scheme for the L2 cache. Each of these three techniques can turn part of the identified dead blocks into live blocks. As new techniques that can better utilize the space of the dead blocks are found, the dead-block information is likely to become more valuable.

To improve instruction cache efficiency in EDGE architectures, we decouple the next-block prediction from the instruction fetch so that the next-block prediction can run ahead of instruction fetch and the predicted blocks can be prefetched into the instruction cache before they cause any I-cache misses. In particular, we discuss how to decouple the next-block prediction from the instruction fetch and how to control the run-ahead distance of the next-block predictor in a fully distributed microarchitecture. The performance benefit of such a look-ahead instruction prefetching scheme is then evaluated and the run-ahead distance that gives the best performance improvement is identified. In addition to prefetching, we also estimate the performance benefit of storing variable-sized blocks in the instruction cache. Such schemes reduce the inefficiency caused by storing NOPs in the I-cache and enable the I-cache to store more blocks with the same capacity. Simulation results show that look-ahead instruction prefetching and storing variable-sized blocks can improve the performance of the benchmarks that have high I-cache miss rates by 17% and 18%, respectively, out of an ideal 30% performance improvement only achievable by a perfect I-cache. (Abstract shortened by UMI.)
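
The cache-burst definition in the abstract (a burst begins when a block moves into the MRU position and ends when it moves out) can be illustrated with a small sketch. The Python below is not the dissertation's predictor. It is a minimal model of a single LRU set that counts, for each block, how many references and how many bursts it saw while resident. The function name `simulate_set`, the `ways` parameter, and the example trace are all hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


def simulate_set(trace, ways=2):
    """Model one LRU cache set (illustrative assumption, not the thesis design).

    Returns (evicted, resident):
      evicted  - list of (block, references, bursts) in eviction order
      resident - dict of block -> (references, bursts) still in the set
    """
    lru = OrderedDict()  # block -> [references, bursts]; last key is MRU
    evicted = []
    for block in trace:
        mru = next(reversed(lru), None)  # current MRU block, if any
        if block in lru:
            counts = lru[block]
            counts[0] += 1               # every hit is one more reference
            if block != mru:
                counts[1] += 1           # block re-enters MRU: a new burst begins
                lru.move_to_end(block)
        else:
            if len(lru) >= ways:
                # Evict the LRU block and record its history.
                victim, (refs, bursts) = lru.popitem(last=False)
                evicted.append((victim, refs, bursts))
            lru[block] = [1, 1]          # fill goes to MRU: first reference, first burst
    return evicted, {b: tuple(c) for b, c in lru.items()}


evicted, resident = simulate_set(["A", "A", "A", "B", "A", "A", "C", "B"], ways=2)
print(evicted)   # [('B', 1, 1), ('A', 5, 2)]
print(resident)  # {'C': (1, 1), 'B': (1, 1)}
```

In this trace, block A receives five references but only two bursts, because consecutive hits while A stays in the MRU position collapse into a single burst. This shows only how burst history is more compact than per-reference history; how the dissertation's predictors consume that history is not modeled here.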

Keywords/Search Tags:

Cache, Improve, Dead blocks, Performance, Techniques, Instruction, Prediction

Related items

1	Research On Prediction Of Instruction Cache Behavior In Real-time Embedded System
2	Research On Low Power Techniques Of The Instruction Fetching Unit In Embedded Processors
3	The Research And Design Of High Performance BWDSP Processor Instruction Cache
4	Software Optimization Techniques For Small Instruction Caches
5	Leveraging Compressed Cache To Improve Disk Performance
6	Research And Design Of Instruction Cache In Cost-effective DSP
7	Key Techniques For Fast Instruction Set Simulator
8	Design Of Cache Based On 16-Bit High Performance CPU
9	Research And Design Of Instruction Cache On DSP
10	Research On The Key Techniques Of Application-Specific Instruction-Set Processors