
Towards High-Performance and Energy-Efficient Memory Hierarchy in Current and Future Chip Multiprocessors

Posted on: 2013-03-25
Degree: Ph.D.
Type: Dissertation
University: North Carolina State University
Candidate: Samih, Ahmad
Full Text: PDF
GTID: 1458390008970512
Subject: Engineering
Abstract/Summary:
Chip Multiprocessors (CMPs) are becoming the de facto hardware architecture across a range of computing platforms. Following Moore's law, the number of cores in CMPs is expected to keep growing as transistor dimensions continue to shrink. As the core count increases, the complexity and trade-offs of CMP design shift toward the uncore part of the chip. Two major components of the uncore subsystem are the Last Level Cache (LLC) and the interconnect.

As for the LLC, it remains an open question whether it should be private to each core or shared by all cores. A physically shared LLC allows applications to naturally divide up and share the aggregate cache space, but a large shared cache has a high access latency. Private per-core LLCs, on the other hand, provide low-latency accesses to the corresponding core and allow a more scalable multicore configuration; however, CMPs with private LLCs suffer from a cache fragmentation problem: some caches may be over-utilized while others are under-utilized. To avoid such fragmentation, researchers have proposed capacity-sharing mechanisms in which applications that need additional cache space can place their victim blocks in remote caches. However, we found that placing victim blocks in remote caches without considering their temporal locality tends to cause a high number of remote cache hits relative to local cache hits, which in turn results in sub-optimal capacity-sharing performance.

Moreover, the rising number of on-chip cores in CMPs has mandated more scalable interconnects, such as mesh and torus topologies, which consume an increasing fraction of the total chip power. As technology and operating voltage scale down, static power consumes a larger fraction of the total power, and reducing it is increasingly important for energy-proportional computing. Processor designers currently strive to send under-utilized cores into deep sleep states to reduce idle power and improve overall energy efficiency and energy proportionality. However, even in state-of-the-art CMP designs, the interconnect remains fully active regardless of the number of active cores, undermining energy proportionality.

In this dissertation, we address these two problems to improve the performance and energy efficiency of the uncore subsystem in current and future CMPs.

First, we show that, in CMPs with private LLCs, current capacity-sharing techniques cause a high number of remote LLC hits relative to local LLC hits. We also show that many of these remote cache hits can be converted into local cache hits if newly fetched blocks are allowed to be placed selectively and directly in a remote cache rather than in the local cache. To demonstrate this, we use future trace information to estimate the near-upper-bound performance that can be gained from combined placement and replacement decisions in capacity sharing. Further, we design a simple predictor-based scheme, called Adaptive Placement Policy (APP), that learns from past cache behavior to decide whether to place a newly fetched block in the local or a remote cache. Across several multi-programmed workload mixes running on a 4-core CMP, APP's capacity-sharing mechanism increases aggregate performance by 29% on average.
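As an illustration of such a predictor, the minimal sketch below keeps a small table of saturating counters and steers each fill to the local or a remote LLC. The class name, table size, address-based indexing, and 2-bit counters are illustrative assumptions, not the dissertation's actual APP design.

```python
# Minimal sketch of a predictor-based placement policy in the spirit of APP.
# All structural parameters (table size, 2-bit counters, indexing by block
# address) are assumptions for illustration only.

class PlacementPredictor:
    """Predict whether a newly fetched block should be placed in the
    local LLC or spilled directly to a remote LLC."""

    TABLE_SIZE = 1024            # assumed number of predictor entries
    BLOCK_SHIFT = 6              # assumed 64-byte cache blocks

    def __init__(self):
        # 2-bit saturating counters, initialized weakly toward "local".
        self.counters = [2] * self.TABLE_SIZE

    def _index(self, block_addr: int) -> int:
        return (block_addr >> self.BLOCK_SHIFT) % self.TABLE_SIZE

    def predict_local(self, block_addr: int) -> bool:
        """Consulted on a fill: True means place in the local LLC."""
        return self.counters[self._index(block_addr)] >= 2

    def train(self, block_addr: int, reused_locally: bool) -> None:
        """Consulted on eviction: strengthen "local" if the block was
        re-referenced locally before eviction, otherwise drift toward
        "remote" placement."""
        i = self._index(block_addr)
        if reused_locally:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A cache controller in this spirit would call predict_local() on every fill and train() on every eviction; steering low-locality fills away from the local cache preserves local capacity for blocks that do exhibit local reuse, which is how remote hits can be converted into local hits.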
Second, to reduce interconnect idle power and achieve better energy proportionality, we propose Router Parking, an approach that selectively power-gates routers attached to sleeping cores. Router Parking ensures that network connectivity is maintained and limits the average interconnect latency impact of detouring packets around parked routers. We present two Router Parking algorithms: an aggressive approach that parks as many routers as possible, and a conservative approach that parks a limited set of routers to keep the latency increase minimal. Further, we propose an adaptive policy that chooses between the two algorithms at run time. Evaluations on synthetic and real workloads show that Router Parking can reduce total interconnect energy by up to 41%.
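The sketch below illustrates the connectivity constraint at the heart of such a scheme on a 2D mesh: candidate routers are parked greedily, and a candidate is skipped if parking it would partition the network. The mesh helpers, the greedy order, and the max_parked cap (standing in for the conservative algorithm's latency limit) are simplifying assumptions, not the dissertation's actual algorithms.

```python
# Minimal sketch of router parking on a 2D mesh with a connectivity guard.
# The greedy order and the max_parked cap are illustrative assumptions.

from collections import deque

def neighbors(node, width, height):
    """4-connected mesh neighbors of a router at coordinates (x, y)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)

def is_connected(active, width, height):
    """BFS over active routers: parking must never partition the network."""
    if not active:
        return True
    start = next(iter(active))
    seen = {start}
    queue = deque([start])
    while queue:
        for nbr in neighbors(queue.popleft(), width, height):
            if nbr in active and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen == active

def park_routers(all_routers, sleeping, width, height, max_parked=None):
    """Greedily park routers of sleeping cores while keeping the network
    connected. max_parked=None models the aggressive algorithm; a small
    cap crudely models the conservative, latency-limiting one."""
    active = set(all_routers)
    parked = set()
    for router in sorted(sleeping):
        if max_parked is not None and len(parked) >= max_parked:
            break
        if is_connected(active - {router}, width, height):
            active -= {router}
            parked.add(router)
    return parked
```

An adaptive controller could pass max_parked=None when network load is low and a small cap when load is high, mirroring the run-time choice between the aggressive and conservative algorithms.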
Keywords/Search Tags: Energy, Router Parking, Chip, Cache, Current, CMP, Performance, CMPs