
Exploiting multi-threaded application characteristics to optimize performance and power of chip-multiprocessors

Posted on: 2006-11-13
Degree: Ph.D
Type: Dissertation
University: The Pennsylvania State University
Candidate: Liu, Chun
Full Text: PDF
GTID: 1458390008959167
Subject: Computer Science
Abstract/Summary:
Chip multiprocessors (CMPs) are becoming a popular way of exploiting the ever-increasing number of on-chip transistors. Multi-threaded applications are well placed to utilize the raw power that CMPs provide, but those applications exhibit load imbalance at various levels. From the CMP's point of view, the location of data on the chip can play a critical role in the performance of these applications because of growing on-chip storage capacities and the relative cost of wire delays. It is important to place data in the right location at the right time in the on-chip cache hierarchy. We study load imbalance at the barrier, among cache requests from different cores, and between actively shared blocks and mostly privately accessed blocks. We propose techniques that exploit these imbalances to improve power and performance.

For the load imbalance at the barrier, we observe that the imbalance is quite predictable, and we therefore propose a novel technique for optimizing the power consumption of CMPs using an integrated hardware-software mechanism. Using a high-level synchronization construct, the barrier, our technique tracks the idle time a processor spends waiting for other processors to reach the same point in the program. With this knowledge, the frequency of the processors can be modulated to reduce or eliminate these idle times, thus providing power savings without compromising performance.

For the imbalanced cache demands from different cores, we observe that imbalance between the L2 demands across the cores favors a shared L2 organization, while interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations while avoiding many of their drawbacks.

We also study the demands on different blocks of the L2, namely actively shared blocks and mostly privately accessed blocks, and show that, while there is a considerable number of L2 accesses to shared data, the volume of this data is relatively low. Consequently, it is important to keep this shared data fairly close to all processor cores for both performance and power reasons. Motivated by this observation, we propose a small center cell cache, residing in the middle of the processor cores, which provides fast access to its contents. We demonstrate that this cache organization can considerably lower the number of block migrations between the L2 portions that are closer to each core, thus providing better performance. Combined with sequential tag-data access, the power consumption of the shared cache can be further reduced.
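As a rough illustration of the barrier-based frequency modulation described in the abstract, the following Python sketch picks a per-core frequency so that faster cores slow down and arrive at the barrier closer to the slowest core. It is not the dissertation's mechanism. The frequency levels, the function names, and the assumption that execution time scales inversely with frequency are all hypothetical choices made for this example.

```python
from bisect import bisect_left

# Hypothetical discrete frequency levels, as fractions of the nominal frequency.
LEVELS = [0.25, 0.5, 0.75, 1.0]


def pick_frequencies(busy_times, levels=LEVELS):
    """Choose one frequency scale per core for the next barrier interval.

    busy_times: predicted compute time of each core at nominal frequency
    (all values must be positive). Assumes time scales as 1 / frequency,
    which is only reasonable for compute-bound work.
    """
    critical = max(busy_times)  # the slowest core sets the barrier time
    chosen = []
    for t in busy_times:
        ideal = t / critical  # lowest scale that still arrives on time
        # Round up to the next available level so the barrier is never delayed.
        idx = bisect_left(levels, ideal)
        chosen.append(levels[min(idx, len(levels) - 1)])
    return chosen


def idle_times(busy_times, scales):
    """Return how long each core waits at the barrier under the given scales."""
    finish = [t / s for t, s in zip(busy_times, scales)]
    barrier = max(finish)
    return [barrier - f for f in finish]


if __name__ == "__main__":
    busy = [10.0, 5.0, 7.5, 2.0]
    scales = pick_frequencies(busy)
    print("scales:", scales)
    print("idle before:", idle_times(busy, [1.0] * len(busy)))
    print("idle after: ", idle_times(busy, scales))
```

Rounding each ideal scale up to the next available level keeps the slowest core's arrival time unchanged, which mirrors the abstract's goal of saving power without compromising performance. Any idle time that remains comes from the coarse spacing of the assumed levels.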
Keywords/Search Tags: Power, Performance, Processor, Shared, Cache, Data, CMPs