
Research On Cache Data Management Mechanism In Chip Multi-processor For Latency Reduction

Posted on: 2014-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: A W Huang
Full Text: PDF
GTID: 1108330479479657
Subject: Electronic Science and Technology
Abstract/Summary:
The Chip Multi-Processor (CMP) has emerged as a dominant architecture thanks to the continuous improvement of semiconductor technology and the rapid development of VLSI (Very Large Scale Integration) design, which provided the conditions for the birth of CMP and gradually matured the concept. CMP is now widely used in important areas such as commercial servers, high-performance computing, desktop computers and embedded systems, owing to its powerful computing ability, lower design complexity and better scalability compared with other processors. However, the "memory wall" problem continues to worsen as the speed gap between processor and memory widens, and it has long been a serious obstacle to performance improvement in modern CMPs.

As the critical component bridging the processor-memory speed gap, the on-chip cache plays an important role in mitigating the "memory wall" problem, so an efficient on-chip cache architecture and data management mechanism are essential for improving system performance. On-chip cache capacity grows with each advance in semiconductor technology, and modern microprocessors rely on complex on-chip interconnects to satisfy their communication requirements. Moreover, most applications exhibit diverse memory behavior. Under such circumstances it is difficult to trade off the low miss rate of a shared cache against the low hit latency of a private cache. CMP thus presents many new challenges for large-capacity cache design, which severely restrict the performance of the memory system.
Aiming at the "memory wall" problem in microprocessor design, this dissertation analyzes the challenges faced by the private, shared and hybrid cache architectures in CMP, and investigates cache latency reduction techniques from the perspective of data management. The contributions of this dissertation are as follows.

Firstly, a cache capacity sharing mechanism based on fine-grained pseudo-partitioning (CSFP) is proposed to address the capacity misses suffered by the private caches in a CMP. Each cache bank is equipped with a weighted saturating counter array that collects and predicts, at a fine granularity, the diverse memory demands of different threads. The private and shared regions of each cache set are adjusted adaptively, and the partition decision both guides the selection of the victim block and dynamically controls the cooperation of spilling and receiving. This intelligent capacity sharing mechanism corrects the memory imbalance between cores and effectively mitigates capacity misses in the CMP private cache. A 16-core tiled CMP model is built on the Simics platform, a cycle-accurate architecture simulator; CSFP is implemented on top of the baseline private cache structure and its performance is evaluated. The experimental results show that CSFP reduces the capacity misses of private caches in a CMP environment and significantly improves memory system performance, reducing the execution time of the benchmarks by 8.57% on average.

Secondly, an interference miss isolation mechanism based on skewed mapping (IMI-SM) is proposed to mitigate the conflict misses that occur when the cache is shared and competitively occupied by multiple threads running on different processor cores.
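The weighted saturating-counter partition decision described for CSFP above might be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the counter width, weights and the mapping from counter value to partition point are all hypothetical, since the abstract does not specify them.

```python
class PseudoPartitionSet:
    """Illustrative fine-grained pseudo-partition state for one cache set."""

    def __init__(self, ways=8, counter_bits=4):
        self.ways = ways
        self.max_count = (1 << counter_bits) - 1
        self.counter = self.max_count // 2  # start with a balanced partition

    def on_local_miss(self, weight=2):
        # Local demand is rising: push the counter up to grow the private region.
        self.counter = min(self.max_count, self.counter + weight)

    def on_remote_hit(self, weight=1):
        # Spilled blocks from other cores are being reused: grow the shared region.
        self.counter = max(0, self.counter - weight)

    def private_ways(self):
        # Map the counter value onto a partition point within the set,
        # always leaving at least one way on each side.
        return 1 + round((self.ways - 2) * self.counter / self.max_count)

    def victim_region(self, requester_is_local):
        # The partition decision guides replacement: a local request evicts
        # from the private region, a spill request evicts from the shared region.
        return "private" if requester_is_local else "shared"
```

Under this sketch, a thread with heavy local demand drives the counter toward saturation and reclaims most of the set, while sets whose spilled blocks keep getting reused drift toward lending capacity out.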
An inter-thread or intra-thread interference miss may occur when the least-recently-used victim candidate is evicted from the last-level cache. The skewed mapping mechanism is triggered once an interference miss is predicted. By introducing a dedicated cache component and an intra-bank pressure balancing operation, new data fetched from off-chip memory can be placed dynamically either in an on-chip conflict-isolation cache region or in a cache set under light pressure, effectively mitigating the negative impact of interference misses in the CMP shared cache. The experimental results show that the execution time of the tested programs is reduced by approximately 7.35% on average. IMI-SM reduces interference misses in the CMP shared cache to a considerable extent, and system performance is improved significantly with negligible hardware overhead.

Thirdly, an enhanced victim replication mechanism (E-VR) is proposed to reduce the long hit latency of the large distributed shared cache in a tiled CMP. By introducing a victim filter and target detection, E-VR considers not only the sharing pattern and write-read characteristics of the victim block, but also the non-uniform distribution of memory demand within a bank at a fine granularity. The benefit of victim replication is improved significantly by reducing the probability of costly replication operations and by extending the range of candidate target sets that can retain the victim block. The experimental results show that the execution time of the tested benchmarks is reduced by 6.97% on average. E-VR efficiently reduces the on-chip cache hit latency while avoiding any obvious negative impact on the global hit rate of the shared cache.
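The skewed-mapping idea at the heart of IMI-SM, the second contribution above, can be illustrated with a toy set-index function. The hash below is a generic XOR-with-rotated-tag skew chosen purely for illustration; the dissertation's actual skewing functions, interference predictor and pressure metric are not given in the abstract.

```python
def modulo_index(addr, sets):
    """Conventional indexing: colliding addresses always collide."""
    return addr % sets

def skewed_index(addr, way, sets):
    """A way-dependent skewing hash: XOR the set bits with the tag
    rotated right by the way number (within 8 bits, for illustration)."""
    tag = addr // sets
    rot = ((tag >> way) | (tag << (8 - way))) & 0xFF
    return (addr ^ rot) % sets

def place(addr, ways, sets, pressure):
    """On a predicted interference miss, choose the way whose skewed set
    is under the least pressure (e.g. fewest recent misses)."""
    candidates = [(pressure[skewed_index(addr, w, sets)], w) for w in range(ways)]
    _, best_way = min(candidates)
    return best_way, skewed_index(addr, best_way, sets)
```

Two addresses that fight over one set under modulo indexing rarely collide in every way under the skew, so a block predicted to suffer interference can be re-placed under a lightly pressured index instead of repeatedly evicting its rival.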
Memory efficiency is improved through a better balance between low hit latency and low miss rate.

Fourthly, an adaptive data management framework, F-RMR, is proposed based on virtual shared regions (VSRs) in the distributed cache of a CMP, integrating the replacement, migration and replication policies into a unified system. In F-RMR, the replacement decision is made adaptively with regard to the uniqueness of the victim candidate in the target set of the local bank. In addition, both the activity degree of the remote source block and the state of the victim candidate in the local VSR are taken into account, so blocks accessed by different processor cores can be migrated and replicated between VSRs. The tension between long hit latency and capacity utilization is resolved by coordinating replacement, migration and replication, which reduces the average memory access time. The experimental results show that the average memory access time is reduced by approximately 7.59% when the sharing degree is set to 4. Moreover, F-RMR performs well under different sharing degrees compared with the traditional virtual partition mechanism, while the additional hardware overhead is negligible.
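A decision policy in the spirit of F-RMR might look like the sketch below. All thresholds, inputs and action names here are hypothetical: the abstract only states that replacement, migration and replication are coordinated using the remote block's activity and the state of the local victim candidate, so this is one plausible reading, not the dissertation's algorithm.

```python
def frmr_action(remote_activity, shared_by_many, local_victim_unique,
                local_victim_activity, hot_threshold=4):
    """Choose an action for a block whose home is a remote bank.

    remote_activity       -- access count of the remote source block
    shared_by_many        -- True if several cores access the block
    local_victim_unique   -- True if the local victim is the only on-chip copy
    local_victim_activity -- access count of the local victim candidate
    """
    if remote_activity < hot_threshold:
        # Cold remote data: plain replacement at its home bank, no local copy.
        return "replace"
    if local_victim_unique and local_victim_activity >= remote_activity:
        # The local victim is the hotter sole copy of its data: keep it
        # rather than sacrificing it for the remote block.
        return "replace"
    if shared_by_many:
        # Hot, widely shared data: keep a local replica to cut hit latency
        # without removing the copies other cores rely on.
        return "replicate"
    # Hot data used mostly by this core: migrate its home to the local VSR.
    return "migrate"
```

The point of the sketch is the coordination itself: replication is only paid for when the block is hot and shared, migration handles hot private data, and replacement falls through when copying would cost more capacity than it saves latency.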
Keywords/Search Tags: Chip Multi-processor, Large Capacity Cache, Latency Reduction, Spilling, Mapping, Replication, Migration, Replacement