Cache Optimizations And Parallel Simulation For Multi-threaded Workloads

Posted on:2013-07-10

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y X Tang

Full Text:PDF

GTID:1228330377451836

Subject:Computer system architecture

Abstract/Summary:

Chip multiprocessors (CMPs) have emerged as the mainstream architecture for high-performance microprocessors because they are more scalable and cost-effective than single-core designs and have lower design complexity. Improving the cache hit ratio is important because much of the design complexity and the performance bottleneck of a CMP lie in the cache system, and the on-chip cache system has become an active research area for multi-core processors. In multiprogrammed environments, traditional cache optimization mechanisms can improve overall performance. For multithreaded applications, however, whether these mechanisms actually improve performance, and how to improve it, remains an open problem. This thesis focuses on cache-access simulation and the optimization of cache systems. Its contributions are as follows:

1. Cache optimization on tiled CMPs. As on-chip communication delays and the working sets of commercial and scientific workloads grow, the L2 caches of chip multiprocessors come under heavy pressure. There are two basic L2 cache designs: a shared L2 cache, which maximizes aggregate cache capacity and minimizes off-chip memory requests, and private L2 caches, which minimize cache access latency. Our experiments on a tiled architecture show that communication traffic is unevenly distributed across tiles and that utilization differs significantly from one L2 cache to another. Based on this observation, we propose a novel adaptive replication policy (ARP) for tiled shared caches, a mechanism that regularly checks workload behavior to control replication. ARP replicates a cache block only when the benefit of replication exceeds its cost. Simulations of 16-core CMPs show that ARP improves communication traffic, average access distance, and the utilization ratio of the aggregate L2 caches.

2. Utility-based cache optimization for multithreaded workloads. Under high resource demand, the commonly used LRU policy can cause interference among threads and degrade overall performance. Partitioning the shared cache is a relatively flexible way to allocate this resource, but most previous partitioning approaches target multiprogrammed workloads. They ignore the difference between the access patterns of shared and private data in multithreaded workloads, which lowers the utility of the shared data. We study the access characteristics of private and shared data in multithreaded workloads and propose a Utility-based Pseudo Partition (UPP) cache partitioning mechanism. UPP dynamically collects utility information for each thread and for the shared data, and uses the overall marginal utility as the metric for cache partitioning. In addition, UPP exploits both the frequency and the recency information of a workload, so that dead cache lines are evicted early and rarely reused blocks are filtered out through a dynamic insertion and promotion mechanism.

3. Multithreading applied to the simulation of cache accesses in CMPs. Simulators are widely used to evaluate the performance and cost of different CMP designs and to determine the best designs and configurations. However, simulation complexity increases as the number of processors in a CMP grows. Most traditional simulators are single-threaded, which makes them a computational bottleneck. In this work, we design and implement a parallel simulator module, ParaNSim. The module effectively reduces simulation time for large-scale cache simulation and supports large-scale NoC and CMP simulation. Our experiments show that parallel simulation significantly improves simulation speed.
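
The partitioning idea in contribution 2 can be illustrated with a minimal sketch. It assumes that each cache consumer (every thread, plus the shared data treated as one more consumer) has a measured hit curve giving the number of hits it would obtain with a given number of cache ways. The function name partition_ways, the hit_curves layout, and the sample numbers are hypothetical. The greedy loop is the standard marginal-utility allocation and is not necessarily the algorithm UPP uses; UPP's insertion and promotion mechanism is not modeled here.

```python
def partition_ways(hit_curves, total_ways):
    """Greedily split cache ways among consumers by marginal utility.

    hit_curves maps a consumer name to a list where entry w is the number
    of hits that consumer would see with w ways (w = 0..total_ways).
    Returns a dict mapping each consumer to its allocated number of ways.
    """
    alloc = {name: 0 for name in hit_curves}

    def marginal_gain(name):
        # Extra hits this consumer would get from one additional way.
        curve = hit_curves[name]
        ways = alloc[name]
        return curve[ways + 1] - curve[ways]

    for _ in range(total_ways):
        # Give the next way to whoever benefits most from it.
        best = max(alloc, key=marginal_gain)
        alloc[best] += 1
    return alloc


# Hypothetical hit curves for two threads and the shared data (4-way cache).
example_curves = {
    "thread0": [0, 50, 80, 90, 95],
    "thread1": [0, 10, 20, 30, 40],
    "shared": [0, 60, 100, 110, 115],
}
example_allocation = partition_ways(example_curves, total_ways=4)
```

With these sample curves the shared data and thread0 each receive two ways and thread1 receives none, because every additional way goes to the consumer with the largest remaining gain in hits.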

Keywords/Search Tags:

multicore, cache simulation, parallel simulation, cache hierarchy, cache replication, partitioning, replacement, dynamic set sampling

Related items

1	Adaptive Cache Management Policies For High Performance Microprocessors
2	A Design And Implementation Of Inter-Thread Cache Interference Elimination Structure Based On Cache Partitioning
3	An adaptive chip multiprocessor cache hierarchy
4	Cache Partitioning Policies On Chip Multi-processors For Scientific Applications
5	Multi-dimension And Multi-level Associated Cache Partitioning Mechanism In CMP
6	Study On Cache Partition Optimization Based On Non-stacked Cache Replacement Algorithm
7	Research On Memory Simulation And Optimizations In CMPs
8	Dynamic Cache Partitioning Method Based On Program Phase Behavior
9	Classification-based Prefetch-Aware Cache Partition Mechanism
10	The Application Of Cache Dynamic Tuning Method In Riad System