
Thread criticality and TLB enhancement techniques for chip multiprocessors

Posted on: 2011-10-19    Degree: Ph.D    Type: Thesis
University: Princeton University    Candidate: Bhattacharjee, Abhishek    Full Text: PDF
GTID: 2448390002462430    Subject: Engineering
Abstract/Summary:
Numerous technology trends, including debilitating power densities and rising verification costs, have recently prompted a shift to multicore or chip multiprocessor (CMP) architectures. Despite their benefits, CMPs face a number of design challenges. A central challenge is how best to architect the on-chip memory hierarchy, which plays a key role in determining system performance and power characteristics.

This thesis presents a top-down analysis, from the application level down to the microarchitectural layer, of the role of the on-chip memory hierarchy in determining the performance and power of emerging parallel workloads. The analysis shows that two primary sources of overhead in parallel program performance arise from imperfections in the on-chip memory hierarchy. The first is the variation in execution speed that the multiple threads of a parallel program experience. As this thesis will show, this difference in thread criticality results in performance and energy degradation. The second source of overhead arises from the fact that emerging parallel workloads tend to stress their Translation Lookaside Buffers (TLBs) significantly. As application working sets increase, modern TLBs experience notable miss rates, resulting in performance overheads.

Based on these observations, this thesis presents the first full-system characterization of the roles of thread criticality and TLB behavior in determining system performance. Using a combination of real-system profiling, full-system simulation, and FPGA-based emulation, this thesis characterizes the causes of thread criticality and increasing TLB pressure. First, this work shows that cache misses are the primary cause of differing thread speeds: threads that experience more cache misses run slower than their better-cached counterparts. Using this simple but powerful intuition, this thesis proposes thread criticality predictors with 93% accuracy.
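The cache-miss intuition behind criticality prediction can be sketched in a few lines. This is an illustrative software model, not the hardware predictor the thesis designs; the class name, counter layout, and miss weights are all assumptions chosen for clarity.

```python
# Illustrative sketch of a cache-miss-based thread criticality predictor:
# threads with more (and costlier) cache misses run slower and are therefore
# predicted to be more critical. Weights and structure are hypothetical.

class CriticalityPredictor:
    def __init__(self, l1_weight=1, l2_weight=10):
        # L2 misses stall a thread far longer than L1 misses, so they
        # receive a larger (illustrative) weight in the criticality score.
        self.l1_weight = l1_weight
        self.l2_weight = l2_weight
        self.counters = {}  # thread_id -> [l1_misses, l2_misses]

    def record_miss(self, thread_id, level):
        c = self.counters.setdefault(thread_id, [0, 0])
        c[0 if level == 1 else 1] += 1

    def score(self, thread_id):
        l1, l2 = self.counters.get(thread_id, (0, 0))
        return self.l1_weight * l1 + self.l2_weight * l2

    def most_critical(self):
        # The slowest (worst-cached) thread is predicted critical.
        return max(self.counters, key=self.score)
```

A runtime could poll such a score periodically and steer work or frequency toward whichever thread `most_critical()` reports.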
This thesis also explores the usefulness of these criticality predictors for various resource management techniques on CMPs. Second, this work characterizes the prevalence of TLB misses, showing that while parallel workloads experience high TLB miss rates, 30% to 95% of these misses are predictable. This predictability arises in two ways. First, multiple cores often TLB miss on the same translation. Second, cores often TLB miss on entries whose virtual pages lie a predictable stride from one another.

This thesis then builds upon this workload characterization by proposing techniques to improve the on-chip memory hierarchy. First, I show how cache-based thread criticality prediction can improve parallel program performance by off-loading work from critical to non-critical threads. Specifically, Intel TBB's task-stealing mechanism is augmented with criticality prediction to yield 21% average performance improvements. Second, this thesis shows that by estimating which threads are non-critical, and by how much, critical threads may be run at a high clock rate while the others are slowed down, achieving 15% average energy savings. While this thesis focuses on these specific applications, it also discusses the versatility of thread criticality prediction and how it may be applied in additional scenarios.

This thesis then uses the TLB characterization to propose TLB enhancement techniques. By leveraging the classes of predictable TLB misses, I propose and evaluate two techniques that use inter-core cooperation to eliminate TLB misses. First, I show the benefits of Inter-Core Cooperative (ICC) prefetching schemes, in which Leader-Follower prefetching exploits TLB misses experienced by multiple cores while Distance-based Cross-Core prefetching captures regular inter-core strides. Combined, the ICC prefetching techniques eliminate 19% to 90% of system TLB misses.
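The first class of predictable misses, where multiple cores miss on the same translation, is what Leader-Follower prefetching exploits. The toy model below captures only that core idea, under simplifying assumptions (unbounded buffers, no coherence or eviction); all names are illustrative, not the thesis's actual microarchitecture.

```python
# Simplified model of Leader-Follower inter-core TLB prefetching: when one
# core (the leader) takes a TLB miss and fills a translation, it pushes that
# translation into the other cores' prefetch buffers, so followers that would
# have missed on the same page hit in their buffer instead.

class Core:
    def __init__(self):
        self.tlb = {}           # vpage -> ppage (private TLB, unbounded here)
        self.prefetch_buf = {}  # translations pushed by other cores

class LeaderFollowerSystem:
    def __init__(self, num_cores):
        self.cores = [Core() for _ in range(num_cores)]

    def translate(self, core_id, vpage, page_table):
        core = self.cores[core_id]
        if vpage in core.tlb:
            return core.tlb[vpage], "hit"
        if vpage in core.prefetch_buf:
            # Miss avoided: the leader already fetched this translation.
            core.tlb[vpage] = core.prefetch_buf.pop(vpage)
            return core.tlb[vpage], "prefetch_hit"
        ppage = page_table[vpage]  # true miss: model a page-table walk
        core.tlb[vpage] = ppage
        # Leader pushes the new translation to all other cores.
        for other_id, other in enumerate(self.cores):
            if other_id != core_id and vpage not in other.tlb:
                other.prefetch_buf[vpage] = ppage
        return ppage, "miss"
```

Distance-based Cross-Core prefetching would extend this idea by predicting translations a fixed virtual-page stride away, rather than pushing identical entries.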
I then propose an alternative to ICC prefetching, Shared Last-Level (SLL) TLBs, which eliminate 7% to 79% of system TLB misses.

Overall, this thesis is the first to show the importance of thread criticality and TLB enhancement techniques for parallel programs on CMPs. Moreover, as CMPs grow in core count, heterogeneity, and application memory footprint, these techniques will be essential in apportioning system resources intelligently among multiple contending threads.
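The SLL TLB idea can be modeled in the same simplified style: private per-core TLBs back into one last-level TLB shared by all cores, so a translation brought in by one core can satisfy later misses from any other core. Again, this is a rough sketch under stated assumptions (unbounded tables, no replacement policy), not the thesis's hardware design.

```python
# Rough model of a Shared Last-Level (SLL) TLB: on a private (L1) TLB miss,
# a core probes the shared structure before walking the page table, so
# translations fetched by one core avoid full misses on the others.

class SLLTLBSystem:
    def __init__(self, num_cores):
        self.l1 = [dict() for _ in range(num_cores)]  # private per-core TLBs
        self.sll = {}                                 # TLB shared by all cores

    def translate(self, core_id, vpage, page_table):
        if vpage in self.l1[core_id]:
            return self.l1[core_id][vpage], "l1_hit"
        if vpage in self.sll:
            # Shared hit: another core already paid for the page-table walk.
            self.l1[core_id][vpage] = self.sll[vpage]
            return self.sll[vpage], "sll_hit"
        ppage = page_table[vpage]  # full miss: model a page-table walk
        self.sll[vpage] = ppage
        self.l1[core_id][vpage] = ppage
        return ppage, "miss"
```

Unlike ICC prefetching, nothing is pushed speculatively; sharing happens passively on demand, which is why the two approaches cover different fractions of misses.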
Keywords/Search Tags: TLB, Thread, Thesis, On-chip memory hierarchy, System, Performance, Multiple