
Effective performance analysis and optimizations for memory intensive programs on multicore

Posted on: 2011-10-21    Degree: Ph.D    Type: Dissertation
University: Purdue University    Candidate: Liu, Lixia    Full Text: PDF
GTID: 1448390002455509    Subject: Computer Science
Memory bandwidth has become the performance bottleneck for memory-intensive programs on modern processors due to the increasing gap between on-chip CPU speed and off-chip memory bandwidth, and the emergence of multicore chips has widened this gap. Understanding a program's memory bandwidth requirement and its performance impact is critical for performance tuning, performance prediction, and optimization strategies. This dissertation proposes a new methodology for analyzing the memory bandwidth requirement and the performance of memory-intensive programs on multicore machines. The methodology is based on two notions. The zero-latency instruction issue rate quantifies the ideal execution speed of a program, or of different parts of a program, under no memory bandwidth constraint. The memory access intensity quantifies the memory bandwidth requirement of a program or of different parts of a program, and can then be used to identify the memory bottlenecks in a program. Building on these two notions, an analytical prediction model is developed to estimate program performance. Experiments show that our new methodology accurately identifies the memory bottlenecks and predicts the performance trend for a suite of test programs.

The memory bottlenecks, once identified, may be corrected through appropriate program transformations. Unfortunately, certain transformations may introduce overheads such as additional instructions and reduced parallelism. In this dissertation, we study such complex interactions between instruction count, parallelism, and memory behavior in the context of two program transformations.

First, we discuss how to balance the trade-off between instruction count and memory access intensity for effective compression of integer and pointer arrays. Three encoding methods are developed for our compression scheme, and we derive formulae to analyze the cost and benefit of each encoding method.
We implement our compression scheme in a compiler that automatically transforms the program into different versions corresponding to the different encoding methods. Results show that our compiler-automated adaptive scheme improves execution speed over the original version by an average of 9% for a set of benchmark programs; the adaptive decisions take compression overheads into account.

Next, we discuss the use of carefully controlled asynchronous algorithms to attain data locality and parallelism simultaneously for a class of numerical programs. We show that carefully controlled asynchrony combined with loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to improved data locality and loop-level parallelism. Evidence from experiments with three well-known numerical kernels is presented.
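The two analysis notions above can be illustrated with a small sketch. This is not the dissertation's actual model, only a hedged, roofline-style approximation under assumed machine parameters: execution time is bounded below by both the compute time implied by the zero-latency issue rate and the transfer time implied by peak memory bandwidth, and a region is bandwidth-bound when its memory demand at ideal speed exceeds what the machine can supply. All function names and numbers are illustrative.

```python
# Hedged sketch of bandwidth-bound performance prediction, in the spirit of
# the abstract's "zero-latency issue rate" and "memory access intensity".
# Parameter values are invented for illustration.

def predicted_time(instructions, bytes_transferred,
                   issue_rate_ips, bandwidth_bps):
    """Estimate execution time (seconds) for a code region.

    compute_time: ideal time with unlimited bandwidth
                  (instructions / zero-latency issue rate).
    memory_time:  time to move the required bytes at peak bandwidth.
    The region can finish no faster than either bound, so take the max.
    """
    compute_time = instructions / issue_rate_ips
    memory_time = bytes_transferred / bandwidth_bps
    return max(compute_time, memory_time)


def is_memory_bottleneck(instructions, bytes_transferred,
                         issue_rate_ips, bandwidth_bps):
    """A region is a memory bottleneck when the bandwidth it would demand
    at its ideal (zero-latency) speed exceeds the machine's peak."""
    ideal_time = instructions / issue_rate_ips
    demanded_bandwidth = bytes_transferred / ideal_time
    return demanded_bandwidth > bandwidth_bps


# Example: 1e9 instructions streaming 8e9 bytes on a core issuing
# 4e9 instructions/s with 10e9 bytes/s of memory bandwidth.
t = predicted_time(1e9, 8e9, 4e9, 10e9)            # max(0.25, 0.8) = 0.8 s
bound = is_memory_bottleneck(1e9, 8e9, 4e9, 10e9)  # True: demands 32 GB/s
```

Taking the maximum of the two bounds is the standard simplification; a finer model would account for overlap between computation and memory traffic.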
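The loop-tiling side of the second transformation can be sketched as follows. This is a minimal, hypothetical example, not the dissertation's kernels: a 1-D Jacobi-style sweep with its loop tiled so each tile's reads stay within a small window of the array, improving cache locality while producing bit-identical results. The controlled asynchrony (relaxing synchronization between iterations) is not modeled here.

```python
# Hedged sketch: loop tiling on a 1-D Jacobi-style averaging sweep.
# Illustrates the data-locality idea only; names are invented.

def jacobi_sweep(a):
    """One synchronous sweep: each interior point becomes the average of
    its neighbors, read from the previous iterate `a`."""
    b = a[:]
    for i in range(1, len(a) - 1):
        b[i] = 0.5 * (a[i - 1] + a[i + 1])
    return b


def jacobi_sweep_tiled(a, tile=4):
    """The same sweep with the i-loop tiled. Each tile touches only a small
    window of `a`, so its working set can fit in cache; the result is
    identical because every update still reads the old array `a`."""
    b = a[:]
    n = len(a)
    for start in range(1, n - 1, tile):
        for i in range(start, min(start + tile, n - 1)):
            b[i] = 0.5 * (a[i - 1] + a[i + 1])
    return b


x = [float(i * i) for i in range(16)]
assert jacobi_sweep(x) == jacobi_sweep_tiled(x)  # tiling preserves results
```

In a multicore setting, tiles can additionally be assigned to different cores; keeping results correct then requires the carefully controlled synchronization that the dissertation studies.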
Keywords/Search Tags:Memory, Performance, Multicore, Parallelism