
Performance Modeling and Characterization of Multicore Computing Systems

Posted on: 2014-01-06    Degree: Ph.D.    Type: Dissertation
University: North Carolina State University    Candidate: Krishna, Anil    Full Text: PDF
GTID: 1458390008955590    Subject: Engineering
Abstract/Summary:
The exponential growth in transistor density suggests a similarly exponential growth in the number of on-chip computational contexts. However, bandwidth to off-chip memory is pin-limited and therefore does not grow at the same exponential rate as transistor density. This leads to a situation referred to as the bandwidth wall problem. The first goal of this dissertation is to quantify the extent to which the bandwidth wall problem affects multi-core scaling, and the extent to which various bandwidth conservation techniques are able to mitigate this problem. We develop a simple but powerful analytical model to predict the number of on-chip cores that a multi-core chip can support in the presence of the bandwidth wall. This model confirms that the bandwidth wall can severely limit core scaling if additional bandwidth conservation techniques are not employed. Further analysis with this model identifies bandwidth conservation techniques that can sustain multi-core scaling for the next several technology generations.

The continued increase in on-chip core counts makes such processor chips attractive for running parallel, multi-threaded workloads. Multi-threaded workloads can share a non-trivial fraction of their instructions and data between threads. The second goal of this dissertation is to study the impact of such sharing on chip design. We propose a methodology that quantifies the reduction in on-chip cache miss rate that is solely attributable to the presence of data sharing. We incorporate the impact of data sharing in contemporary multi-threaded benchmarks into an analytical model that projects multi-core chip performance. We find that the optimal design point for a multi-core chip is substantially different when the impact of data sharing is considered.

There remains an abundance of transistors per chip; however, transistor power density is growing with each technology generation. This trend is forcing the industry to pay particular attention to the power-performance tradeoff and is encouraging hardware specialization. Instead of integrating more and more identical general-purpose cores on a chip, specialized computation engines called hardware accelerators are being integrated on the chip. A key chip-design challenge in this space is balancing general-purpose computation capability with hardware acceleration of selected functions, while supporting a convenient programming model. A third goal of this dissertation is to evaluate the architectural considerations, design choices, and performance potential of hardware acceleration. To this end, we perform an in-depth study of IBM's PowerEN processor, one of the first multi-core chips to integrate programmable hardware accelerators alongside general-purpose cores. We find that hardware acceleration has the potential to improve throughput by orders of magnitude for some applications. We also find that a coherent shared-memory architecture is an important part of enabling accelerators to be easily accessed from user-level code running on the general-purpose cores.

Recent research makes a case for adding heterogeneity across the general-purpose cores in a multi-core chip. Heterogeneity can help improve overall chip performance as well as its energy efficiency. However, evaluating the corresponding chip design space is a major challenge. The number of unique designs in the design space grows with the number of cores and with the type and granularity of the heterogeneity. Detailed cycle-accurate simulation is often far too slow to exhaustively evaluate the design choices. Added to this is the complication of evaluating the performance of each design across the many possible application schedules. Even a single mix of applications can lead to a combinatorially scaling set of static application-to-core mappings as the number of non-identical cores scales linearly. A fourth goal of this dissertation is to develop a fast analytical modeling framework, ReSHAPE, that estimates chip-level performance for a heterogeneous multi-core chip, while allowing for shared cache configurations and accounting for the impact of limited memory bandwidth on chip performance. Being an analytical model at its core, ReSHAPE achieves orders-of-magnitude speedup over instruction-driven timing simulation. We validate ReSHAPE's accuracy against a timing-approximate full-system simulator. We use ReSHAPE to study several interesting chip-level optimization questions. For example, we find that as the number of cores on a chip increases, the potential benefits from heterogeneity can increase, provided there is a good application-to-core mapper. Increasing the granularity of heterogeneity (the number of unique core and cache sizes), however, does not provide additional benefit beyond a point. (Abstract shortened by UMI.)
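To make the bandwidth wall argument concrete, the following is a minimal sketch, not the dissertation's actual model, of how a pin-limited bandwidth budget caps the number of cores that can be kept busy. The parameter names (bw_gb_per_s, ipc, freq_ghz, miss_rate, line_bytes) and the numbers in the usage example are illustrative assumptions, not values from the dissertation.

# Toy illustration of a bandwidth-wall estimate (an assumption, not the
# dissertation's model). Each core is assumed to issue `miss_rate` off-chip
# accesses per instruction, each access transfers `line_bytes` bytes, and a
# busy core sustains `ipc * freq_ghz` billion instructions per second.
def supportable_cores(bw_gb_per_s, ipc, freq_ghz, miss_rate, line_bytes=64):
    # Off-chip traffic generated by one fully busy core, in GB/s.
    per_core_traffic = ipc * freq_ghz * miss_rate * line_bytes
    # Core count at which aggregate traffic saturates the pin-limited budget.
    return int(bw_gb_per_s // per_core_traffic)

if __name__ == "__main__":
    baseline = supportable_cores(bw_gb_per_s=100, ipc=1.0, freq_ghz=3.0,
                                 miss_rate=0.01)
    # A bandwidth-conservation technique (e.g., a larger cache or link
    # compression) that halves per-core off-chip traffic doubles the count.
    conserved = supportable_cores(bw_gb_per_s=100, ipc=1.0, freq_ghz=3.0,
                                  miss_rate=0.005)
    print(baseline, conserved)  # 52 -> 104 under these made-up numbers

Under these assumed numbers, halving per-core off-chip traffic doubles the supportable core count, which is the kind of trade-off the dissertation's model quantifies in far greater detail.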
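The data-sharing effect can be illustrated with a toy calculation. The comparison below, combined unique cache-line footprint of co-running threads versus the sum of their private footprints, is only one plausible way to isolate the sharing effect; it is an assumption for illustration, not the methodology proposed in the dissertation, and the traces are hypothetical.

# Hedged sketch: estimate how much cache pressure data sharing removes by
# comparing the combined unique footprint of co-running threads against the
# sum of their private footprints (hypothetical cache-line traces).
def footprint(trace):
    return len(set(trace))

threads = [
    [0x100, 0x140, 0x180, 0x1c0],   # thread 0: cache-line addresses touched
    [0x100, 0x140, 0x200, 0x240],   # thread 1: shares two lines with thread 0
]
private = sum(footprint(t) for t in threads)                 # 8 lines if run in isolation
shared = footprint([line for t in threads for line in t])    # 6 distinct lines when co-run
print(1 - shared / private)  # fraction of footprint removed by sharing: 0.25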
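The combinatorial growth of static application-to-core mappings can also be shown directly. The counting function below assumes one application per core, distinct applications, and interchangeable cores within a type; these are illustrative assumptions rather than part of ReSHAPE, and the core-type names are made up.

# Toy illustration (not ReSHAPE itself): count the distinct static
# application-to-core mappings when cores of the same type are
# interchangeable but the applications in the mix are all distinct.
from math import factorial

def distinct_mappings(cores_per_type):
    # cores_per_type: e.g. {"big": 2, "little": 6}, one application per core.
    n_apps = sum(cores_per_type.values())
    count = factorial(n_apps)
    for c in cores_per_type.values():
        count //= factorial(c)   # cores within a type are identical
    return count

# Doubling the core count (and the application-mix size) grows the mapping
# space combinatorially, which is why exhaustively simulating every schedule
# with cycle-accurate simulation quickly becomes impractical.
print(distinct_mappings({"big": 2, "little": 6}))    # 28
print(distinct_mappings({"big": 4, "little": 12}))   # 1820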
Keywords/Search Tags: Chip, Core, Performance, Bandwidth, Model