On Performance Optimization And Evaluation For Multicore Memory Systems

Posted on: 2015-08-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z M Fang
GTID: 1108330464955438
Subject: Computer system architecture

Abstract/Summary:
With the advent of multicore and many-core processors, the latency and bandwidth gap between the processor and memory continues to grow. To bridge this gap, the cache memory system has become one of the most important components in multicore and many-core architecture design, and it has grown increasingly complex as a result. Because the cache memory system is key for applications to achieve high performance, performance optimization and evaluation of multicore and many-core cache memory systems has become an active topic in computer architecture research.

Prior work on multicore cache memory systems has focused on four aspects. The first is software optimization on existing multicore cache memory systems, including cache locality optimization and data prefetching. The second is performance evaluation of existing multicore cache memory systems, which supports those software optimizations. The third is hardware optimization of multicore cache memory systems, which focuses on how to design and organize the cache memory system to improve performance. The fourth is performance evaluation of hardware designs for multicore cache memory systems, which relies heavily on multicore (cache) simulators.

According to our observations, prior performance optimization and evaluation work on multicore cache memory systems has several limitations. First, prior work on performance evaluation of existing many-core memory systems mainly relied on micro-benchmarks based on long elapsed time events (LETE) and evaluated only cache memory latency and bandwidth. It did not measure micro-architectural details related to data prefetching. Moreover, those LETE-based micro-benchmarks did not analyze the interfering factors that could affect the intended micro-benchmark behavior.
Second, existing software data prefetching work did not consider coordination between the levels of a multi-level cache hierarchy. With the advent of multicore architectures, many new micro-architectural features have been developed, and new hardware limitations on software prefetching apply. As a result, to achieve high performance, software data prefetching has to be applied in a coordinated way across the cache levels. Finally, existing multicore (cache) simulators usually use a tightly-coupled design, which not only makes them difficult to extend with new features but also limits simulator performance. Few efforts, however, have tried to make multicore simulators more extensible.

Based on the above analysis, this dissertation presents 1) a novel micro-benchmarking methodology to measure more comprehensive micro-architectural details of existing many-core memory systems; 2) a novel multi-stage coordinated software data prefetching algorithm for multi-level cache hierarchies; 3) a novel loosely-coupled, extensible, cycle-accurate multicore simulator. In summary, this dissertation makes the following contributions.

First, for existing many-core processors, we propose a novel micro-benchmarking methodology based on short elapsed time events (SETE) to measure a comprehensive list of memory micro-architectural details, especially the software and hardware data prefetching parameters not considered in past studies. Within this methodology, we present the first comprehensive analysis of the interfering factors that could affect the intended micro-benchmark behavior, together with a set of design guidelines to precisely control and mitigate those factors. Using the proposed methodology, we measure many undocumented micro-architectural details of the Intel Xeon Phi memory system, especially prefetching related parameters.
Based on the measured data, we further provide a number of insights into effective software and hardware prefetching on many-core architectures, such as multi-stage coordinated software prefetching.

Second, for the multi-level cache hierarchy on existing many-core processors, we propose a novel multi-stage coordinated software data prefetching algorithm, which brings data from memory to the L1 cache in stages, cognizant of resource availability at each level of the cache hierarchy. Moreover, we consider the interaction of our multi-stage coordinated data prefetching with simultaneous multi-threading (SMT) and with other software cache locality optimizations such as loop tiling. We have implemented the algorithm in ROSE, an open-source source-to-source compiler. Experimental results show that, on average, we achieve 1.55X and 1.3X speedups over the hardware prefetchers on the Intel Xeon Phi many-core processor and over the state-of-the-art Intel ICC compiler, respectively.

Finally, for performance evaluation of hardware design and optimization of multicore cache memory systems, we design and implement a loosely-coupled, extensible, and cycle-accurate multicore (cache) simulator infrastructure called Transformer. In Transformer, we present the first comprehensive analysis of cycle-inaccurate factors in a loosely-coupled multicore simulator and propose lightweight solutions to detect and revise those factors. To further improve the extensibility of Transformer, we design two architecture-independent interfaces: 1) a communication interface between the functional simulator and the timing simulator; 2) an application extension library interface for System-on-Chip (SoC) extension.
To demonstrate the extensibility of Transformer, we couple a widely used functional simulator, QEMU, with a widely used timing simulator, GEMS (used especially for cache memory systems), which took only two man-months. We also add FPGA-based IP core simulation for SoC simulation, which requires only configuring the application extension library interface. As a result, Transformer provides an easy-to-use platform for future work on hardware optimizations and hardware/software collaborative optimizations for multicore cache memory systems.
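The loosely-coupled split between a functional simulator and a timing simulator described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `MemEvent`, `FunctionalModel`, `TimingModel`, and `simulate`, the direct-mapped cache, and the fixed hit and miss latencies are hypothetical and are not taken from Transformer, QEMU, or GEMS.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MemEvent:
    """One memory access emitted by the functional side (hypothetical format)."""
    core: int
    addr: int
    is_write: bool


class FunctionalModel:
    """Stand-in for a functional simulator: it replays accesses, with no notion of time."""

    def __init__(self, accesses):
        self._accesses = list(accesses)

    def run(self, sink):
        # The only coupling point: events are appended to a shared queue.
        for event in self._accesses:
            sink.append(event)


class TimingModel:
    """Stand-in for a timing simulator: a direct-mapped cache with fixed latencies."""

    def __init__(self, num_lines=4, line_bytes=64, hit_cycles=1, miss_cycles=100):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.tags = [None] * num_lines
        self.cycles = 0
        self.hits = 0
        self.misses = 0

    def consume(self, source):
        # Drain the queue and charge a latency for every access.
        while source:
            event = source.popleft()
            line = event.addr // self.line_bytes
            index = line % self.num_lines
            if self.tags[index] == line:
                self.hits += 1
                self.cycles += self.hit_cycles
            else:
                self.misses += 1
                self.cycles += self.miss_cycles
                self.tags[index] = line  # fill the line, evicting the old one


def simulate(accesses, **timing_kwargs):
    """Run the functional side, then the timing side, through one event queue."""
    queue = deque()
    FunctionalModel(accesses).run(queue)
    timing = TimingModel(**timing_kwargs)
    timing.consume(queue)
    return timing


if __name__ == "__main__":
    trace = [MemEvent(0, a, False) for a in (0, 8, 64, 0, 256)]
    result = simulate(trace)
    print(result.hits, result.misses, result.cycles)
```

Because the two sides share only the event queue, either one could in principle be swapped out without touching the other, which is the extensibility argument in miniature. This toy feeds nothing from the timing side back into execution order, so it does not address the cycle-inaccurate factors that the dissertation analyzes.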
Keywords/Search Tags: Multicore Caches, Many-core Caches, Micro-benchmarking Methodology, Multi-stage Coordinated Data Prefetching, Extensible Multicore Simulator