
Design and Implementation of a Highly Scalable and Portable Manycore Full-System Simulator

Posted on: 2012-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z G Wang
Full Text: PDF
GTID: 2218330335498578
Subject: Computer software and theory
Abstract/Summary:
A full-system emulator emulates an entire hardware platform, which makes it useful for system software development, debugging, trace generation and statistical analysis. The advance of multi-core computing creates both tremendous opportunities and challenges for full-system emulators. On one hand, the abundant physical cores provide more resources for an emulator to harness. On the other hand, the rapidly increasing number of emulated cores requires a full-system emulator to be scalable and to handle inputs of realistic scale. It must also be able to exhibit the concurrent behavior of a workload, including concurrency problems such as data races. Unfortunately, commodity full-system emulators emulate a multicore sequentially on a single physical core in a round-robin fashion. They therefore cannot fully harness the abundant physical resources, which results in poor performance scalability. Moreover, sequential emulators always schedule emulated cores at a coarse granularity, such as the basic block, which limits parallelism and sacrifices emulation fidelity. This is an especially critical problem when the emulator is used to find concurrency bugs.

However, building a parallel full-system emulator is usually resource-intensive and takes years to mature. Full-system emulators, unlike user-mode emulators, need to model the system aspects of a computing platform, including the system ISA, address translation, privilege levels, interrupts and a set of devices. Building a portable emulator is even harder because both the user ISA and the system ISA differ dramatically among architectures.

This thesis proposes a new organizing strategy for parallel emulation. The key observation is that CPU cores and devices in current (and likely future) multiprocessors and multicores are loosely coupled and communicate through well-defined interfaces. For example, each core has its own register file, control logic and separate cache.
Each core independently executes the instruction stream assigned to it, and the communication channels between cores are well defined, such as the Inter-Processor Interrupt (IPI). Such an organization separates building fine-tuned sequential emulators from parallelizing them efficiently, and thus decreases the complexity of building a parallel full-system emulator. We cluster sequential emulators to build a parallel one: each sequential emulator runs as a separate thread, and a thin layer provides efficient synchronization and communication. Though simple, this strategy proves beneficial in addressing scalability, fidelity, portability and dynamic load balancing. By incorporating support for lightweight transactions, nonblocking data structures and algorithms that balance load among cores, we can construct a fast and scalable parallel emulator while reusing most of the mature sequential emulation blocks.

Our first complete prototype system, COREMU, is based on QEMU with only 2500 LOC of modifications. COREMU fully supports emulating up to 255 cores of the x64 architecture and 4 cores of the ARM architecture, and can emulate the whole software stack with practical performance. To help programmers find problems in concurrent applications, we also support a smart watchpoint mechanism and a memory trace collection mechanism. Using SPECINT-2000, we show that COREMU has negligible uniprocessor emulation overhead, within 1%. COREMU also scales much better than QEMU. We evaluate it with a dozen varied benchmarks. The results show that it scales well when the number of emulated cores does not exceed the number of physical cores on the host machine. When the number of emulated cores exceeds the number of physical cores, the performance degradation is still acceptable, and COREMU can emulate up to 255 virtual cores with reasonable performance.
In contrast, QEMU times out when emulating more than 16 cores due to cache thrashing or contention. COREMU achieves a speedup of 20X to 67X when 16 virtual cores are emulated on a machine with four quad-core processors.
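
The organizing strategy described in the abstract (one sequential emulator per host thread, with a thin layer for communication between cores) can be illustrated with a minimal sketch. This is not COREMU's code: the names `EmulatedCore`, `run_machine` and `IPI_VECTOR` are hypothetical, the "emulation" is a counter, and Python's lock-based `queue.Queue` merely stands in for the nonblocking data structures the abstract mentions. It shows the structure only, not the performance behavior.

```python
import queue
import threading

IPI_VECTOR = 0x20  # hypothetical interrupt vector number, chosen for illustration


class EmulatedCore(threading.Thread):
    """One sequential emulator instance running as its own host thread."""

    def __init__(self, core_id, steps):
        super().__init__()
        self.core_id = core_id
        self.steps = steps            # number of stand-in "basic blocks" to run
        self.executed = 0             # private per-core state, like a register file
        self.mailbox = queue.Queue()  # thread-safe channel standing in for IPI delivery
        self.received = []            # (sender_id, vector) pairs handled by this core
        self.peers = []               # filled in by run_machine before start()

    def send_ipi(self, target, vector):
        # The only cross-core interaction: post a message to the target's mailbox.
        target.mailbox.put((self.core_id, vector))

    def drain_interrupts(self):
        # Handle every pending interrupt without blocking the emulation loop.
        while True:
            try:
                self.received.append(self.mailbox.get_nowait())
            except queue.Empty:
                return

    def run(self):
        # Each core sends one IPI to its neighbour, then emulates independently.
        target = self.peers[(self.core_id + 1) % len(self.peers)]
        self.send_ipi(target, IPI_VECTOR)
        for _ in range(self.steps):
            self.executed += 1        # stand-in for emulating one block of guest code
            self.drain_interrupts()


def run_machine(num_cores, steps):
    """Cluster num_cores sequential emulators into one parallel machine."""
    cores = [EmulatedCore(i, steps) for i in range(num_cores)]
    for core in cores:
        core.peers = cores
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    for core in cores:
        core.drain_interrupts()       # collect IPIs that arrived after a core finished
    return cores
```

Calling `run_machine(4, 100)` returns four cores that have each executed 100 steps and each received one interrupt from its predecessor. The point of the design is that each core's state stays private to its thread, so the sequential emulation loop needs no changes beyond polling its mailbox.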
Keywords/Search Tags: Full-system Emulator, Parallel Emulator, Multicore