Font Size: a A A

Improving The Scalability Of Deterministic Replay

Posted on:2015-05-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y F ChenFull Text:PDF
GTID:1108330464455364Subject:Computer system architecture
Abstract/Summary:Request the full-text of this thesis
Processors are getting more and more cores, more software will start to embrace concur-rency to utilize the power of multi-core processors. Currently, the main stream concur-rent programming model is based on threads and uses locks for synchronization. But this programming model is easy to make mistakes. Researches have found that con-currency bugs are common in even mature open source software and they are hard to reproduce, debug and fix. Finding methods to reproduce concurrency bugs has great research value and can benefit real world concurrent software development.Record and replay the non-determinism from both the computer system itself and the outside world is one way to reproduce the execution of a system. Prior researches have done comprehensive work on record and replay for single core systems. For multi-core systems, shared memory accesses add great amount of non-determinism, thus we need deterministic replay to faithfully reproduce the execution. Recording shared mem-ory access order is one of the most difficult challenges for deterministic replay. Some prior work supports the recording of shared memory access through hardware changes, but these methods can not be used on commodity hardware. While replay systems based on commodity hardware have scalability problems, which may decrease system’s per-formance as the number of cores increases.Based on comprehensive analysis on prior work, we proposed an scalable al-gorithm to record shared memory access using commodity hardware. We imple-mented both full system and application deterministic replay with full system emulator COREMU and binary translation tool DynamoRIO. Performance evaluation using par-allel application benchmarks shows that our replay system has good scalability. In order to further improve performance, we also explored new ways to record shared memory access using hardware transactional memory.Specifically, this thesis makes the following contributions:1. We analyze the cause of scalability problems of prior work, propose a scalable algorithm to record shared memory access, which can be implemented on com-modity hardware. Our algorithm uses shared object version to serialize all write operation, and uses a clever method to record write-after-read order. The algo-rithm records precise ordering, requires only shared object version and core local information when taking log and the critical section is also very short. These attributes makes the algorithm possible to scale.2. ReEmu is the first attempt to provide deterministic replay capability to parallel full-system emulators. We make a comprehensive analysis of non-determinism in full-system emulators and then implement deterministic replay based parallel full-system emulator COREMU. ReEmu implements our proposed shared memory access recording algorithm using a technique similar to seqlock and tries lock clustering optimization to improve performance. Performance evaluation using PARSEC benchmark shows that ReEmu has only 68.9% performance overhead on average (ranging from 51.8% to 94.7%) when emulating 16 cores compared to COREMU. It also shows good scalability emulating from 1 to 16 virtual cores. With the good support of cross architecture emulation support of COREMU, we also implemented deterministic replay for ARM system.3. We also implement an application replay tool Dr. Replay based on DynamoRIO, which is a binary translation tool useful for instrumentation. Dr. Replay record and replay each system call based on its specific semantics. It also records partial order of system calls to aid debugging. We pointed out the challenges of record-ing implicit interactions between applications and operating systems. Evaluation results on PARSEC benchmark shows good scalability when recording using 1-16 cores. The average runtime is 14.8 times (ranging from 11.6x to 19.8x) over native execution, which is better than another binary translation based replay tool PinPlay.4. We make the first attempt of using hardware transactional memory to record shared memory access and proposed an algorithm based on transaction commit order. We also analyze the applicable scenarios of Intel Haswell’s transactional memory extension. Through evaluation, we discover that a simple application of HLE to our shared memory access recording implementation will not gain perfor-mance improvement.
Keywords/Search Tags:deterministic replay, full-system emulator, binary translation, hardware transactional memory
Request the full-text of this thesis
Related items