Font Size: a A A

Research On Memory Race Recording Mechanism For Deterministic Multiprocessor Replay

Posted on:2014-05-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:S X ZhuFull Text:PDF
GTID:1268330392972677Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
With the popularity of chip multiprocessors, shared memory multithreadedprogramming has become an important means to explore the parallel performanceof multi-core systems. However, running the same multithreaded program on thesame multi-core machine several times can potentially lead to different outcomesfor each run. This nondeterministic behavior poses many challenges for the writing,debugging and application of multithreaded programs.Two-phase multithreaded deterministic record-replay is an effective approachto solve these difficulties through reproducing multithreaded program’s execution.In the recording phase, nondeterministic information is recorded into logs toenable the second phase to deterministically replay the recorded execution. As themain source of nondeterministic information, memory races are frequent andconsume high performance overhead in recording. So memory race recordingbecomes the key technology to achieve deterministic replay. To challenge priormemory race recording mechanisms, this paper strives to study furter in memorylog size, hardware consumption, bandwidth overhead and replay speed.This paper proposes a chunk-based memory race recording mechanism. Thepoint-to-point memory race logging approach has advantage in replay speed andhas better application prospects, but it consumes too much hardware. So this paperstudies point-to-point logging style with low hardware overhead and achunk-based memory race recording mechanism is proposed. This mechanism usescurrent dependency which is constituted by current instruction counts of theconfliting threads to present a memory race, doesn’t need to store the exactinstruction counts corresponding conflicting instructions, and reduces the numberof conflicts to be recorded. In memory race detection and recording, a chunk-basedtransitive reduction algorithm is introduced. An approximate maximum timestampalgorithm is used to detect the conflicts of the uncached memory blocks. In itshardware implementation, one chunk count is stored as the logic timestamp foreach cache block and hardware consumption is reduced greatly. To optimizememory race log further, it uses an interval instruction count instead of theinstructions count. Simulation results show that low hardware comsumption, smallmemory race log and small bandwidth overhead are achieved.This paper proposes a cyclic memory race recording mechanism basedsignature. In order to support memory race recorder, the original cache structure ismodified in prior mechanisms. This modification can introduce design challenges for multiprocessors. So a cyclic memory race recording mechanism basedsignature is proposed. This mechanism uses signature which is an efficienthardware implementation and can achieve better time and space efficiency todetect memory conflict. It adds one read signature and one write signature for eachchunk and detects conflict through the adding and test operation on signatureinstead of the read and write operation on cache. In its hardware implementation,each processor only adds several signatures with small size. Hardwareconsumption is reduced further for the point-to-point logging approach and theoriginal cache need no change. On this basis, a cyclic dependency is proposed toreduce the the continuous current dependencis with the same direction andoptimzes the memory race log on the basis of replaying programs deterministicly.In its hardware implementation, a direction detection mechanism with smallhardware detects continuous current dependencies with the same direction andcyclic dependencies are logged into memory race log. Simulation results show thathardware consumption, memory race log and bandwidth overhead have beenimproved greatly compared with prior point-to-point mechanisms.This paper proposes a synchronization aware memory race recordingmechanism. Not all conflicts are memory races and not all memory races affectdeterministic replay. If those conflicts which have no effect on replay are filteredout in recording phase, memory race log will be reduced significantly. So asynchronization aware memory race recording mechanism is proposed. In thismechanism, the impact on deterministic replay of the conflicts introduced bysynchronization operations is analyzed offline, and they are divided into harmfulsynchronization conflicts which affect dterministic replay and benignsynchronization conflicts which don’t affect dterministic replay. By adding twonew instructions, it can identify two synchronization operations including lock andbarrier, and prohibit the memory addresses which introduce benignsynchronization onflicts add into signatures in conflict detection stage. So thissynchronization aware recording mechanism filters out the benign synchronizationconflicts from memory race log. Simulation results show that memory race log isreduced significantly and small bandwidth overhead is also achieved.This paper proposes a separate memory race recording mechanism. Replayspeed is a key aspect to measure a memory race recorder. However prior parallelreplay is low-efficiency. In order to improve parallel replay speed, this paperproposes a separate memory race recording mechanism. In this mechanism, currentdependencies are recorded separately. The reply in coherence protocol records thepredecessors of current dependencies and the requestor records the successors. Only a conflict bit is appended on the coherence response message, and thecommunication overhead is reduced significantly in recording phase. At the sametime, it introduces an adaptive log format with two sizes. Proccessor can select alog format with appropriate size to record predecessors and successors, and thesize of each log is reduced. More importantly, in this separate logging method, thepredcessors can create and send wake-up messages for successors purposely, andthe efficiency of replay is improved significantly in replay phase. Simulationresults show that the replay speed, memory race log and bandwidth overhead allbehave better.The four memory race recording mechanisms proposed in this paper improvethe performance of point-to-point logging approach in the following five aspects:the memory race log, hardware consumption, bandwidth overhead, replay speedand applicability. They also have obvious advantages compared to other recordersin the following aspects: memory race log, bandwidth overhead, replay speed andapplicability.
Keywords/Search Tags:chip multiprocessor, multithreaded program, deterministic replay, memory conflict, memory race recording, sychronization
PDF Full Text Request
Related items