
Research On Efficient Parallel Performance Simulation For Computer Architecture

Posted on: 2012-07-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C F Xu
GTID: 1118330362460462
Subject: Computer Science and Technology
Abstract/Summary:
Simulation, modeling and benchmarking are considered the three main performance evaluation techniques for computer systems. Because it achieves a better trade-off between the cost and the flexibility of performance evaluation, simulation is receiving more and more attention in both academia and industry. Simulator code mimics hardware behavior, and thus can be several orders of magnitude slower than executing the target program on a real machine. Traditional serial methodology can no longer meet the capacity and time needs of architecture simulation, and this is especially true when simulating large-scale computer systems. Consequently, parallel performance simulation on a parallel host machine is becoming more and more popular. But with increasing architecture complexity and program size, even the efficiency of parallel performance simulation for large-scale machines is becoming a bottleneck for its wide and practical use.
This dissertation focuses on how to improve the efficiency and accuracy of parallel performance simulation for computer architecture. Our work covers efficient parallel performance simulation of processor architecture, efficient mapping and adaptive synchronization in the parallel simulation of large-scale machines, and the design, implementation and application of a parallel simulator for a new architecture. The main contributions of the dissertation include:
(1) We establish an analytical model for time-division parallel simulation of processor architecture. Based on the model, we draw some useful conclusions about parallel speedup and efficiency for typical parallel simulation configurations. We analyze the load-imbalance problem among parallel simulation nodes in previous time-division approaches, and propose SEDSim, a scalably and evenly distributed simulation approach for processor architecture. SEDSim uses a cost-model-guided even partition and allocation (CoMEPA) policy to achieve theoretically perfect load balance. We also propose an allocation algorithm based on minimum equivalent cost (MinEC) to integrate an arbitrary number of non-consecutive sampling intervals with SEDSim. Both theoretical analysis and experimental results validate the advantages of SEDSim.
(2) We propose MinCoM, a minimum communication-guided mapping method for efficient mapping of logical processes (LPs) to physical processes (PEs) in the parallel simulation of large-scale systems. MinCoM uses communication information among LPs, extracted from traces, to generate the mapping and to ensure that the mapping achieves minimum communication among PEs. Based on MinCoM, we propose A2-MinCoM, a mapping scheme using array assignment for applications with regular communication patterns, and TP-MinCoM, a two-phase mapping scheme for popular multi-core host machines. Experimental results show that our mapping schemes can significantly reduce the execution time of parallel simulation compared to traditional block and cyclic mapping schemes.
(3) We exploit application behavior to optimize adaptive synchronization for the parallel simulation of large-scale computer systems. We propose ETD-Adaptive, an adaptive method based on event trigger degree (ETD), and Iter-Adaptive, a two-phase hybrid adaptive method for the parallel simulation of iterative applications. ETD-Adaptive uses future events and their dependencies, extracted from traces, to guide the adaptive adjustment of the time window. Iter-Adaptive collects information about the time window during the adjustment procedure while simulating iterative applications, and then establishes an appropriate and stable time window for the subsequent simulation. Our methods are implemented and tested on the BigSim parallel simulator. Test results validate their effectiveness and advantages.
(4) We evaluate the effect of trace generation in large-scale trace-driven parallel simulation. Several target parallel applications with different computation-to-communication ratios and three host machines with different trace I/O modes are selected and examined. Results show that trace generation has a significant effect on both simulation efficiency and simulation accuracy. The reasons for the trace effect are analyzed and some possible solutions are also discussed. The conclusions of our evaluation are helpful for the design, implementation and use of trace-driven parallel architecture simulators.
(5) We design and implement MCPSim, a parallel simulator for performance prediction and analysis of multi-core clusters. MCPSim can simulate three kinds of message communication, namely intra-CMP, inter-CMP and inter-node messages, in both its functional model and its performance model. Thus MCPSim can not only produce accurate performance predictions, but also support profiling of detailed message communication behavior, including message volume and the distribution of messages of different sizes. It is a useful tool for performance prediction and analysis of message-passing programs on multi-core clusters.
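To illustrate why the LP-to-PE mapping discussed in contribution (2) matters, the following is a minimal Python sketch that compares the inter-PE communication volume of block and cyclic mappings. It is not the MinCoM algorithm, whose details are not given in this abstract. The function names, the ring communication pattern, and the unit message volume are all assumptions made for illustration.

```python
# Illustrative sketch only: compares block and cyclic LP-to-PE mappings.
# This is NOT MinCoM; all names and the ring traffic pattern are hypothetical.

def block_mapping(num_lps, num_pes):
    """Assign contiguous chunks of LPs to each PE."""
    per_pe = -(-num_lps // num_pes)  # ceiling division
    return [lp // per_pe for lp in range(num_lps)]

def cyclic_mapping(num_lps, num_pes):
    """Assign LPs to PEs in round-robin order."""
    return [lp % num_pes for lp in range(num_lps)]

def ring_traffic(num_lps, volume=1):
    """Assumed pattern: each LP sends `volume` units to its right neighbor."""
    return {(lp, (lp + 1) % num_lps): volume for lp in range(num_lps)}

def inter_pe_volume(comm, mapping):
    """Sum the traffic whose source and destination LPs sit on different PEs."""
    total = 0
    for (src, dst), volume in comm.items():
        if mapping[src] != mapping[dst]:
            total += volume
    return total

comm = ring_traffic(8)
block = block_mapping(8, 2)
cyclic = cyclic_mapping(8, 2)
print(inter_pe_volume(comm, block), inter_pe_volume(comm, cyclic))  # prints "2 8"
```

For this assumed nearest-neighbor pattern, the block mapping keeps most messages inside a PE while the cyclic mapping sends every message across PEs. A communication-guided mapping would search for an assignment that minimizes this kind of cost using the traffic actually observed in traces.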
Keywords/Search Tags: Computer architecture, Parallel performance simulation, Time-division, Mapping, Adaptive synchronization, Trace, Multi-core cluster