Research On Efficient Parallel Performance Simulation For Computer Architecture

Posted on:2012-07-23

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C F Xu

Full Text:PDF

GTID:1118330362460462

Subject:Computer Science and Technology

Abstract/Summary:


Simulation, modeling and benchmarking are considered the three main performance evaluation techniques for computer systems. Because it offers a better trade-off between the cost and flexibility of performance evaluation, simulation is receiving more and more attention in both academia and industry. Simulator code mimics hardware behavior, and can therefore be several orders of magnitude slower than executing the target program on a real machine. The traditional serial methodology can no longer meet the capacity and time requirements of architecture simulation, especially when simulating large-scale computer systems. Consequently, parallel performance simulation on parallel host machines is becoming increasingly popular. But as architecture complexity and program size grow, even the efficiency of parallel performance simulation for large-scale machines is becoming a bottleneck to its wide and practical use.

This dissertation focuses on how to improve the efficiency and accuracy of parallel performance simulation for computer architecture. Our work covers efficient parallel performance simulation of processor architecture, efficient mapping and adaptive synchronization in parallel simulation of large-scale machines, and the design, implementation and application of a parallel simulator for new architectures. The main contributions are:

(1) We establish an analytical model for time-division parallel simulation of processor architecture. Based on the model, we draw some useful conclusions about parallel speedup and efficiency for typical parallel simulation configurations. We analyze the load-imbalance problem among parallel simulation nodes in previous time-division approaches, and propose SEDSim, a scalable and evenly distributed simulation approach for processor architecture. SEDSim uses a cost-model-guided even partition and allocation (CoMEPA) policy to achieve theoretically perfect load balance. We also propose an allocation algorithm based on minimum equivalent cost (MinEC) to integrate an arbitrary number of non-consecutive sampling intervals with SEDSim. Both theoretical analysis and experimental results validate the advantages of SEDSim.

(2) We propose MinCoM, a minimum-communication-guided mapping method for efficiently mapping logical processes (LPs) to physical processes (PEs) in the parallel simulation of large-scale systems. MinCoM uses communication information among LPs, extracted from traces, to generate the mapping, and ensures that the mapping minimizes communication among PEs. Based on MinCoM, we propose A2-MinCoM, a mapping scheme using array assignment for applications with regular communication patterns, and TP-MinCoM, a two-phase mapping scheme for popular multi-core host machines. Experimental results show that our mapping schemes significantly reduce the execution time of parallel simulation compared with traditional block and cyclic mapping schemes.

(3) We exploit application behavior to optimize adaptive synchronization for parallel simulation of large-scale computer systems. We propose ETD-Adaptive, an adaptive method based on event trigger degree (ETD), and Iter-Adaptive, a two-phase hybrid adaptive method for parallel simulation of iterative applications. ETD-Adaptive uses future events and their dependencies, extracted from traces, to guide the adaptive adjustment of the time window. Iter-Adaptive collects information about the time window during the adjustment procedure while simulating iterative applications, and then establishes an appropriate and stable time window for the subsequent simulation. Our methods are implemented and tested on the BigSim parallel simulator. Test results validate their effectiveness and advantages.

(4) We evaluate the effect of trace generation in large-scale trace-driven parallel simulation. Several target parallel applications with different computation-to-communication ratios and three host machines with different trace I/O modes are selected and examined. Results show that trace generation has a significant effect on both simulation efficiency and simulation accuracy. The causes of this effect are analyzed and some possible solutions are discussed. The conclusions of our evaluation can help in the design, implementation and use of trace-driven parallel architecture simulators.

(5) We design and implement MCPSim, a parallel simulator for performance prediction and analysis of multi-core clusters. MCPSim can simulate three kinds of message communication, namely intra-CMP, inter-CMP and inter-node messages, in both its functional model and its performance model. MCPSim can therefore not only produce accurate performance predictions, but also support profiling of detailed message communication behavior, including message volume and the distribution of messages by size. It is a useful tool for performance prediction and analysis of message-passing programs on multi-core clusters.
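As a rough illustration of the mapping problem addressed in contribution (2), the Python sketch below compares the traditional block and cyclic mappings by the amount of LP-to-LP traffic that crosses PE boundaries. It is not the MinCoM algorithm, whose details the abstract does not give. The function names, the ring traffic pattern and the unit message volumes are all assumptions made for this example.

```python
def block_mapping(num_lps, num_pes):
    """Block mapping: consecutive LPs are placed on the same PE."""
    per_pe = -(-num_lps // num_pes)  # ceiling division
    return [lp // per_pe for lp in range(num_lps)]


def cyclic_mapping(num_lps, num_pes):
    """Cyclic mapping: LPs are dealt out to PEs in round-robin order."""
    return [lp % num_pes for lp in range(num_lps)]


def inter_pe_volume(comm, mapping):
    """Sum the traffic of every LP pair whose endpoints sit on different PEs.

    comm maps (source LP, destination LP) to a message volume; in a
    trace-driven setting this would be extracted from traces.
    """
    return sum(volume for (src, dst), volume in comm.items()
               if mapping[src] != mapping[dst])


def ring_traffic(num_lps, volume=1):
    """Hypothetical pattern: each LP sends to its right-hand neighbour."""
    return {(lp, (lp + 1) % num_lps): volume for lp in range(num_lps)}


comm = ring_traffic(8)
block = block_mapping(8, 4)    # [0, 0, 1, 1, 2, 2, 3, 3]
cyclic = cyclic_mapping(8, 4)  # [0, 1, 2, 3, 0, 1, 2, 3]

print(inter_pe_volume(comm, block))   # 4
print(inter_pe_volume(comm, cyclic))  # 8
```

For this nearest-neighbour pattern, block mapping keeps half of the traffic inside a PE while cyclic mapping sends every message across PEs. A communication-guided scheme would presumably search for the assignment that lowers this number for whatever pattern the traces show.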

Keywords/Search Tags:

Computer architecture, Parallel performance simulation, Time-division, Mapping, Adaptive synchronization, Trace, Multi-core cluster


Related items

1	Research On Performance Optimization For Parallel Discrete Event Simulation On Multi-core Cluster
2	Parallel Simulation Of Large Scale Computer Systems
3	Research On Parallel Simulation Engine Based On Multi-core
4	Research On Architecture Of Real-time Cluster Computer
5	Multi-core DSP Parallel Architecture Design Of Time-domain SAR Imaging Algorithm In Large Squint Angle
6	Research On Image Processing System Based On Multi-core DSP TI-C6678
7	THE DESIGN AND PERFORMANCE OF A PARALLEL COMPUTER ARCHITECTURE FOR SIMULATION
8	Model And Algorithm Of Petri Nets Parallelization Based On Multi-core Cluster
9	A Research Of Quantum Computer Simulation Based On CPU+GPU Hybrid Architecture Cluster
10	Parallelization Of Simulation Model Portability Specification On Multi-core Computer