
Research On Performance Optimization For Parallel Discrete Event Simulation On Multi-core Cluster

Posted on: 2012-12-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Chen
GTID: 1118330362960406
Subject: Computer Science and Technology
Abstract/Summary:
The current trend in processor architecture design is the integration of multiple cores on a single processor. Clusters built from such processors are widely adopted by Parallel Discrete Event Simulation (PDES) developers for large-scale simulation applications. The tightly integrated cores on one chip, with communication latencies substantially lower than those in conventional clusters, offer potential performance improvements, especially for fine-grained PDES. Thus, one research focus in the PDES domain is modifying software platforms to use the computational resources of multi-core processors efficiently.

Considering the characteristics of multi-core clusters and parallel discrete event simulation systems, this dissertation investigates ways to improve the performance of large-scale or complex simulation programs by addressing several factors that affect PDES performance, including event scheduling, shared-attribute access and communication optimization. The contributions of this dissertation are as follows:

Firstly, a synchronization algorithm that combines a multi-thread parallel mode with MPI is proposed. Experimental results show that the multi-thread parallel scheme outperforms the multi-process parallel mode in most cases. However, currently available simulation engines either lack support for combining a multi-thread parallel mode with MPI in distributed computing environments, or the relevant technologies are immature. This dissertation therefore considers the compatibility of the multi-thread parallel mode with cluster computing platforms and proposes a time management mechanism that combines multi-thread parallelism on each machine with MPI communication among all machines in the cluster.
A group of tests has been performed, and the results show that this hybrid mechanism runs well on multi-core clusters.

Secondly, a global scheduling mechanism based on a distributed event queue is proposed to improve the performance of Time Warp systems on multi-core systems. Current dynamic load-balancing techniques cannot achieve the twin goals of good balance and low event-scheduling overhead. The proposed mechanism takes advantage of the shared memory address space and low communication cost of multi-core architectures, and its specially designed data structures and algorithms substantially reduce the cost of lock operations. Compared with the local scheduling mechanism on a distributed event queue, experimental results show that the global scheduling mechanism effectively decreases the rollback rate and balances workloads at a low event-scheduling cost for Time Warp systems on multi-core platforms.

Thirdly, a shared attribute/state access mechanism based on transactional memory is proposed, both to make modeling easier for users and to improve the performance of Time Warp systems on multi-core systems. This mechanism provides transparent access to shared attributes through a simple API and offers more powerful modeling ability for agent-based simulation applications. A case study demonstrates how to use the mechanism and what benefits it brings. Theoretical analysis shows that this access mechanism not only eases the attribute publishing/subscribing burden on simulation model developers but also reduces the number of messages. Experimental results show that the STM-based shared attribute access mechanism significantly outperforms the conventional "pull" mechanism on multi-core platforms.

Fourthly, a more effective latency-hiding mechanism for parallel agent-based model simulations (ABMS) with millions of agents is proposed.
The current B+2R latency-hiding algorithm hides only part of the communication latency. In this dissertation, a new latency-hiding algorithm is proposed. The principle of this algorithm is to trade a certain amount of redundant computation for communication. An analytical model of the algorithm is given, and theoretical analysis shows that it can hide all of the communication latency when a proper R is selected. In addition, a B+2(R×r) algorithm, which combines the new and old B+2R algorithms, is designed to make the new B+2R algorithm effective on GPU platforms. Experimental results indicate the benefits of the new B+2R latency-hiding scheme, which delivers runtime improvements of over 40% for certain benchmark ABMS application scenarios with several billion agents.

Finally, substantial performance optimization has been carried out on a simulation application that forecasts the trend of public opinion under critical conditions, in order to reduce memory overhead and achieve scalability. Experimental results demonstrate that the system scale increases by one order of magnitude on a single multi-core machine and that good scalability is achieved on a multi-core cluster.
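
The idea of trading redundant computation for communication can be illustrated with a small sketch. It is not the dissertation's algorithm. It assumes that B denotes the number of cells a worker owns and R the number of extra boundary (halo) cells fetched on each side, an interpretation the abstract does not spell out. The three-point update rule, the ring-shaped world and the function names `step`, `reference` and `blocked` are all hypothetical. Each block gathers B + 2R cells in one exchange and then advances R steps locally, recomputing boundary cells that its neighbours also compute, so an exchange is needed only once every R steps.

```python
def step(cells):
    """Apply one update of a toy three-point rule (an assumption for illustration).

    The result is two cells shorter than the input, because the outermost
    cells have no neighbour on one side.
    """
    return [(cells[i - 1] + cells[i] + cells[i + 1]) % 7
            for i in range(1, len(cells) - 1)]


def reference(world, steps):
    """Single-domain baseline on a ring: wrap one cell per side, then update."""
    for _ in range(steps):
        world = step([world[-1]] + world + [world[0]])
    return world


def blocked(world, B, R, steps):
    """Block-wise update: one exchange of B + 2R cells, then R local steps."""
    n = len(world)
    assert n % B == 0 and steps % R == 0
    for _ in range(steps // R):
        new_world = []
        for start in range(0, n, B):
            # "Communication": a single gather of the block plus R halo
            # cells on each side (indices wrap around the ring).
            local = [world[(start - R + k) % n] for k in range(B + 2 * R)]
            # "Redundant computation": every local step shrinks the valid
            # region by one cell per side; after R steps exactly B remain.
            for _ in range(R):
                local = step(local)
            new_world.extend(local)
        world = new_world
    return world


if __name__ == "__main__":
    world = [(3 * i * i + i) % 7 for i in range(24)]
    # Exchanging every 3 steps gives the same state as exchanging every step.
    print(blocked(world, B=6, R=3, steps=6) == reference(world, 6))
```

In this sketch a larger R means fewer exchanges but more recomputed halo cells per block, which is the trade-off the abstract refers to when it says a proper R must be selected.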
Keywords/Search Tags:parallel discrete event simulation, multi-core, many-core, cluster, parallelization, event-scheduling, shared-attribute access, communication latency-hiding