Research On The System-Level Optimizing Key Techniques For MPI Communication On Multicore Systems

Posted on: 2012-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Q Liu
Full Text: PDF
GTID: 1118330362460504
Subject: Computer Science and Technology
Abstract/Summary:
Over the last decades of the 20th century, MPI (Message Passing Interface) has become the de facto standard programming model in the High Performance Computing (HPC) domain. The performance of MPI communication usually plays a key role in the overall performance of MPI-based programs, so optimizing MPI communication is extremely important. Recently, with the rapid development of multicore technologies, MPI communication on multicore systems is expected to benefit from exploiting the new characteristics of multicore architectures. However, existing optimization techniques remain based on process-based MPI communication, which often incurs performance issues such as large overhead and heavy memory traffic, so current methods are limited in how far they can improve communication performance. To address these issues, this dissertation concentrates on key optimization strategies built on threaded-MPI communication techniques. Our investigation makes the following contributions:

(1) An effective threaded-MPI software technique, the MPI communication accelerator (MPIActor), is proposed for multicore systems. Compared with developing a threaded MPI inside a traditional MPI implementation, MPIActor requires a smaller development workload, is more flexible to use, and performs better. Moreover, MPIActor supports all traditional process-based MPI implementations that satisfy the MPI-2 standard, and inherits the inter-node communication performance advantages of the underlying MPI implementation.
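The core threaded-MPI idea can be illustrated with a minimal sketch: when the ranks of one node run as threads in a single address space, an intra-node send reduces to one in-memory handoff. The class and method names below are illustrative only, not MPIActor's actual API.

```python
import queue

class ThreadedRanks:
    """Sketch of the threaded-MPI idea: the MPI ranks of one node run as
    threads in a single address space, so an intra-node send is one
    in-memory handoff instead of a copy into and out of a shared-memory
    segment between separate processes."""
    def __init__(self, nranks):
        # One inbox per rank; threads sharing the address space can pass
        # message references through it directly.
        self.inbox = [queue.Queue() for _ in range(nranks)]

    def send(self, src, dst, msg):
        self.inbox[dst].put((src, msg))   # hands over a reference, no data copy

    def recv(self, dst):
        return self.inbox[dst].get()      # blocks until a message arrives
```

Because the ranks share one address space, the object the receiver obtains is the very buffer the sender posted, which is what removes the extra copy that process-based intra-node channels must pay.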
Experimental results with the OSU_LATENCY benchmark on a dual-socket Nehalem-EP system show that, for 8KB to 2MB messages, MPIActor improves the intra-socket communication performance of MVAPICH2 1.4 by 37% to 114% and its inter-socket performance by 30% to 144% compared with pure MVAPICH2 1.4. The experiments also show intra-socket gains of 48% to 106% and inter-socket gains of 46% to 98% for Open MPI 1.5 with MPIActor compared to pure Open MPI 1.5.

(2) A novel hierarchical collective communication algorithm framework (MAHCAF) and a group of effective threaded-MPI intra-node collective communication algorithms are proposed. MAHCAF constructs hierarchical collective algorithms using the template design pattern: its intra-node and inter-node collective communication sub-processes (IntraCP and InterCP) are extensible roles. IntraCP can be implemented either by general algorithms that are independent of the multicore architecture, or by architecture-specific algorithms supplied in multicore architecture drivers. Results with the Intel MPI Benchmarks show that MAHCAF with general intra-node collective algorithms remarkably improves the performance of MPI_Bcast, MPI_Allgather, MPI_Reduce and MPI_Allreduce compared with MVAPICH2 1.6. In addition, the intra-node reduce algorithm for the Nehalem architecture, the hierarchical segment reduce algorithm (HSRA), further improves the performance of MPI_Reduce and MPI_Allreduce.

(3) To reduce the negative impact of unbalanced process arrival (UPA) patterns on MPI broadcast, a novel Competitive and Pipelined (CP) method based on MPIActor is proposed.
The CP method exploits the multiple processes running within a multicore node by letting the first process to arrive act as the leading process that executes the inter-node collective communication. In this way the inter-node phase starts as early as possible, reducing the waiting cost. Results of a micro-benchmark show that broadcast algorithms enhanced by the CP method significantly outperform traditional algorithms, and extensive experiments with two real-world applications confirm that the CP method greatly improves broadcast performance in real scenarios.

(4) An efficient and effective Shared-memory Message Passing Interface (SMPI) built on threaded-MPI is proposed for optimizing intra-node communication on multicore systems. Instead of copying a message from the source process to the destination process, SMPI lets MPI processes on the same node communicate by directly accessing the buffer of the posted message. In particular, SMPI can be implemented efficiently using the existing mechanisms of MPIActor. For a 4000×4000 matrix multiplication computed by 64 processes on 8 nodes, the SMPI-based Cannon matrix multiplication algorithm achieves a speedup of about 1.14 over the plain MPI-based algorithm.
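The zero-copy principle behind SMPI can be sketched as follows: the sender publishes only a descriptor of its posted buffer, and the receiver works on that buffer in place rather than on a copy. The class and method names are illustrative, not SMPI's actual interface.

```python
class SMPINode:
    """Sketch of the SMPI idea: ranks on the same node exchange only a
    descriptor of the posted message, and the receiver accesses the
    sender's buffer in place instead of copying it."""
    def __init__(self):
        self.posted = {}                        # (src, tag) -> posted buffer view

    def post_send(self, src, tag, buf):
        # Publish a zero-copy view of the sender's buffer (the descriptor),
        # rather than enqueueing a copy of the data.
        self.posted[(src, tag)] = memoryview(buf)

    def access(self, src, tag):
        # The receiver operates directly on the posted buffer.
        return self.posted[(src, tag)]
```

Since the receiver's view aliases the sender's buffer, any access it performs touches the original data directly, which is exactly the copy that this scheme eliminates relative to conventional intra-node channels.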
Keywords/Search Tags: MPIActor, Multicore Processor, MPI Communication Optimization, Threaded-MPI, Hierarchical Collective Communication Algorithm, Competitive and Pipelined Method, Shared-Memory Message Passing Interface