
Design and evaluation of efficient collective communications on modern interconnects and multi-core clusters

Posted on: 2011-05-19
Degree: Ph.D
Type: Dissertation
University: Queen's University (Canada)
Candidate: Qian, Ying
Full Text: PDF
GTID: 1448390002954831
Subject: Engineering
Abstract/Summary:
Two driving forces behind high-performance clusters are the availability of modern interconnects and the advent of multi-core systems. As multi-core clusters become commonplace, each core will run at least one process with multiple intra-node and inter-node connections to several other processes, placing immense pressure on the interconnection network and its communication system software.

To overcome bandwidth limitations and to enhance fault tolerance, the use of multiple independent networks, known as multi-rail networks, is very promising. A multi-rail Quadrics QsNetII network is constructed using multiple network interface cards (NICs) per node, with each NIC connected to a separate rail. I design and evaluate a number of Remote Direct Memory Access (RDMA) based multi-port collective operations on the multi-rail QsNetII network. I also extend the gather and allgather algorithms to be shared-memory aware for small to medium messages. The algorithms prove to be much more efficient than the native Quadrics MPI implementation.

ConnectX is the newest generation of InfiniBand host channel adapters from Mellanox Technologies. I provide evidence that ConnectX achieves scalable performance for simultaneous communication over multiple connections. Utilizing this ability of ConnectX cards, I propose a number of RDMA-based, multi-connection and multi-core aware allgather algorithms at the MPI level. My algorithms are devised to target different message sizes, and the performance results show that they outperform the native MVAPICH implementation.

Recent studies show that MPI processes in real applications can arrive at an MPI collective operation at different times. This imbalanced process arrival pattern can significantly affect the performance of the collective communication operation. Therefore, the design and efficient implementation of collectives under different process arrival patterns are critical to the performance of scientific applications running on modern clusters. I propose novel RDMA-based, process arrival pattern aware alltoall and allgather algorithms for different message sizes on InfiniBand clusters. I also extend these algorithms to be shared-memory aware for small to medium messages under process arrival patterns. The performance results indicate that the proposed algorithms outperform the native MVAPICH implementation as well as other non-process-arrival-pattern-aware algorithms when processes arrive at different times.

Many parallel scientific applications use Message Passing Interface (MPI) collective communications intensively. Therefore, efficient and scalable implementation of MPI collective operations is critical to the performance of applications running on clusters. In this dissertation, I propose and evaluate a number of efficient collective communication algorithms that utilize the modern features of Quadrics and InfiniBand interconnects as well as the availability of multiple cores on emerging clusters.
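The shared-memory aware collectives described above follow a hierarchical pattern: processes communicate within a node first, where shared memory is fast, and only one process per node carries traffic across the interconnect. As a rough illustration only, and not the RDMA-based algorithms developed in the dissertation, the C sketch below builds an allgather from standard MPI calls in three steps: intra-node gather to a leader, inter-node allgather among leaders, and intra-node broadcast. The helper name hierarchical_allgather is hypothetical, the payload is fixed to integers for brevity, and the sketch assumes every node hosts the same number of consecutively ranked processes.

/* Minimal sketch of a hierarchical, shared-memory-aware allgather.
 * Illustration only: the dissertation's algorithms use RDMA and
 * multi-rail/multi-connection features directly, not these MPI calls.
 * Assumes block rank placement and an equal process count per node. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allgather(const int *sendbuf, int count,
                            int *recvbuf, MPI_Comm comm)
{
    int world_size, node_rank, node_size;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_size(comm, &world_size);

    /* Processes on the same node share a communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One leader per node (node_rank == 0) joins leader_comm. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leader_comm);

    /* Step 1: intra-node gather to the node leader (shared-memory path). */
    int *node_buf = NULL;
    if (node_rank == 0)
        node_buf = malloc((size_t)node_size * count * sizeof(int));
    MPI_Gather(sendbuf, count, MPI_INT,
               node_buf, count, MPI_INT, 0, node_comm);

    /* Step 2: inter-node allgather among leaders only. */
    if (node_rank == 0)
        MPI_Allgather(node_buf, node_size * count, MPI_INT,
                      recvbuf, node_size * count, MPI_INT, leader_comm);

    /* Step 3: each leader broadcasts the full result inside its node. */
    MPI_Bcast(recvbuf, world_size * count, MPI_INT, 0, node_comm);

    if (node_rank == 0) {
        free(node_buf);
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
}

The leader-based structure is what "multi-core aware" refers to in this context: only one process per node loads the interconnect, while the remaining cores exchange their data through node-local shared memory.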
Keywords/Search Tags: Clusters, Modern, Interconnects, Multi-core, Collective, Efficient, Performance, MPI