Optimization And Implementation Of Data Transmission Strategy Based On RDMA

Posted on: 2019-02-18 | Degree: Master | Type: Thesis
Country: China | Candidate: B Liu | Full Text: PDF
GTID: 2428330611493263 | Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, data volumes keep growing, and big data processing frameworks and deep learning platforms are evolving to extract useful information from massive data more efficiently. Apache Spark is a fast, unified analytics engine for large-scale data processing. However, shuffling data between stages in a cluster is time-consuming: it requires many remote file and network I/O operations, which place a significant burden on the operating system at both the source and the destination. As a result, Spark cannot take full advantage of the performance offered by high-speed interconnects, and the shuffle has become a major performance bottleneck for Apache Spark. For deep learning platforms, researchers quickly recognized that deep learning has characteristics very similar to those of large-scale HPC applications. Thus, beginning in 2016, the established MPI interface became the de facto portable communication standard in distributed deep learning. Meanwhile, as hardware prices have dropped dramatically, most data centers today are equipped with modern high-performance interconnects such as InfiniBand, which can be used to design more efficient data transmission strategies that achieve high bandwidth and low latency. To accelerate the data transmission of Spark Shuffle and improve the training efficiency of deep learning, this thesis studies and optimizes data transmission strategies based on RDMA features. The main contributions are as follows:

(1) Accelerating Spark Shuffle with RDMA. We present a hybrid approach for Apache Spark that combines communication over conventional sockets with RDMA over InfiniBand. Our analysis focuses on the architecture and data flow of the shuffle phase. We then propose a new RDMA-based design that provides a tiered memory pool and uses SEND/RECV and READ to transfer small and large messages, respectively. These accelerations to the data transfer layer, delivered through a native RDMA-based data shuffle, benefit Spark workloads transparently.

(2) Research and optimization of a high-performance collective communications library based on RDMA. Mobius is a high-performance distributed collective communications library for deep learning. Its logical architecture is divided into four layers: the transport layer, the algorithm layer, the policy layer, and the input layer. The transport layer encapsulates TCP socket and RDMA communication. The algorithm layer calls the transport layer to transfer data. The policy layer dynamically selects the communication protocol and the communication algorithm. Mobius performs better than the Gloo library, but it still trails NCCL2.
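
The size-based transfer design in contribution (1) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the cutoff `SMALL_MESSAGE_LIMIT`, the tier sizes in `TIER_SIZES`, and the names `TieredPool` and `choose_verb` are all hypothetical, and plain `bytearray` objects stand in for RDMA-registered buffers. The sketch also assumes that small messages go over SEND/RECV and large ones over READ.

```python
from dataclasses import dataclass, field

# Hypothetical values: the abstract gives no thresholds or tier sizes.
SMALL_MESSAGE_LIMIT = 4096            # bytes; assumed small/large cutoff
TIER_SIZES = (4096, 65536, 1048576)   # bytes; assumed buffer size classes


@dataclass
class TieredPool:
    """Groups free buffers by size class so they can be reused.

    A real RDMA pool would hold buffers registered with the NIC;
    bytearray is only a stand-in here.
    """
    free: dict = field(default_factory=lambda: {size: [] for size in TIER_SIZES})

    def tier_for(self, nbytes):
        # Smallest tier that can hold the message, or None if none fits.
        for size in TIER_SIZES:
            if nbytes <= size:
                return size
        return None

    def acquire(self, nbytes):
        tier = self.tier_for(nbytes)
        if tier is None:
            # Larger than every tier: one-off allocation outside the pool.
            return bytearray(nbytes)
        bucket = self.free[tier]
        return bucket.pop() if bucket else bytearray(tier)

    def release(self, buf):
        # Only buffers whose length matches a tier return to the pool.
        if len(buf) in self.free:
            self.free[len(buf)].append(buf)


def choose_verb(nbytes):
    """Pick the transfer operation by message size (assumed policy)."""
    return "SEND/RECV" if nbytes <= SMALL_MESSAGE_LIMIT else "READ"
```

In this sketch a released buffer is handed out again on the next request that maps to the same tier, which is the reuse a pool of pre-registered buffers is meant to provide. The verb choice follows the common eager versus rendezvous split and may differ from the cutoff the thesis actually uses.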
Keywords/Search Tags: RDMA, InfiniBand, Spark Shuffle, Collective Communications Library