Shuffle Performance Optimization Of Spark Based On RDMA Technology

Posted on:2018-07-19

Degree:Master

Type:Thesis

Country:China

Candidate:R J Yu

Full Text:PDF

GTID:2428330569998665

Subject:Computer technology

Abstract/Summary:

In recent years, in-memory computing has risen rapidly, and Spark, a big data processing system built on in-memory computing, has been widely adopted in many fields around the world. Compared with Hadoop, Spark offers greatly improved performance, especially for interactive and iterative computing. Spark still follows the MapReduce framework, so Shuffle remains one of its key stages, and this stage has increasingly become the performance bottleneck of the system. The bottleneck in the Shuffle phase is that network transmission is not fast enough. RDMA has been widely deployed in recent years, and its low latency and high bandwidth make it a natural choice for accelerating Spark's network transmission, thereby speeding up the Shuffle process and ultimately improving the performance of the whole system.

This thesis consists of two parts. The first is the design and implementation of an underlying RDMA transmission engine tailored to the characteristics of Spark. The second optimizes the structure of the Spark Shuffle module itself so that the transmission engine can be integrated into Spark with minimal overhead.

In the first part, we design an RDMA transmission engine that, for the first time, implements Spark's transport module with connectionless (datagram) data transmission. The aim is to avoid the problem in which too many active connections raise the cache miss rate of the RNIC's on-chip cache and thereby degrade RDMA transmission performance. Datagram transmission brings advantages but also introduces new problems. To exploit the connectionless nature of datagrams while addressing those problems, we first propose a mechanism of message fragmentation, out-of-order parallel fragment transmission, and fragment reassembly. It adapts to the datagram transmission mode and exploits parallelism in order to make full use of the bandwidth of the physical network and multi-[text missing in source], thereby increasing overall system throughput. We then design a dynamic buffer pool to manage the buffers used during transmission. Because the application may request a large number of buffers at once, buffer allocation and release are designed around that pattern to optimize the performance of the pool.

In the second part, we optimize the Spark Shuffle module itself, replacing the copy from Java virtual machine heap memory into native memory with direct memory mapping in order to reduce overhead.

Finally, the optimized system was tested with BigDataBench, a popular open-source big data benchmark, on a small cluster of real machines connected by an InfiniBand network. The results show that the optimized system performs about 16% better than Spark using Socket communication.
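
The fragmentation and reassembly mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names `Fragment`, `fragment`, and `Reassembler`, the header fields, and the 4096-byte payload limit are all assumptions chosen for illustration, and no real RDMA calls are made. The sketch only shows how a message larger than one datagram might be split into indexed fragments that can arrive in any order and still be rebuilt.

```python
from dataclasses import dataclass

# Assumed maximum payload per datagram, in bytes (hypothetical value).
MAX_PAYLOAD = 4096


@dataclass(frozen=True)
class Fragment:
    """One datagram-sized piece of a larger message (hypothetical layout)."""
    msg_id: int    # identifies the message this fragment belongs to
    index: int     # position of this fragment within the message
    total: int     # number of fragments in the message
    payload: bytes


def fragment(msg_id, data, max_payload=MAX_PAYLOAD):
    """Split data into fragments of at most max_payload bytes each."""
    total = max(1, -(-len(data) // max_payload))  # ceiling division
    return [
        Fragment(msg_id, i, total, data[i * max_payload:(i + 1) * max_payload])
        for i in range(total)
    ]


class Reassembler:
    """Collects fragments in any arrival order and rebuilds whole messages."""

    def __init__(self):
        self._pending = {}  # msg_id -> {fragment index: payload}

    def accept(self, frag):
        """Store one fragment; return the full message once it is complete."""
        parts = self._pending.setdefault(frag.msg_id, {})
        parts[frag.index] = frag.payload
        if len(parts) < frag.total:
            return None  # still waiting for more fragments
        del self._pending[frag.msg_id]
        return b"".join(parts[i] for i in range(frag.total))
```

Because each fragment carries its own index and the total count, the receiver needs no ordering guarantee from the transport, which is what would allow fragments to be sent in parallel over a connectionless channel. A real engine would also have to handle lost or duplicated datagrams, which this sketch leaves out.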

Keywords/Search Tags:

Spark, RDMA, Shuffle, Network, Transport, Performance optimization

Related items

1	Optimization And Implementation Of Data Transmission Mechanism Based On RDMA
2	Optimization And Implementation Of Data Transmission Strategy Based On RDMA
3	Research On Spark Shuffle Process Performance Optimization
4	Research On Key Technologies And Application On YARN For High-Performance Computing
5	Performance Optimization Methods For Shuffle Process Of Spark Platform
6	Research On Large-scale Traffic Classification Technology Based On Spark Performance Optimization
7	Research On Shuffle Mechanism In Spark Cluster
8	Research On Task Execution Optimization In Spark
9	Optimization Of Spark Task Scheduler For Shuffle Operators
10	Key-Value Store Performance Optimization Based On RDMA