Research Of Performance Optimization For Data Skew Based On High-speed Networks

Posted on: 2021-01-21

Degree: Master

Type: Thesis

Country: China

Candidate: Z Y He

Full Text: PDF

GTID: 2428330620968177

Subject: Software engineering

Abstract/Summary:


Advances in hardware technology drive the evolution of modern data processing systems. On the computing side, mature multi-core and many-core CPU technology allows most data processing systems to partition data and process it in parallel, fully exploiting the characteristics of modern CPUs. On the storage side, large-capacity memory allows data processing systems to cache data in advance and avoid the I/O bottleneck introduced by traditional disks, which significantly improves processing performance. On the network side, RDMA has gradually become popular in modern data centers, and high-bandwidth, low-latency RDMA networks can effectively address the classic network bottleneck in distributed systems. Data skew has always been one of the most important causes of performance degradation. Reanalyzing data skew on the basis of modern hardware is therefore an interesting topic.

When data skew occurs, data partitions differ in size after the data processing system loads the data into memory. The execution tasks responsible for the larger partitions become the performance bottleneck. Most traditional approaches to handling data skew are based on sampling and repartitioning. With sampling, it is difficult to balance accuracy against the additional overhead it introduces. Repartitioning often requires suspending the execution phase. Both methods incur additional overhead. In this dissertation, we reanalyze distributed data processing in the presence of data skew and implement an optimization scheme on Apache Spark that accelerates the Spark SQL library under data skew. The main work and contributions of this dissertation can be summarized as follows:

(1) We propose a dynamic execution optimization scheme to handle intra-node data skew. It is lightweight and transparent to users. The core idea is data stealing: when data skew occurs, execution tasks for smaller data partitions finish processing their own partitions and then actively steal data from the execution tasks for larger data partitions.

(2) Based on dynamic execution optimization, we further analyze the inter-node data skew problem and propose DS2 (Two-Phase Data Stealing) to solve both intra-node and inter-node data skew. It first handles intra-node data skew and then solves inter-node data skew through the RDMA network. DS2 consists of three phases: a loading phase that loads data from disk into memory, a task generation phase that generates execution tasks, and a data processing phase that processes data. By optimizing these three phases, DS2 can cope with different execution operators and with the different data distributions caused by different degrees of data skew.

(3) Considering both the specific scenario of remote data stealing and the characteristics of RDMA, we design a hybrid RDMA remote data stealing scheme to steal data from remote nodes. It makes full use of both one-sided and two-sided RDMA operations, which improves the efficiency of data stealing between nodes.

(4) We implement a system prototype, Spark-DS2, on the basis of Apache Spark, and test its performance with Spark SQL on different workloads with different degrees of data skew.

In summary, we study intra-node and inter-node data skew and propose DS2 for high-speed RDMA-capable networks. It improves processing performance under data skew across the data loading phase, the task generation phase, and the data processing phase. We also implement DS2 in Apache Spark and show that it can effectively improve the performance of Spark SQL when data skew occurs.
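The intra-node data stealing idea in contribution (1) can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the Spark-DS2 implementation: the function names, the `chunk` parameter, the "one record per task per tick" cost model, and the choice of the most loaded task as the victim are all hypothetical and do not come from the dissertation.

```python
from collections import deque


def makespan_without_stealing(partition_sizes):
    """Each task processes only its own partition, one record per tick,
    so the largest partition determines the total running time."""
    return max(partition_sizes)


def makespan_with_stealing(partition_sizes, chunk=4):
    """Simulate tasks that, once idle, steal up to `chunk` records from
    the task with the most remaining records (hypothetical policy)."""
    queues = [deque(range(n)) for n in partition_sizes]
    total = sum(partition_sizes)
    processed = 0
    ticks = 0
    while processed < total:
        # Idle tasks steal from the tail of the currently largest queue,
        # always leaving the victim at least one record to work on.
        for q in queues:
            if not q:
                victim = max(queues, key=len)
                for _ in range(min(chunk, len(victim) - 1)):
                    q.append(victim.pop())
        # Every task that has work processes one record in this tick.
        for q in queues:
            if q:
                q.popleft()
                processed += 1
        ticks += 1
    return ticks


if __name__ == "__main__":
    skewed = [100, 10, 10, 10]  # one large partition, three small ones
    print("no stealing:  ", makespan_without_stealing(skewed))
    print("with stealing:", makespan_with_stealing(skewed))
```

With a skewed input such as `[100, 10, 10, 10]`, the run without stealing is bounded by the largest partition, while the run with stealing approaches the balanced bound of total records divided by the number of tasks. A real system would also have to account for synchronization and data movement costs, which this model ignores.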

Keywords/Search Tags:

data skew, Spark SQL, data stealing, distributed computing, RDMA


Related items

1	Research On Partition Loading Balance Based On Spark Data Skew
2	Spark Task Scheduling With Data Skew And Deadline Constraints
3	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
4	Research On Data Skew Optimization In Spark Computing Framework
5	Research Of Data Skew On Spark Based On Improved Partition Method
6	Research And Implementation Of Balanced Partition Method Based On Spark Computing Granularity Adjustment
7	Research On And Application Of The Solution For Spark Data Skew Scenarios
8	A System For Distributed MD Data Analysis Based On Spark
9	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
10	Research And Implementation On Anti-skew Spark Intermediate Data Partition Mechanism