Research Of Performance Optimization For Data Skew Based On High-speed Networks

Posted on: 2021-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y He
GTID: 2428330620968177
Subject: Software engineering
Abstract/Summary:
Advances in hardware technology drive the evolution of modern data processing systems. On the computing side, mature multi-core and many-core CPU technology allows most data processing systems to partition data and process it in parallel, fully exploiting the characteristics of modern CPUs. On the storage side, large-capacity memory allows data processing systems to cache data in advance and avoid the I/O bottleneck introduced by traditional disks; caching data in memory significantly improves processing performance. On the network side, RDMA has gradually become popular in modern data centers, and high-bandwidth, low-latency RDMA networks can effectively address the classic network bottleneck in distributed systems. Data skew, however, has always been one of the most important causes of performance degradation. Re-examining data skew on the basis of modern hardware is therefore a worthwhile research topic.

When data skew occurs, the data partitions that a data processing system loads into memory differ in size. The execution tasks responsible for the larger partitions become the performance bottleneck and slow down the whole job. Most traditional approaches to handling data skew are based on sampling and repartitioning. With sampling, it is difficult to balance accuracy against the additional overhead it introduces. Repartitioning often requires suspending the execution phase. Both methods incur additional overhead. In this dissertation, we reanalyze distributed data processing in the presence of data skew and implement an optimization scheme on Apache Spark that accelerates the Spark SQL library under data skew. The main work and contributions of this dissertation can be summarized as follows:

(1) We propose a dynamic execution optimization scheme to handle intra-node data skew. It is lightweight and transparent to users. The core idea is data stealing: when data skew occurs, execution tasks for smaller data partitions finish processing their own partitions and then actively steal data from the execution tasks for larger data partitions.

(2) Building on dynamic execution optimization, we further analyze the inter-node data skew problem and propose DS2 (Two-Phase Data Stealing) to solve both intra-node and inter-node data skew. It first handles intra-node data skew and then resolves inter-node data skew over the RDMA network. DS2 consists of three phases: a loading phase that loads data from disk into memory, a task generation phase that generates execution tasks, and a data processing phase that processes data. By optimizing these three phases, DS2 can cope with different execution operators and with the different data distributions caused by different degrees of data skew.

(3) Taking into account both the specific scenario of remote data stealing and the characteristics of RDMA, we design a hybrid RDMA remote data stealing scheme to steal data from remote nodes. It makes full use of both one-sided and two-sided RDMA operations, which improves the efficiency of data stealing between nodes.

(4) We implement a system prototype, Spark-DS2, on the basis of Apache Spark, and test its performance through Spark SQL on different workloads with different degrees of data skew.

In summary, we study intra-node and inter-node data skew and propose DS2 on high-speed RDMA-capable networks. It improves processing performance under data skew across the data loading, task generation, and data processing phases. We also implement DS2 in Apache Spark and show that it can effectively improve the performance of Spark SQL when data skew occurs.
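
The intra-node data stealing idea in contribution (1) can be illustrated with a minimal single-process sketch. This is not the Spark-DS2 implementation, which the abstract does not detail; the names `Partition`, `run_with_stealing`, and `batch` are hypothetical, and choosing the currently largest partition as the victim and stealing from its tail are assumptions made only for illustration. Each worker first drains its own partition and then steals batches from whichever partition still holds the most data.

```python
import threading
from collections import deque


class Partition:
    """A lock-protected queue of records owned by one worker (hypothetical helper)."""

    def __init__(self, items):
        self._items = deque(items)
        self._lock = threading.Lock()

    def take(self, n):
        # The owner consumes records from the front of its partition.
        with self._lock:
            return [self._items.popleft() for _ in range(min(n, len(self._items)))]

    def steal(self, n):
        # A thief removes records from the back (an assumption of this sketch).
        with self._lock:
            return [self._items.pop() for _ in range(min(n, len(self._items)))]

    def __len__(self):
        with self._lock:
            return len(self._items)


def run_with_stealing(partitions, process, batch=4):
    """Process skewed partitions in parallel; idle workers steal from the largest one."""
    parts = [Partition(p) for p in partitions]
    results = [[] for _ in parts]  # results[i] is written only by worker i

    def worker(i):
        # Step 1: process the worker's own partition.
        while True:
            chunk = parts[i].take(batch)
            if not chunk:
                break
            results[i].extend(process(x) for x in chunk)
        # Step 2: steal from the partition that currently holds the most records.
        while True:
            victim = max(parts, key=len)
            chunk = victim.steal(batch)
            if chunk:
                results[i].extend(process(x) for x in chunk)
            elif all(len(p) == 0 for p in parts):
                break  # nothing left anywhere, so the worker can exit

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(parts))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # One large partition and two small ones simulate data skew.
    skewed = [list(range(1000)), [1000, 1001], [1002]]
    out = run_with_stealing(skewed, lambda x: x * 2)
    print([len(r) for r in out])  # records handled per worker; varies between runs
```

Every record is removed from its partition under a lock, so each one is processed exactly once regardless of which worker handles it. How the work is split between workers depends on thread scheduling and differs from run to run.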
Keywords/Search Tags: data skew, Spark SQL, data stealing, distributed computing, RDMA