
Research And Optimization Of Adaptive Techniques For Mitigating Skew In Spark

Posted on:2017-01-05 | Degree:Master | Type:Thesis
Country:China | Candidate:J D Yu | Full Text:PDF
GTID:2428330590988897 | Subject:Software engineering
Abstract/Summary:
Big data and its processing technologies have become one of the most important fields in computer science and industry. Under this new knowledge-acquisition paradigm, commercial companies and academic research organizations treat the ability to obtain and process large amounts of data as a core competitive advantage. However, big data faces many challenges: as data volumes grow non-linearly, traditional database technology can no longer meet the requirements. Google's MapReduce framework, proposed in 2004, became a milestone technology for big data, but it has been losing its advantages for new applications in recent years. The Spark framework, based on the RDD programming model proposed by Matei Zaharia in 2011, now plays an important role in batch, iterative, and streaming computation. However, data skew affects both of these frameworks, and in practice only a few systems, such as HBase and Pig, address it.

This thesis examines the principles of RDDs and Spark's execution flow, together with traditional skew-handling techniques in MapReduce, and analyzes their shortcomings in processing skewed data. Based on this analysis, it improves Spark's performance in data-skew scenarios with only small changes to the Spark code base. Spark schedules work at the granularity of tasks, and the amount of data a task processes is fixed once it is launched. Nevertheless, the well-layered architecture of Spark's implementation leaves considerable room for optimization under data skew.

The core idea of SASM (Spark Adaptive Skew Mitigation) is to lower the scheduling granularity from tasks to file blocks, so that data can be moved between tasks to balance their load. To avoid extra disk I/O before scheduling, SASM computes its scheduling strategy from the metadata of file blocks collected at task runtime; reduce tasks then accept blocks transferred by both Spark and SASM. The scheduling system is built on an asynchronous message model, which reduces blocking synchronization and overall communication overhead, at the cost of having to validate messages. By introducing a collector that measures computation speed and network transmission speed, SASM also makes scheduling fairer. Other optimizations, including an executor pool and a fetch stage, are discussed as well; together they minimize SASM's overhead and improve its effectiveness.

Comprehensive experiments are designed and performed to understand the behavior, overhead, and performance of every aspect of SASM. Analysis of the experimental data and performance statistics shows that SASM outperforms Spark under data skew but underperforms it when there is no skew, because of the blocking synchronization introduced in some types of tasks. The scheduling algorithm itself adds little overhead, and SASM remains effective when computation and network transmission are imbalanced. SASM requires only slight changes to the original Spark code base and demonstrates that a simple change can provide significant benefits. Careful collection and analysis of program performance statistics, together with comprehensive bottleneck inspection and optimization, is key to the success of a high-performance system.
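The abstract gives no code, so the following is a minimal sketch in Scala of the block-level balancing idea it describes: assigning shuffle file blocks to reduce tasks by size, using only block metadata so no disk I/O is needed before scheduling. All names here (BlockMeta, BlockScheduler, assign) are hypothetical, and the greedy largest-block-first heuristic is an assumption for illustration; the actual SASM scheduler also accounts for measured computation and network speeds.

```scala
// Hypothetical sketch, not the SASM implementation: balance reduce-task
// load by assigning shuffle blocks, largest first, to the currently
// least-loaded task, using block-size metadata only.
final case class BlockMeta(blockId: String, sizeBytes: Long)

object BlockScheduler {
  def assign(blocks: Seq[BlockMeta], numTasks: Int): Map[Int, Seq[BlockMeta]] = {
    val loads = Array.fill(numTasks)(0L)                      // bytes assigned per task
    val plan  = Array.fill(numTasks)(Vector.empty[BlockMeta]) // blocks assigned per task
    for (b <- blocks.sortBy(-_.sizeBytes)) {                  // largest blocks first
      val t = loads.indices.minBy(i => loads(i))              // least-loaded task so far
      plan(t) :+= b
      loads(t) += b.sizeBytes
    }
    plan.zipWithIndex.map { case (bs, i) => i -> bs }.toMap
  }
}

object Demo extends App {
  // One large block plus three small ones (sizes in arbitrary units),
  // distributed over two reduce tasks:
  val blocks = Seq(BlockMeta("b0", 900), BlockMeta("b1", 100),
                   BlockMeta("b2", 100), BlockMeta("b3", 100))
  println(BlockScheduler.assign(blocks, 2))
  // Task 0 gets b0 (load 900); task 1 gets b1, b2, b3 (load 300) --
  // far more balanced than a fixed partition that could put all 1200
  // units behind a single task.
}
```

This greedy largest-first assignment is the classic longest-processing-time heuristic, chosen here only because it makes the benefit of block-granularity scheduling visible in a few lines; it is not claimed to match the thesis's algorithm.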
Keywords/Search Tags:MapReduce, RDD, Spark, Task Scheduling, Adaptive, Data Skew