Algorithm To Deal With The Problem Of Data Skew In MapReduce Model

Posted on:2018-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:G Z Li

Full Text:PDF

GTID:2348330512989816

Subject:Software engineering

Abstract/Summary:

In the age of big data, electronic devices of all kinds (personal computers, smartphones, servers, and so on) generate large quantities of varied data every day. MapReduce is a well-known programming model for processing big data on large-scale clusters, valued for its scalability, availability, and reliability. However, handling big data also poses great challenges for the MapReduce programming model. One key challenge is reducing cluster processing time by minimizing the makespan. Researchers have found that data skew is one of the reasons the total makespan grows much longer. Although several research teams have proposed solutions from different perspectives, all of these methods are based on Hadoop 1.X, and there is no reliable solution for the Hadoop 2.X (Hadoop-Yarn) platform.

In view of these issues, this thesis studies the problem of data skew in the MapReduce model. It analyzes in depth the shortcomings of the existing solutions and investigates in detail how to reduce the makespan of a Hadoop cluster by reducing the degree of data skew. The main contents are as follows.

Firstly, six algorithms (Hadoop default speculative execution, SkewReduce, SkewTune, iShuffle, LEEN, and LIBRA) are analyzed and compared in terms of architecture and key features, core algorithms, performance metrics, and evaluation methods, in order to understand the latest research on the data skew problem.

Secondly, for the data skew problem in a batch of Hadoop jobs, the offline HScheduler algorithm, based on RJA (Revised Johnson Algorithm), and the online OnHScheduler algorithm are proposed. They cut down the impact of data skew on the total makespan by balancing resources and dynamically adjusting Hadoop resource allocation. The reliability of the algorithms is established by calculating their competitive ratios.

Next, to address data skew within a single Hadoop job, the thesis proposes YarnTune, a method based on Hadoop-Yarn. The method first detects data skew in the tasks and then eliminates its effect, which reduces the total makespan and improves the overall performance of Hadoop.

Finally, experiments are run on a Hadoop platform, and the results of the original system are compared with those of the proposed algorithms. The results verify that the algorithms and design in this thesis can effectively reduce the influence of data skew on the total makespan.
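
The abstract does not describe how RJA revises Johnson's algorithm, so the sketch below shows only the classical Johnson's rule for a two-stage flow shop, which is the likely starting point. It assumes, purely for illustration, that each job has one map stage followed by one reduce stage and that each stage serves one job at a time. The job names, the durations, and the helpers `johnson_order` and `makespan` are hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple

# (job name, map-stage time, reduce-stage time); hypothetical values
Job = Tuple[str, float, float]


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Classical Johnson's rule for two stages (not the thesis's RJA)."""
    # Jobs whose map stage is no longer than their reduce stage run first,
    # in ascending order of map time.
    front = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    # The remaining jobs run last, in descending order of reduce time.
    back = sorted((j for j in jobs if j[1] > j[2]),
                  key=lambda j: j[2], reverse=True)
    return front + back


def makespan(order: List[Job]) -> float:
    """Finish time of the last reduce stage when jobs run in this order."""
    map_done = 0.0
    reduce_done = 0.0
    for _, map_time, reduce_time in order:
        map_done += map_time
        # A reduce stage starts once its own map stage has finished
        # and the previous job's reduce stage has released the resource.
        reduce_done = max(reduce_done, map_done) + reduce_time
    return reduce_done


jobs = [("A", 3, 6), ("B", 7, 2), ("C", 4, 4), ("D", 5, 3)]
ordered = johnson_order(jobs)
print([name for name, _, _ in ordered])
print("submission order makespan:", makespan(jobs))
print("Johnson order makespan:", makespan(ordered))
```

In this toy batch, reordering the jobs shortens the makespan without changing any job's running time, which is the kind of effect an offline batch scheduler can exploit. A skewed job shows up here simply as a stage with an unusually long duration.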

Keywords/Search Tags:

MapReduce, data skew problem, minimized makespan, HScheduler Algorithm, YarnTune

Related items

1	Research And Strategy On Data Skew Problem Based On MapReduce
2	Load Balancing Algorithm Based On Data Skew Of MapReduce
3	The Research Of Handling Data Skew In MapReduce Computing Model
4	Research Of Join Algorithm With Skew Data On MapReduce
5	The Research Of Skew With Sampling Technique In MapReduce
6	Research On Resource-aware Skew Mitigation For MapReduce
7	Research Of MapReduce Data Skew And Task Scheduling In Heterogeneous Environments
8	Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce
9	Two Types Of MapReduce Job Sorting Problems In The Same Order
10	Research And Optimization Of Join Algorithm Based On MapReduce