The Research Of Skew With Sampling Technique In MapReduce

Posted on:2014-02-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Geng

Full Text:PDF

GTID:2248330398452104

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

With the rapid development of the Internet, information’s explosive growth produces big data everyday. It is a huge challenge for storage and analysis of big data. In recent years, cloud computing which is a new computing model has attracted much attention since its birth. Some giant IT service providers have made the cloud computing as the primary development strategy. They have also proposed their own platforms and services, moreover, there have been many remarkable achievements.MapReduce, as one of the large-scale and parallel processing models, has been widely applied in cloud computing environment. With the characteristics of easy to use, high scalability and fault tolerance, MapReduce has been applied to many fields. However, it has also some problems. It can not handle skewed data effectively. When data MapRedcue is handling is unevenly distributed, it can cause some tasks running slower than other tasks. Since the execution time of the MapReduce job is determined by the slowest task, the execution time of it increases which means system performance degradation. In this paper the skew problem in MapReduce is studied and a processing method is developed.In this paper, we address the problem of how to efficiently partition the MapReduce intermediate keys to balance the workload of all reducers when skewed data exist. The main jobs in this paper are as follows. Firstly, Count frequency distribution of all the keys in the input file. However, it causes overhead if we count all the data of input file, so this paper uses a sampling technique to estimate the distribution of the key frequency, estimate the overall distribution and then make a partition scheme in advance. We make the operation of counting frequencies distribution to a MapReduce job. And the theoretical analysis of the sampling is given in this paper. It proves that samples can be instead of the source input files for computing the frequency of all keys. Secondly, according to the calculated distribution of all keys frequencies, two kinds of partition schemes are proposed includes cluster combination and cluster split. The former is more efficient when skew degree of data is smaller and the latter is more efficient when skew degree of data is larger. Finally experiments proved that using the sampling technique to process a small part of the data could quickly produce the approximate key distribution, and the two partition methods can achieve rapid execution time and make the Reducers have a good load balancing.

Keywords/Search Tags:

MapReduce, Data Skew, Sampling, Load Balancing

PDF Full Text Request

Related items

1	The Research Of Skew With Sampling Technique In MapReduce
2	Research On Lightweight Load Balancing Under Mapreduce
3	Research And Strategy On Data Skew Problem Based On MapReduce
4	Load Balancing Algorithm Based On Data Skew Of MapReduce
5	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
6	A Research Of Load Balancing Algorithms For Data Skew In Spark
7	Join Algorithm Research Based On MapReduce
8	Key Value Based Algorithm For Solving Reduce Load Imbalance In Mapreduce
9	Load Balancing In Map Reduce Based On Maxdiff Histogram
10	Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop