Font Size: a A A

The Research Of Skew With Sampling Technique In MapReduce

Posted on:2014-02-17Degree:MasterType:Thesis
Country:ChinaCandidate:Y J GengFull Text:PDF
GTID:2248330398452104Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of the Internet, information’s explosive growth produces big data everyday. It is a huge challenge for storage and analysis of big data. In recent years, cloud computing which is a new computing model has attracted much attention since its birth. Some giant IT service providers have made the cloud computing as the primary development strategy. They have also proposed their own platforms and services, moreover, there have been many remarkable achievements.MapReduce, as one of the large-scale and parallel processing models, has been widely applied in cloud computing environment. With the characteristics of easy to use, high scalability and fault tolerance, MapReduce has been applied to many fields. However, it has also some problems. It can not handle skewed data effectively. When data MapRedcue is handling is unevenly distributed, it can cause some tasks running slower than other tasks. Since the execution time of the MapReduce job is determined by the slowest task, the execution time of it increases which means system performance degradation. In this paper the skew problem in MapReduce is studied and a processing method is developed.In this paper, we address the problem of how to efficiently partition the MapReduce intermediate keys to balance the workload of all reducers when skewed data exist. The main jobs in this paper are as follows. Firstly, Count frequency distribution of all the keys in the input file. However, it causes overhead if we count all the data of input file, so this paper uses a sampling technique to estimate the distribution of the key frequency, estimate the overall distribution and then make a partition scheme in advance. We make the operation of counting frequencies distribution to a MapReduce job. And the theoretical analysis of the sampling is given in this paper. It proves that samples can be instead of the source input files for computing the frequency of all keys. Secondly, according to the calculated distribution of all keys frequencies, two kinds of partition schemes are proposed includes cluster combination and cluster split. The former is more efficient when skew degree of data is smaller and the latter is more efficient when skew degree of data is larger. Finally experiments proved that using the sampling technique to process a small part of the data could quickly produce the approximate key distribution, and the two partition methods can achieve rapid execution time and make the Reducers have a good load balancing.
Keywords/Search Tags:MapReduce, Data Skew, Sampling, Load Balancing
PDF Full Text Request
Related items