The Research Of Load Balancing In Mapreduce Based On Sampling Estimation

Posted on:2015-03-21

Degree:Master

Type:Thesis

Country:China

Candidate:H F Li

Full Text:PDF

GTID:2268330428481798

Subject:Software engineering

Abstract/Summary:

Cloud computing owes its emergence and current maturity to practical demand and to the development of Internet technology. First, the rapid spread of the Internet into every industry has caused explosive growth in data volume. Statistics from the International Data Corporation show that the global data volume was 1.3 ZB in 2010 and grew by about 0.6 ZB the following year, equivalent to more than 200 GB of data per person worldwide, and growth has accelerated since. Data volumes have moved beyond the terabyte scale, and even larger scales are coming, so how to store and process such massive data has become a major problem. Second, because the cloud has an overwhelming cost advantage, companies treat cloud computing as a key strategic technology, which has further accelerated its development.
MapReduce has proven to be an effective and powerful parallel processing model in this environment. With it, even programmers who have never written parallel programs can use parallel processing effectively. MapReduce still has a shortcoming, however: data skew is common when jobs run. When the data in a large-scale data set is unevenly distributed, the load across the running nodes becomes unbalanced and individual tasks become "laggards" (stragglers), which degrades overall system performance and extends the running time of the whole job.
The research question of this paper is how to handle data skew efficiently in the Reduce stage of a running MapReduce program. Some existing methods are heterogeneous, which reduces the synchronization of MapReduce, so this paper adopts a "preprocess first, partition later" strategy to balance the load across Reducers.
The approach first uses a two-level sampling technique to estimate the key distribution of the data set and then works out an allocation strategy in advance based on that distribution, which addresses the shortcomings of the default hash partitioning. Two partitioning strategies are used: small-cluster combination and big-cluster segmentation. Small-cluster combination is suited to cases where data skew is mild, while big-cluster segmentation performs notably well when data skew is severe. Experiments show that load balancing in MapReduce based on two-level sampling achieves a better-balanced Reduce stage and thus improves system performance.
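The "estimate first, partition later" idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not describe the two-level sampling scheme, so a single-level uniform sample stands in for it, and the function names (`estimate_key_counts`, `combine_clusters`, `segment_clusters`), the greedy largest-first placement, and the "ideal share" splitting threshold are all assumptions made for illustration. A "cluster" here means all records sharing one key.

```python
import heapq
import math
import random
from collections import Counter


def estimate_key_counts(keys, sample_rate, seed=0):
    """Estimate per-key record counts from a uniform random sample.

    Assumption: single-level sampling stands in for the thesis's
    two-level scheme, whose details the abstract does not give.
    """
    rng = random.Random(seed)
    sample = Counter(k for k in keys if rng.random() < sample_rate)
    # Scale sampled counts back up to the full data set.
    return {k: c / sample_rate for k, c in sample.items()}


def combine_clusters(estimates, num_reducers):
    """Small-cluster combination (hypothetical greedy variant).

    Places clusters largest-first on the currently least-loaded reducer.
    Returns (cluster -> reducer index, estimated load per reducer).
    """
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    loads = [0.0] * num_reducers
    ordered = sorted(estimates.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, size in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        loads[reducer] = load + size
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads


def segment_clusters(estimates, num_reducers):
    """Big-cluster segmentation (hypothetical variant).

    Splits any cluster larger than the ideal per-reducer share into
    equal pieces keyed by (key, part), then places the pieces greedily.
    """
    ideal = sum(estimates.values()) / num_reducers
    pieces = {}
    for key, size in estimates.items():
        parts = max(1, math.ceil(size / ideal))
        for part in range(parts):
            pieces[(key, part)] = size / parts
    return combine_clusters(pieces, num_reducers)


if __name__ == "__main__":
    keys = ["a"] * 90 + ["b"] * 5 + ["c"] * 5  # one heavily skewed key
    estimates = estimate_key_counts(keys, sample_rate=1.0)
    print(combine_clusters(estimates, 2)[1])
    print(segment_clusters(estimates, 2)[1])
```

In this sketch, combination alone cannot balance the load when a single cluster exceeds one reducer's fair share, which is the situation segmentation targets. Splitting one key across reducers is only valid when the partial reduce results can be merged afterwards, for example with an associative aggregation.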

Keywords/Search Tags:

Cloud Computing, MapReduce, Sampling, Data Skew, Load Balancing

Related items

1	The Research Of Skew With Sampling Technique In MapReduce
2	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
3	Research On Lightweight Load Balancing Under Mapreduce
4	Research And Strategy On Data Skew Problem Based On MapReduce
5	Load Balancing Algorithm Based On Data Skew Of MapReduce
6	The Research Of Load Balancing In Mapreduce Based On Data Locality
7	Research And Implementation Of Local Priority Scheduling Algorithm Based On Mapreduce For Massive Data
8	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
9	Research On Load Balancing In The Construction Of Cloud Computing Data Center
10	Research And Implementation Of A Load Balancing Technology Based On Data Correlation In Cloud Computing