
Research on Load Balancing in MapReduce Based on Sampling Estimation

Posted on: 2015-03-21  Degree: Master  Type: Thesis
Country: China  Candidate: H F Li  Full Text: PDF
GTID: 2268330428481798  Subject: Software engineering
Abstract/Summary:
The progress of cloud computing from its emergence to its current maturity can be attributed to real-world demand and to the development of Internet technology. First, the rapid spread of the Internet into every walk of life has caused explosive growth in the amount of data. Statistics from the International Data Corporation showed that the global data volume was 1.3 ZB in 2010 and grew by about 0.6 ZB in the following year, equivalent to more than 200 GB of data generated per person, and growth has been even faster since then. Data volumes have moved beyond the terabyte era, and still larger scales are coming. How to store and process such massive data is therefore a major problem. Second, because the cloud has an overwhelming cost advantage, companies treat cloud computing as a key strategic technology, which has also made its development extremely fast.
MapReduce has proven to be an effective and powerful parallel processing method in this setting. With this model, even programmers who have never written parallel programs can use it well. MapReduce still has a shortcoming, however: data skew is common when jobs run. When the data in a large-scale data set is unevenly distributed, the load on the running nodes becomes unbalanced and individual tasks become "laggards" (stragglers), which degrades the performance of the whole system and lengthens the running time of the whole job.
The research question of this thesis is how to handle data skew efficiently in the Reduce stage of a running MapReduce program. Some existing methods are heterogeneous, which reduces the synchronization of MapReduce, so this thesis mainly adopts a "preprocess first, partition later" strategy to achieve load balancing across Reducers.
First, a two-level sampling technique is used to estimate the key distribution of the data set, and an allocation strategy is worked out in advance based on this distribution, which remedies the inadequacy of the default hash partitioning. Two partitioning strategies are used: small-cluster combination and big-cluster segmentation. Small-cluster combination suits cases where data skew is not severe, while big-cluster segmentation performs notably well when data skew is severe. Experiments show that load balancing in MapReduce based on two-level sampling achieves better load balance in the Reduce stage and thus improves system performance.
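The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's implementation. It assumes that "two-level sampling" means sampling input splits first and then records within each chosen split, that small-cluster combination can be approximated by greedy largest-first assignment of whole key clusters to the least-loaded reducer, and that big-cluster segmentation splits any key cluster larger than the ideal per-reducer load into equal pieces. The function names `two_level_sample` and `plan_partitions` and all parameters are hypothetical.

```python
import math
import random
from collections import Counter


def two_level_sample(splits, split_fraction=0.5, record_fraction=0.1, seed=0):
    """Estimate key counts for the whole data set from a two-level sample.

    Level 1 samples input splits; level 2 samples records inside each chosen
    split. Assumes every split is a non-empty list of keys (hypothetical
    input model, not taken from the thesis).
    """
    rng = random.Random(seed)
    n_splits = max(1, round(len(splits) * split_fraction))
    chosen = rng.sample(splits, n_splits)          # level 1: pick splits
    counts, sampled = Counter(), 0
    for split in chosen:
        k = max(1, round(len(split) * record_fraction))
        counts.update(rng.sample(split, k))        # level 2: pick records
        sampled += k
    total = sum(len(s) for s in splits)
    scale = total / sampled                        # scale sample up to full size
    return {key: c * scale for key, c in counts.items()}


def plan_partitions(estimates, num_reducers, split_big=False):
    """Assign key clusters to reducers before the job runs.

    split_big=False: keep each cluster whole and combine clusters on the
    least-loaded reducer (assumed form of small-cluster combination).
    split_big=True: additionally cut clusters larger than the ideal load into
    equal pieces (assumed form of big-cluster segmentation). Splitting a key
    across reducers would need an extra merge step, which is not shown here.
    """
    target = sum(estimates.values()) / num_reducers   # ideal load per reducer
    loads = [0.0] * num_reducers
    plan = {}
    by_size = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in by_size:
        pieces = math.ceil(size / target) if split_big and size > target else 1
        plan[key] = []
        for _ in range(pieces):
            r = loads.index(min(loads))               # least-loaded reducer
            loads[r] += size / pieces
            plan[key].append((r, size / pieces))
    return plan, loads
```

With estimated cluster sizes `{"a": 80, "b": 10, "c": 10}` and two reducers, keeping clusters whole leaves the loads at 80 and 20, while allowing the large cluster to be split brings both reducers to 50. This illustrates why a combination-only strategy is adequate for mild skew but not for severe skew.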
Keywords/Search Tags:Cloud Computing, MapReduce, Sampling, Data Skew, Load Balancing