
Load Balancing In Map Reduce Based On Maxdiff Histogram

Posted on: 2016-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: D D Zhang
Full Text: PDF
GTID: 2308330461951558
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of information technology, the Internet, social networks, cyber-physical systems, and related technologies have gradually matured, while emerging technologies such as the Internet of Things, hybrid cloud computing, and natural language question answering are advancing rapidly. The volume of data accumulated in scientific research, electronic commerce, health care, and other fields now exceeds the terabyte scale, and big data exists in every field. Because the performance of traditional data processing technology is difficult to extend further, it cannot meet the needs of big data analysis. Batch processing models, such as the most popular computation architecture, MapReduce, are therefore gaining increasing attention from academia and government.

The MapReduce computation model fully utilizes distributed computing and storage resources: it distributes data splits and tasks to thousands of low-cost physical nodes, providing massive storage capacity and parallel computing capability. As a distributed computing framework for processing massive data on large-scale clusters, MapReduce offers good performance and high reliability and is cost-effective. It has therefore attracted much attention in big data environments. Load balancing, a key factor affecting cluster throughput, has long been an active research topic in academia.

The uniformity of the data distribution significantly influences the performance of MapReduce in a distributed computing environment. However, the current MapReduce implementation assigns data to the Reduce phase by hash partitioning, which leaves the Reduce nodes unevenly loaded when the keys are skewed and thus lowers throughput.

In this thesis, we propose MHLB, a load balancing method based on the Maxdiff histogram. First, MHLB uses a preprocessing step to estimate the distribution of the intermediate results with a Maxdiff histogram. Then MHLB applies an improved partitioning method based on a greedy strategy, so that data items are balanced after the shuffle phase. The experimental results show that, in homogeneous resource environments, this method improves the load balance of the cluster and shortens job duration.
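
The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's MHLB implementation, whose details the abstract does not give. It assumes that a Maxdiff histogram places bucket boundaries at the largest differences between adjacent key frequencies, and that the greedy step assigns the largest remaining bucket to the currently least-loaded reducer. The function names `maxdiff_buckets` and `greedy_partition` and the sample counts are hypothetical.

```python
import heapq


def maxdiff_buckets(freqs, num_buckets):
    """Group (key, count) pairs, given in key order, into at most num_buckets buckets.

    Boundaries go where the difference between adjacent counts is largest
    (an assumed reading of the Maxdiff rule, for illustration only).
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    if not freqs:
        return []
    # gaps[i] is the count difference between item i and item i + 1.
    gaps = [abs(freqs[i + 1][1] - freqs[i][1]) for i in range(len(freqs) - 1)]
    # Positions of the (num_buckets - 1) largest gaps, restored to key order.
    largest = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    cut_after = sorted(largest[:num_buckets - 1])
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(freqs[start:i + 1])
        start = i + 1
    buckets.append(freqs[start:])
    return buckets


def greedy_partition(buckets, num_reducers):
    """Assign buckets to reducers, largest first, always to the lightest reducer."""
    sizes = [sum(count for _, count in bucket) for bucket in buckets]
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for idx in sorted(range(len(buckets)), key=lambda i: sizes[i], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[idx] = reducer
        heapq.heappush(heap, (load + sizes[idx], reducer))
    loads = [0] * num_reducers
    for idx, reducer in assignment.items():
        loads[reducer] += sizes[idx]
    return assignment, loads


# Hypothetical intermediate-key counts with a few heavy keys.
sample = [("a", 40), ("b", 5), ("c", 4), ("d", 38), ("e", 6), ("f", 35), ("g", 3)]
buckets = maxdiff_buckets(sample, 5)
assignment, loads = greedy_partition(buckets, 3)
print(loads)
```

In this sketch whole buckets are the unit of assignment, so the balance that can be reached depends on how many buckets the histogram uses relative to the number of reducers.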
Keywords/Search Tags:MapReduce, Data Skew, Histogram, Data Partition