Research of Join Algorithm with Skew Data on MapReduce

Posted on: 2017-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Liu
GTID: 2308330482479892
Subject: Computer Science and Technology
Abstract/Summary:
The MapReduce framework proposed by Google has become one of the most popular parallel computing frameworks. The join is an important operation in data processing, and also a time-consuming one. MapReduce does not support joins well, so join algorithms on MapReduce are an important research topic in the big data field. Most existing work on join algorithms assumes that the data to be joined is evenly distributed, but most real-world data is skewed. When MapReduce processes skewed data, the running times of the Reduce tasks differ widely, which lowers the utilization of computing resources.

Against this background, this thesis proposes join algorithms for the two-way equi-join based on sampling and data partitioning. It first draws a sample of a given size using reservoir sampling in MapReduce, then computes the I/O cost from the distribution of the sample and partitions the data of each cluster according to that cost. The thesis proposes the cluster combination join algorithm, whose core idea is to repeatedly take the most costly remaining cluster and assign it to the Reduce task that currently has the minimum cost. For highly skewed data, it proposes the cluster split combination join algorithm, which splits large clusters and shares them among all Reduce tasks, balancing the load across Reduce tasks and improving the efficiency of MapReduce jobs. For the multi-way equi-join, it proposes the range replication join algorithm, which completes a join over skewed data in a single MapReduce job. Experimental results show that the proposed algorithms perform well on skewed data.
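As a rough illustration of two ideas named in the abstract, the Python sketch below shows single-machine reservoir sampling and the greedy rule of giving the most costly cluster to the currently least-loaded Reduce task. It is a sketch under stated assumptions, not the thesis's implementation: the function names `reservoir_sample` and `assign_clusters` are hypothetical, the per-cluster costs are taken as given numbers because the abstract does not spell out its I/O cost model, and nothing here runs on MapReduce.

```python
import heapq
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = item
    return sample


def assign_clusters(costs, num_reducers):
    """Greedy assignment: most costly cluster -> least-loaded reducer.

    costs: dict mapping a cluster id to its estimated cost
           (assumed to be given; the cost model is not specified here).
    Returns (assignment, loads), both keyed by reducer index.
    """
    # Min-heap of (current load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}

    # Visit clusters from most to least costly.
    for cluster, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, r = heapq.heappop(heap)
        assignment[r].append(cluster)
        heapq.heappush(heap, (load + cost, r))

    loads = {r: sum(costs[c] for c in cs) for r, cs in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    sample = reservoir_sample(range(1000), 10, random.Random(0))
    print(len(sample))
    assignment, loads = assign_clusters({"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}, 2)
    print(assignment, loads)
```

With this rule, a single cluster whose cost exceeds the average load per Reduce task cannot be balanced by assignment alone, which is the situation the cluster split combination algorithm is described as addressing.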
Keywords/Search Tags: Cloud Computing, Massive Data, MapReduce, Join Algorithm, Data Skew