Research Of Join Algorithm With Skew Data On Mapreduce

Posted on:2017-01-25

Degree:Master

Type:Thesis

Country:China

Candidate:K Liu

Full Text:PDF

GTID:2308330482479892

Subject:Computer Science and Technology

Abstract/Summary:

MapReduce, the framework proposed by Google, has become one of the most popular parallel computing frameworks. The join is an important and time-consuming operation in data processing, but MapReduce does not support it well, which makes join algorithms on MapReduce an important research topic in the big data field. Most existing work on join algorithms assumes that the data being joined is evenly distributed, whereas most real-world data is distributed unevenly. When MapReduce processes such skewed data, the running times of the Reduce tasks differ widely, which lowers the utilization of computing resources.

Against this background, this thesis proposes join algorithms for the two-way equi-join based on sampling and data partitioning. It first draws a sample of a given size using reservoir sampling in MapReduce, then computes the I/O cost from the distribution of the sample and divides the data by cluster according to that cost. The thesis then proposes the cluster combination join algorithm, whose core idea is to always choose the most costly remaining cluster and assign it to the Reduce task that currently has the minimum cost. For highly skewed data, it proposes the cluster split combination join algorithm, which splits the large clusters and shares them among all Reduce tasks in order to balance the load across Reduce tasks and improve the efficiency of MapReduce jobs. For the multi-way equi-join, it proposes the range replication join algorithm, which completes a join over skewed data in a single MapReduce job. Experimental results show that the proposed algorithms perform well on skewed data.
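The two steps the abstract describes most concretely, reservoir sampling and the greedy "most costly cluster to the least-loaded Reduce task" assignment, can be illustrated with a minimal single-machine sketch in Python. This is not the thesis's implementation: the function names `reservoir_sample` and `assign_clusters` are hypothetical, the sketch runs outside MapReduce, and the per-cluster cost is assumed to be already available as a number (the thesis derives it from an I/O cost model that the abstract does not spell out). The splitting of large clusters is not covered.

```python
import heapq
import random


def reservoir_sample(stream, k, rng=None):
    """Draw k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep item i with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def assign_clusters(cluster_costs, num_reducers):
    """Greedy assignment: most costly cluster first, to the least-loaded reducer.

    cluster_costs maps a cluster id to an estimated cost (assumed given here).
    Returns (assignment, loads): cluster id -> reducer index, and per-reducer load.
    """
    # Min-heap of (current load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    loads = [0] * num_reducers
    by_cost = sorted(cluster_costs.items(), key=lambda kv: kv[1], reverse=True)
    for cluster, cost in by_cost:
        load, reducer = heapq.heappop(heap)
        assignment[cluster] = reducer
        loads[reducer] = load + cost
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads


if __name__ == "__main__":
    keys = reservoir_sample(range(1000), 10, random.Random(0))
    print(len(keys))
    # Hypothetical cluster costs, for illustration only.
    costs = {"a": 10, "b": 7, "c": 5, "d": 4, "e": 2}
    print(assign_clusters(costs, 2))
```

With the illustrative costs above and two reducers, the greedy rule places clusters `a` and `d` on one reducer and `b`, `c`, and `e` on the other, so both end up with a load of 14. A single cluster whose cost exceeds the average load cannot be balanced this way, which is the case the abstract's cluster split combination algorithm is aimed at.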

Keywords/Search Tags:

Cloud Computing, Massive Data, MapReduce, Join Algorithm, Data Skew

Related items

1	The Research Of Parallel Clustering Algorithm Of Massive Data In Cloud Computing Environment
2	Research And Optimization Of Join Algorithm Based On MapReduce
3	Join Algorithm Research Based On MapReduce
4	Join Method Research Based On MapReduce
5	Design And Implementation Of Similarity Self - Connection Algorithm For Massive Data Sets Based On MapReduce
6	The Research Of Handling Data Skew In MapReduce Computing Model
7	Research And Implementation Of Local Priority Scheduling Algorithm Based On Mapreduce For Massive Data
8	Research On Data Skew In Join Base On Hadoop
9	Performance Optimization And Applications Of MapReduce In Cloud Computing
10	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew