Research On Data Skew In Join Based On Hadoop

Posted on:2015-01-17

Degree:Master

Type:Thesis

Country:China

Candidate:L Wu

Full Text:PDF

GTID:2268330428499748

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of science and technology, the demand for big data processing keeps growing. Hadoop map/reduce is increasingly used as a parallel framework for distributed data processing. Map/reduce is an efficient, scalable, and highly fault-tolerant parallel programming model, and it is easy to use. The join is an important operator in data processing and has been studied extensively in traditional databases. Because of the design of the map/reduce framework, however, it cannot support the join operator well. Many join algorithms exist for map/reduce, but most of them handle data skew poorly. Data skew leads to an uneven distribution of data across nodes and reduces the efficiency of the distributed algorithm.

First, this thesis describes the impact of data skew. Second, we propose departing join for joining two tables. Following the idea of divide and conquer, this algorithm treats skewed data and non-skewed data differently. It combines the traditional join algorithm, broadcast join, and other algorithms to fix the load imbalance caused by data skew. We then address data skew in multi-way equi-joins: we use range hash together with a multi-way equi-join algorithm within a map/reduce job to even out the data load and eliminate the influence of data skew. Finally, we conduct a series of experiments on these algorithms. Comparison with traditional algorithms demonstrates the practicality of our algorithms.
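The abstract does not give the details of departing join, so the following is only a minimal sketch of the general idea it describes: rows with heavy keys are handled separately from the rest. It simulates reducers in a single Python process rather than on Hadoop. The function names, the frequency `threshold`, the round-robin scattering of heavy-key rows, and the use of integer keys with `key % n_reducers` as the hash are all assumptions made for illustration, not the thesis's method.

```python
from collections import Counter
from itertools import count


def plain_partition_loads(rows, n_reducers):
    """Rows per reducer when every row is routed by key % n_reducers."""
    loads = [0] * n_reducers
    for key, _ in rows:
        loads[key % n_reducers] += 1
    return loads


def skew_aware_join(r_rows, s_rows, n_reducers, threshold):
    """Equi-join R and S on integer keys, treating heavy keys of R separately.

    Hypothetical scheme (an assumption, not the thesis's algorithm):
      * a key is "skewed" if it occurs more than `threshold` times in R;
      * R rows with a skewed key are scattered round-robin over reducers;
      * S rows with a skewed key are broadcast to every reducer;
      * all other rows use ordinary hash partitioning.
    Returns the joined rows and the number of R rows each reducer received.
    """
    freq = Counter(key for key, _ in r_rows)
    skewed = {key for key, n in freq.items() if n > threshold}

    reducers = [{"r": [], "s": []} for _ in range(n_reducers)]
    round_robin = count()

    for key, value in r_rows:
        if key in skewed:
            target = next(round_robin) % n_reducers
        else:
            target = key % n_reducers
        reducers[target]["r"].append((key, value))

    for key, value in s_rows:
        if key in skewed:
            for reducer in reducers:  # broadcast the matching S rows
                reducer["s"].append((key, value))
        else:
            reducers[key % n_reducers]["s"].append((key, value))

    output, loads = [], []
    for reducer in reducers:
        # Local hash join inside one simulated reducer.
        index = {}
        for key, s_value in reducer["s"]:
            index.setdefault(key, []).append(s_value)
        for key, r_value in reducer["r"]:
            for s_value in index.get(key, ()):
                output.append((key, r_value, s_value))
        loads.append(len(reducer["r"]))
    return output, loads


if __name__ == "__main__":
    # Key 0 dominates table R; keys 1..10 appear once each.
    r_rows = [(0, f"r0_{i}") for i in range(90)] + [(k, f"r{k}") for k in range(1, 11)]
    s_rows = [(k, f"s{k}") for k in range(11)]
    print("hash-only loads: ", plain_partition_loads(r_rows, 4))
    print("skew-aware loads:", skew_aware_join(r_rows, s_rows, 4, threshold=10)[1])
```

Because each heavy-key row of R lands on exactly one reducer while the matching rows of S are present on all of them, every matching pair is still produced exactly once. The cost is the extra copies of the broadcast S rows, which is why a scheme like this only pays off when the skewed side of S is small.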

Keywords/Search Tags:

big data, map/reduce, join algorithm, data skew

Related items

1	Join Query Optimization For Large-Scale Data Based On New Computing Architecture
2	Research Of Join Algorithm With Skew Data On Mapreduce
3	Research On Optimal Reduce Placement Algorithm Based On Data Skew
4	Optimization And Research On Reduce Task Scheduling Strategy And Data Skew On Hadoop
5	Research And Implementation Of Skew Join Optimization Technology On MyCat
6	Research On Some Key Technologies Of Parallel Processing For Big Data Based On Map Reduce
7	Research And Optimization Of Join Algorithm Based On MapReduce
8	Join Algorithm Research Based On MapReduce
9	Research And Implementation Of Multi-Way Join Framework Based On Map-Reduce
10	Research On Optimization For Multi-way Join In A Map-Reduce Environment