Research on Data Skew in Joins Based on Hadoop

Posted on: 2015-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Wu
Full Text: PDF
GTID: 2268330428499748
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of science and technology, the demand for big data processing keeps growing. Hadoop map/reduce is used more and more widely as a parallel framework for distributed data processing. Map/reduce is an efficient, scalable, highly fault-tolerant parallel programming model, and it is easy to use. The join is an important operator in data processing and has been studied extensively in traditional databases. Because of the way the map/reduce framework is designed, however, it does not support the join operator well. Many join algorithms exist for map/reduce, but most of them handle data skew poorly. Data skew causes data to be distributed unevenly across nodes and reduces the efficiency of a distributed algorithm.

First, this thesis introduces the impact of the data skew problem. Second, we propose departing join for joining two tables. Following the idea of divide and conquer, this algorithm treats skewed data and non-skewed data differently. It combines the traditional join algorithm, broadcast join, and other algorithms to correct the load imbalance caused by data skew. Third, we address data skew in multi-way equi-joins. We use range hash together with a multi-way equi-join algorithm within a map/reduce job to even out the data load and eliminate the influence of data skew. Finally, we conduct a series of experiments on these algorithms. A comparison between our algorithms and traditional algorithms demonstrates their practicality.
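The abstract does not give the details of departing join, so the following is only a minimal single-process sketch of the general idea it describes: send skewed keys down a broadcast-style path and everything else down an ordinary hash-partitioned path. The function names, the frequency `threshold`, the round-robin spreading of skewed keys, and the use of `crc32` as a stand-in partitioner are all assumptions made for illustration, not the thesis's method or a Hadoop API.

```python
from collections import Counter, defaultdict
from zlib import crc32


def reducer_of(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (assumption for this sketch).
    return crc32(str(key).encode()) % num_reducers


def _join_buckets(buckets):
    # Each bucket holds the left and right values that met on one reducer for one key.
    return [(k, lv, rv)
            for (_, k), (lefts, rights) in buckets.items()
            for lv in lefts
            for rv in rights]


def plain_hash_join(left, right, num_reducers):
    """Repartition join: all records sharing a key go to the same reducer."""
    buckets = defaultdict(lambda: ([], []))
    load = Counter()  # records received per reducer
    for side, table in enumerate((left, right)):
        for k, v in table:
            r = reducer_of(k, num_reducers)
            buckets[(r, k)][side].append(v)
            load[r] += 1
    return _join_buckets(buckets), load


def skew_aware_join(left, right, num_reducers, threshold):
    """Treat skewed and non-skewed keys differently (hypothetical sketch).

    Keys appearing more than `threshold` times in `left` are spread
    round-robin over all reducers, and the matching `right` records are
    replicated to every reducer (broadcast-style). Other keys are hash
    partitioned as usual. Skew on the `right` side is not handled here.
    """
    freq = Counter(k for k, _ in left)
    skewed = {k for k, n in freq.items() if n > threshold}
    buckets = defaultdict(lambda: ([], []))
    load = Counter()
    next_slot = 0
    for k, v in left:
        if k in skewed:
            r = next_slot % num_reducers
            next_slot += 1
        else:
            r = reducer_of(k, num_reducers)
        buckets[(r, k)][0].append(v)
        load[r] += 1
    for k, v in right:
        targets = range(num_reducers) if k in skewed else [reducer_of(k, num_reducers)]
        for r in targets:
            buckets[(r, k)][1].append(v)
            load[r] += 1
    return _join_buckets(buckets), load


if __name__ == "__main__":
    # One hot key dominates the left table.
    left = [("hot", i) for i in range(1000)] + [("k%d" % i, i) for i in range(100)]
    right = [("hot", "H")] + [("k%d" % i, "R%d" % i) for i in range(100)]
    _, plain_load = plain_hash_join(left, right, 4)
    _, aware_load = skew_aware_join(left, right, 4, threshold=50)
    print("plain max load:", max(plain_load.values()))
    print("skew-aware max load:", max(aware_load.values()))
```

Both functions return the same join result; they differ only in how many records each simulated reducer receives. With one dominant key, the plain version sends every record for that key to a single reducer, while the skew-aware version spreads them out at the cost of replicating the small matching side.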
Keywords/Search Tags: big data, map/reduce, join algorithm, data skew