Font Size: a A A

Design And Optimize Big-Data Join Algorithms Using MapReduce

Posted on:2015-01-21Degree:DoctorType:Dissertation
Country:ChinaCandidate:C C ZhangFull Text:PDF
GTID:1268330428484368Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In recent years, with the wide popularity of Internet technology, represented by blog social networks and other new applications has been widely used, at the same time, along with the rapid development of cloud computing technology, the data in the Internet is growing at an unprecedented speed and accumulation, big data has entered our life.MapReduce introduced by Google in2004has already been widely used in the field of big data processing. Big Internet companies such as Yahoo!, Face-book, Amazon, are using MapReduce to deal with big data related problems. At the same time, the academia has made tremendous contributions to the relevant MapReduce algorithms and effectively promotes the development of the MapRe-duce.In this paper, in the related fields of study and summary on the basis of exist-ing achievements, around the optimization of the join algorithm using MapReduce, we mainly carried out the following research work:Firstly, this paper presents an efficient algorithm building a Maxdiff his-togram using MapReduce, including the exact and approximate algorithms. Maxd-iff histogram can accurately evaluate the data distribution, such as it can provide data skew situation or the join selectivity of the datasets, which is a basic work for later optimization of the join algorithms using MapReduce.Secondly, this paper presents a join algorithm based on BloomFilter, whose core idea is to use BloomFilter to decrease the network overhead between the map and reduce phases so as to improve the efficiency. First of all, an efficient algorithm building a BloomFilter for a dataset is proposed; secondly, join algorithms based on BloomFilter are proposed, including two-way and multi-way equi-joins; lastly, based on the disk I/O and network I/O, the cost model of join algorithms are established, in order to select the optimal join plan.Thirdly, this paper presents two-way and multi-way joins algorithms for the skew data. Aiming at two-way joins, we optimize the join efficiency of skew data, which contains one or several data appearing excessively. Aiming at multi-way joins, we use range partition to optimize the join efficiency in a single MapReduce job.Lastly, this paper proposes an random algorithm SEJ for implementation of multiway theta-joins in a single MapReduce job. To minimize the communication cost between map and reduce phases, Lagrangian method is used to compute the approximate fragments of each relation. When input datasets are skew, compared with other existing algorithms, SEJ is more stable and efficient due to its even distribution of datasets. Moreover, a dynamic programming algorithm is designed to partition multi-way theta-joins into subgroups and select a best MapReduce implementation by combining these subgroups for this problem.This paper mainly focuses on the optimization of the join algorithms using MapReduce, around the equi-join, skew join, and θ join, the algorithms presented can improve the execution efficiency, which also can be used for reference to the subsequent researchers in the future.
Keywords/Search Tags:MapReduce, histogram, optimization of euqi-join algorithm, dataskew, optimization of θ-join algorithm
PDF Full Text Request
Related items