Design And Optimize Big-Data Join Algorithms Using MapReduce

Posted on:2015-01-21

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C C Zhang

Full Text:PDF

GTID:1268330428484368

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

In recent years, with the wide popularity of Internet technology, represented by blog social networks and other new applications has been widely used, at the same time, along with the rapid development of cloud computing technology, the data in the Internet is growing at an unprecedented speed and accumulation, big data has entered our life.MapReduce introduced by Google in2004has already been widely used in the field of big data processing. Big Internet companies such as Yahoo!, Face-book, Amazon, are using MapReduce to deal with big data related problems. At the same time, the academia has made tremendous contributions to the relevant MapReduce algorithms and effectively promotes the development of the MapRe-duce.In this paper, in the related fields of study and summary on the basis of exist-ing achievements, around the optimization of the join algorithm using MapReduce, we mainly carried out the following research work:Firstly, this paper presents an efficient algorithm building a Maxdiff his-togram using MapReduce, including the exact and approximate algorithms. Maxd-iff histogram can accurately evaluate the data distribution, such as it can provide data skew situation or the join selectivity of the datasets, which is a basic work for later optimization of the join algorithms using MapReduce.Secondly, this paper presents a join algorithm based on BloomFilter, whose core idea is to use BloomFilter to decrease the network overhead between the map and reduce phases so as to improve the efficiency. First of all, an efficient algorithm building a BloomFilter for a dataset is proposed; secondly, join algorithms based on BloomFilter are proposed, including two-way and multi-way equi-joins; lastly, based on the disk I/O and network I/O, the cost model of join algorithms are established, in order to select the optimal join plan.Thirdly, this paper presents two-way and multi-way joins algorithms for the skew data. Aiming at two-way joins, we optimize the join efficiency of skew data, which contains one or several data appearing excessively. Aiming at multi-way joins, we use range partition to optimize the join efficiency in a single MapReduce job.Lastly, this paper proposes an random algorithm SEJ for implementation of multiway theta-joins in a single MapReduce job. To minimize the communication cost between map and reduce phases, Lagrangian method is used to compute the approximate fragments of each relation. When input datasets are skew, compared with other existing algorithms, SEJ is more stable and efficient due to its even distribution of datasets. Moreover, a dynamic programming algorithm is designed to partition multi-way theta-joins into subgroups and select a best MapReduce implementation by combining these subgroups for this problem.This paper mainly focuses on the optimization of the join algorithms using MapReduce, around the equi-join, skew join, and θ join, the algorithms presented can improve the execution efficiency, which also can be used for reference to the subsequent researchers in the future.

Keywords/Search Tags:

MapReduce, histogram, optimization of euqi-join algorithm, dataskew, optimization of θ-join algorithm

PDF Full Text Request

Related items

1	Research And Optimization Of Join Algorithm Based On MapReduce
2	Research On Key Technology Of Optimization For Multi Join Based On Hadoop
3	Design And Optimization Join Algorithms Based On Map Reduce
4	Optimization Of Database Join Algorithms On DRAM/NVM-Based Hybrid Memory
5	Research On Join Query Optimization Algorithm In Distributed Database
6	Research And Design Of KNN-join Algorithm Based On MapReduce
7	Optimum Design Of Table Join Algorithm Based On MapReduce
8	Join Algorithm Research Based On MapReduce
9	Research And Implementation Of The Aggregate-Join Query Optimization Approach Based On Mapreduce
10	Research On Join Query Processing And Optimization Techniques In Cloud Computing Environment