Font Size: a A A

Research On Optimization For Multi-way Join In A Map-Reduce Environment

Posted on:2013-01-13Degree:MasterType:Thesis
Country:ChinaCandidate:Y R ZhaoFull Text:PDF
GTID:2298330467978166Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the social development and decline of the cost to produce data, the amount of data generated by human is in exponential growth. Therefore, the analysis of massive data has being gradually paid closer attention. Distributed computing is more and more widely used in the analysis of massive data, because the centralized data warehouse cannot provide good scalability in the face of the explosive growth of data. While, by using the distributed computing, we can assign a task to multiple nodes to get better performance. MapReduce framework proposed by google provides a good framework for distributed computing. MapReduce has been widely recognized as an efficient tool for large-scale data analysis. Multi-join will be often used in massice data analysis. There are mainly two methods to implement the multi-way joins on map-reduce, cascade and replicated methods. But both of them have its shortcoming, and in some cases their effciency is poor.This thesis proposes a new method which mixs the cascade and replicated methods.In the process of implement the multi-way join, we implement some joins by using cascade method, and some joins use replicated method. In this way we can get better performance.This approach will produce a variety of execution plan, and we need to find which plan needs the least cost. So this thesis proposes two optimization algorithms to find the optimal excution plan. The first one is exhaustive optimization algorithm. In this algorithm, we use the connection graph partitioning and recursive to traverse all possible excution plans, and find the optimal one. We also provide a pruning of the search space, but the complexity of the search space is very high. The second algorithm is based on greedy, we find the replicated joins that have the biggest cost saving base on the best plan of cascade method. By this algorithm, we can find an approximate optimal execution plan with lower complexity.The experiments have shown that the CRMJ method we proposed is more efficiency than the conventional cascade method. At the same time, we compare the two optimization algorithm. The experiment shows that the exhaustive optimization algorithm is better for small number of join tables and the greedy algorithm is better for many join tables.
Keywords/Search Tags:multi-way join, map-reduce, replicated join, cascade join, query optimization
PDF Full Text Request
Related items