Font Size: a A A

Research And Implementation Of MapReduce-Based Heuristic Multi-Join Optimization Under Hybrid Storage

Posted on:2016-04-07Degree:MasterType:Thesis
Country:ChinaCandidate:L L XingFull Text:PDF
GTID:2298330452466430Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Parallel distributed programming framework–MapReduce has become the most popular toolin mass data analysis with the advantages of simplicity, extendibility and fault tolerance, whichresults in its widely used in the Internet field. However, due to its own limitation, the performanceof MapReduce is very low when it is adopted to perform Multi-Join. Therefore, it is necessary todesign an improved method for Multi-Join operation under MapReduce framework.Firstly, this thesis describes the significance of multi-join query optimization in the massivedata environment. Following by the detailed introduction of the MapReduce programmingframework and research backgrounds of the join algorithm and multi-join query optimizationstrategies, this thesis analyzes three existed methods for multi-join query, which are multipletwo-table join, Replicated Join and Grouping join. Then a novel MapReduce-based HeuristicMulti-Join Optimization (MHMO) strategy is proposed. Its main idea is that considered therelation between join method and the join model and recommend different execution strategies fordifferent join patterns. This strategy aimed at consuming the minimum cost at the same time togenerate a suboptimal query plan. Particularly, as to hybrid join, in order to obtain the optimalexecution plan, we further define a cost estimation model. Moreover, in this thesis, the efficiencyof multi-join query optimization is further improved in two aspects: one is the real-timeoptimization strategy that determines the multi-join task’s subtask join order and join executionalgorithm in real time, which improves the cost estimation accuracy. The other one is thecombination of hybrid storage PageFile which is based on the designed page structure, which canavoid the full table scan as well as the full column scan.Subsequently, the thesis provides the detailed algorithm design and implementation of theproposed MHMO strategy, the main modules include:1) MapReduce-based Heuristic Multi-JoinOptimization part.2) Hybrid-join optimization part.3) Cost estimate model part.4) Real-timeoptimization part.Finally, in order to verify the performance of the proposed MHMO strategy, this thesis has done some extensive experiments on top of the Hadoop platform. Experiments are carried out ondata warehouse benchmark TPC-H datasets. The query’s execution results that performed underMHMO strategy have been compared with Hive and Pig System. The experimental results haveshown that MHMO has a shorter response time and good balancing capabilities, which improvesthe overall performance of MapReduce multi-join queries.
Keywords/Search Tags:MapReduce, Row-column Hybrid Storage, Multi-Join, MHMO, CostEstimation Model
PDF Full Text Request
Related items