Multiple Datasets Joins Based On Time Cost Evaluation Model For Distributed System

Posted on:2016-04-28

Degree:Master

Type:Thesis

Country:China

Candidate:L B Xia

Full Text:PDF

GTID:2348330479453407

Subject:Computer software and theory

Abstract/Summary:

Distributed computing provides a new platform for big data analysis and processing. MapReduce is an important programming model that is often used to process large datasets in parallel or distributed computing environments. However, because of limitations of this programming model, join operations in MapReduce are inefficient when multiple datasets are involved. Improving the existing MapReduce-based methods for joining multiple datasets is therefore significant for improving the efficiency of data query and analysis.

Considering the time cost of join processing, sorting, and compression in a MapReduce job, a time cost evaluation model is extended to calculate the time cost of a MapReduce job. To make the model more useful, a way to estimate the size of the join result from a probability distribution function is presented.

A new method is designed to handle multi-join tasks using the time cost model, a greedy strategy, and dynamic programming. First, some equi-joins are processed to reduce the scale of the unequi-joins; next, all unequi-joins are processed by a multi-way theta-join or by two MRJs (MapReduce jobs); finally, the task is decomposed into several subtasks according to time cost, and an optimal scheme for each subtask is obtained by greedy and dynamic programming. The new method reduces the cost of processing a task by breaking it down and choosing an appropriate join method for each subtask.

We conducted extensive experiments using Hadoop, which show that the new method improves the efficiency of join execution and is more efficient than common approaches such as Hive and Pig.
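To illustrate how a cost model and dynamic programming can be combined to pick a join plan, here is a minimal sketch. It is not the thesis's model. The cost function (rows read from both inputs plus rows written) is a toy stand-in for the time cost of join processing, sorting, and compression, and the result-size estimate assumes independent predicates with fixed selectivities rather than a probability distribution function. The names `subset_rows`, `best_join_plan`, `sizes`, and `selectivity` are hypothetical.

```python
from itertools import combinations


def subset_rows(subset, sizes, selectivity):
    """Estimated row count after joining every dataset in `subset`.

    Assumption: join predicates are independent, so the estimate is the
    product of the input sizes times the selectivity of every pair that
    has a predicate (pairs without one default to 1.0, a cross product).
    """
    rows = 1.0
    for name in subset:
        rows *= sizes[name]
    for a, b in combinations(sorted(subset), 2):
        rows *= selectivity.get(frozenset((a, b)), 1.0)
    return rows


def best_join_plan(sizes, selectivity):
    """Return (cost, plan) for the cheapest binary join tree.

    Toy cost of one join step = rows read from both inputs + rows written.
    A real model would also account for sorting, compression, and I/O.
    """
    names = sorted(sizes)
    # Base case: a single dataset needs no join and costs nothing.
    best = {frozenset([n]): (0.0, n) for n in names}
    for k in range(2, len(names) + 1):
        for combo in combinations(names, k):
            subset = frozenset(combo)
            out_rows = subset_rows(subset, sizes, selectivity)
            candidates = []
            # Try every split of the subset into two non-empty parts.
            for r in range(1, k):
                for left_combo in combinations(combo, r):
                    left = frozenset(left_combo)
                    right = subset - left
                    step = (subset_rows(left, sizes, selectivity)
                            + subset_rows(right, sizes, selectivity)
                            + out_rows)
                    cost = best[left][0] + best[right][0] + step
                    candidates.append((cost, (best[left][1], best[right][1])))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(names)]


# Hypothetical inputs: three datasets and two join predicates.
sizes = {"A": 1000, "B": 100, "C": 10}
selectivity = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, plan = best_join_plan(sizes, selectivity)
print(cost, plan)
```

With these inputs the sketch joins the two small datasets B and C first, because that keeps the intermediate result small before the large dataset A is read. Enumerating all subsets is exponential in the number of datasets, which is one reason a greedy strategy may be paired with dynamic programming for larger tasks.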

Keywords/Search Tags:

distributed computing, join plan, time cost evaluation model, greedy, dynamic programming

Related items

1	Research On Optimization Methods Of Multiple Datasets Join For Distributed System
2	Research On Dynamic Programming Based Join Tree Generation Algorithms
3	Implementation And Evaluation Of Big Data Parallel Join Algorithms
4	Research On Distributed Management And Evaluation Technology Of Materials In Large-scale Module Construction Process
5	DV-Join:A Novel Method Based On Dynamic Indexing And VMC-Filtering For Similarity Join
6	The Cost Model And Its Optimization Based On Distributed System In Moving Object Connection Operation
7	A Query Optimization Of Embedded Mobile Real-time DBMS Based On Cost Model
8	The Research And Implementation Of Logistics Vehicle Dynamic Scheduling Method Based On ADP
9	The Optimization Of Spark SQL Based On Cost
10	Research On Improvement Of Similarity Join In MapReduce