Research On Optimization Methods Of Multiple Datasets Join For Distributed System

Posted on: 2017-01-28  Degree: Master  Type: Thesis
Country: China  Candidate: R X Li
GTID: 2348330503989888  Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, processing massive data has become a hot research topic. MapReduce is a framework for processing parallelizable problems. Because of its strong capacity for processing massive data, its convenience, and its scalability, it has become one of the preferred choices for big data. For the important and common join operation, MapReduce performs well on single-attribute equi-joins because of its characteristics, but it handles theta joins over multiple datasets poorly because it cannot produce a join plan. Research on theta joins over multiple datasets is therefore significant for improving the efficiency of big data processing.

The time cost evaluation model is refined by detailing the costs of computation, sorting, combining, and data exchange between memory and disk. A distribution fitting operation, which accurately captures the distribution of the datasets to be joined, is also added to the model, so that the size of the result set can be estimated more precisely. With this model the estimated cost of joining datasets is more accurate, so a more efficient and appropriate join plan can be made, which in turn improves join efficiency.

Based on this cost evaluation model, a method for joining multiple datasets is formulated using join set partitioning and covering. The method treats each join operation as a join set. First, the entire join relation is regarded as a universal set and partitioned into subsets, with a pruning strategy used to reduce the number of subsets. Second, ant colony optimization is used to find the optimal subsets that cover the universal set; these subsets represent the join operations that connect all the datasets. Finally, the result sets of these join operations are regarded as new datasets, and set partitioning and covering are repeated until the final join plan is made. Because set partitioning ensures the completeness of the join plan and set covering ensures its efficiency, the efficiency of the whole join operation is improved.

These optimization methods are compared with several existing methods and tools in a variety of experiments. The results show that they adapt better to multi-dataset joins, and execute them more efficiently, than the other methods and tools.
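The covering step described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the abstract gives no details of the pheromone rule, the pruning strategy, or the cost model, so the function `aco_set_cover`, its parameters (`ants`, `iterations`, `rho`), the dataset names, and the cost figures below are all hypothetical. The sketch assumes each candidate join operation is a set of dataset names with a precomputed cost estimate, and uses a basic ant colony loop to pick a low-cost collection of candidates that covers every dataset.

```python
import random


def aco_set_cover(universe, candidates, costs,
                  ants=20, iterations=30, rho=0.1, seed=0):
    """Pick candidate subsets that together cover `universe` at low total cost.

    candidates: list of frozensets (hypothetical candidate join operations).
    costs: list of positive estimated costs, one per candidate (made-up values).
    Returns (indices of chosen candidates, total cost).
    """
    rng = random.Random(seed)
    pheromone = [1.0] * len(candidates)
    best_cover, best_cost = None, float("inf")

    for _ in range(iterations):
        for _ in range(ants):
            uncovered = set(universe)
            chosen = []
            while uncovered:
                # Only candidates that cover something new are eligible.
                eligible = [i for i, c in enumerate(candidates) if c & uncovered]
                if not eligible:
                    raise ValueError("candidates cannot cover the universe")
                # Attractiveness: pheromone times newly covered elements per unit cost.
                weights = [
                    pheromone[i] * len(candidates[i] & uncovered) / costs[i]
                    for i in eligible
                ]
                pick = rng.choices(eligible, weights=weights, k=1)[0]
                chosen.append(pick)
                uncovered -= candidates[pick]
            cost = sum(costs[i] for i in chosen)
            if cost < best_cost:
                best_cover, best_cost = chosen, cost

        # Evaporate all trails, then reinforce the best cover found so far.
        pheromone = [(1.0 - rho) * p for p in pheromone]
        for i in best_cover:
            pheromone[i] += 1.0 / best_cost

    return best_cover, best_cost


# Hypothetical example: four datasets and five candidate joins with made-up costs.
datasets = {"A", "B", "C", "D"}
candidates = [
    frozenset({"A", "B"}),
    frozenset({"B", "C"}),
    frozenset({"C", "D"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C", "D"}),
]
costs = [4.0, 3.0, 5.0, 9.0, 6.0]

cover, total = aco_set_cover(datasets, candidates, costs)
print([sorted(candidates[i]) for i in cover], total)
```

In the method the abstract describes, the results of the selected joins would then be treated as new datasets and the partition-and-cover step repeated; the sketch shows a single covering round only.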
Keywords/Search Tags: distributed framework, cost evaluation, join strategy, distribution fitting, ant colony optimization