Optimum Design Of Table Join Algorithm Based On MapReduce

Posted on:2018-09-16

Degree:Master

Type:Thesis

Country:China

Candidate:S C Xiao

Full Text:PDF

GTID:2348330563952330

Subject:Engineering

Abstract/Summary:

In recent years, the spread of the Internet has driven rapid growth in data volume, and the concept of big data has become widely accepted. The potential of big data is recognized in every field, and the related analysis and processing technologies have received a great deal of attention. Google's GFS and MapReduce technologies, which are highly available, have made this framework the most popular big data processing tool. Hadoop is an open source implementation of these Google technologies and one of the most popular open source projects. The MapReduce programming framework is a distributed computing framework that provides an efficient parallel coordination mechanism and high fault tolerance while offering users a relatively simple programming model.

Queries are the basic operation of data processing, and joins are the most frequent operation within queries. Improving table join operations is therefore of great significance for the performance of the MapReduce framework. However, because of the computational characteristics of the distributed programming framework, join processing faces many limitations, and efficiency is even lower for multi-table joins. This thesis proposes a two-table join algorithm that uses shared information to reduce the network transmission of intermediate data. On this basis, it uses a pipeline model to improve multi-task concurrency and optimize multi-table joins.

To address the shortcomings of the reduce-side join (RSJ) on the MapReduce framework, an improved algorithm based on the distributed cache mechanism is proposed. The idea is to add a preprocessing step before the RSJ algorithm performs the table join. Preprocessing extracts the join attribute values from one of the join tables, compresses them with a bitmap into small "background" data stored in a small file, and then transmits this file to all nodes through the DistributedCache mechanism. When RSJ then runs, the "background" data can filter out, at the Map stage, much of the data in the other table that does not satisfy the join condition. This reduces the mappers' output and achieves the optimization.

To solve the problem of coordinating multiple tasks, this thesis introduces a pipeline model so that MapReduce can perform join tasks in parallel, further optimizing the multi-table join algorithm. With a task scheduler, multiple tasks can run concurrently, and the gaps between tasks are put to use to increase the system's parallelism. In addition, the strategy for selecting the join order is studied: adjusting the order in which multiple tables are joined reduces the intermediate results...
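
The bitmap preprocessing described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: it simulates the map and reduce phases in plain Python instead of Hadoop, and the bitmap size, the modulo mapping of keys to bits, the row layout `(key, value)`, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

NUM_BITS = 1 << 16  # hypothetical bitmap size; a real job would size it to the key domain


def bit_index(key):
    # Assumption: join keys are non-negative integers, mapped to a bit slot by modulo.
    return key % NUM_BITS


def build_bitmap(rows):
    """Preprocessing: record the join keys of one table in a compact bitmap."""
    bitmap = bytearray(NUM_BITS // 8)
    for key, _value in rows:
        i = bit_index(key)
        bitmap[i >> 3] |= 1 << (i & 7)
    return bitmap


def might_match(bitmap, key):
    """True if the key's bit is set. False positives are possible, false negatives are not."""
    i = bit_index(key)
    return bool(bitmap[i >> 3] & (1 << (i & 7)))


def map_phase(left, right, bitmap):
    """Tag rows by source table; drop right-table rows whose key bit is not set."""
    for row in left:
        yield row[0], ("L", row)
    for row in right:
        if might_match(bitmap, row[0]):
            yield row[0], ("R", row)


def reduce_phase(pairs):
    """Group tagged rows by key and emit the per-key cross product."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, row) in pairs:
        groups[key][tag].append(row)
    joined = []
    for key in sorted(groups):
        for l_row in groups[key]["L"]:
            for r_row in groups[key]["R"]:
                joined.append((key, l_row[1], r_row[1]))
    return joined


def filtered_join(left, right):
    """Return the join result and the number of records the map phase emitted."""
    bitmap = build_bitmap(left)  # in Hadoop this would be shipped to every node via the distributed cache
    emitted = list(map_phase(left, right, bitmap))
    return reduce_phase(emitted), len(emitted)
```

For example, joining `left = [(1, "a"), (2, "b")]` with `right = [(2, "x"), (3, "y"), (2, "z"), (65538, "w")]` drops the row with key 3 in the map phase. The row with key 65538 collides with key 2 under the modulo mapping and passes the filter, but the reduce phase still groups by the exact key, so the false positive costs only some extra shuffle traffic and does not affect the result.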

Keywords/Search Tags:

Join, MapReduce, Hadoop, Pipeline, DistributedCache

Related items

1	Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework
2	Research On String Similarity Join Method Based On Hadoop Platform
3	Research On Key Technology Of Optimization For Multi Join Based On Hadoop
4	Hadoop Based Efficient Join Algorithm Research On GPU
5	Research On Improvement Of Similarity Join In MapReduce
6	Research And Optimization Of Join Algorithm Based On MapReduce
7	Research And Design Of KNN-join Algorithm Based On MapReduce
8	Join Method Research Based On MapReduce
9	Top-k Join Query Processing Method Based On MapReduce
10	Design And Optimize Big-Data Join Algorithms Using MapReduce