Optimum Design Of Table Join Algorithm Based On MapReduce

Posted on: 2018-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: S C Xiao
Full Text: PDF
GTID: 2348330563952330
Subject: Engineering
Abstract/Summary:
In recent years, the popularity of the Internet has driven rapid growth in the amount of data, and the concept of big data has become widely accepted. Its potential is recognized in every field, and the related analysis and processing technologies have attracted a great deal of attention. Google's GFS and MapReduce technologies, which are highly available, made the framework the most popular big data processing tool. Hadoop is an open source implementation of Google's technology and one of the most popular open source projects. The MapReduce programming framework is a distributed computing framework that provides an efficient parallel coordination mechanism and high fault tolerance while offering users a relatively simple programming process.

Querying is the basic operation of data processing, and the join is the most frequent operation within queries. Improving the join of data tables is therefore of great significance for the performance of the MapReduce framework. However, because of the computational characteristics of the distributed programming framework, there are many limitations in handling joins, and efficiency is even lower for joins over multiple tables. In this thesis, we propose a two-table join algorithm that uses shared information to reduce the network transmission of intermediate data. On this basis, we propose using a pipeline model to improve multi-task concurrency and so optimize multi-table joins.

To address the shortcomings of RSJ on the MapReduce framework, an improved algorithm based on the DistributedCache mechanism is proposed. The idea is to add a preprocessing step before the RSJ algorithm performs the table join. Preprocessing extracts the join attribute values from one of the tables being joined, compresses them with a bitmap into smaller "background" data stored in a small file, and transmits this file to all nodes through the DistributedCache mechanism. RSJ can then use the "background" data at the Map stage to filter out much of the data in the other table that does not satisfy the join condition, which reduces the mappers' output and achieves the optimization.

To solve the problem of coordinating multiple tasks, this thesis introduces a pipeline model so that MapReduce can perform join tasks in parallel, further optimizing the multi-table join algorithm. With a task scheduler, multiple tasks can run concurrently, and the gaps between tasks are used to increase the system's parallel capacity. In addition, the strategy for selecting the join order is studied: the order of joins among multiple tables is adjusted to reduce the intermediate results...
Keywords/Search Tags:Join, MapReduce, Hadoop, Pipeline, DistributedCache
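
The bitmap preprocessing described in the abstract can be illustrated with a small single-process sketch. This is not the thesis's implementation: it assumes that RSJ stands for a reduce-side join, that join keys are non-negative integers below a known bound (so each key maps directly to one bit), and it replaces Hadoop's DistributedCache with a bitmap passed as an ordinary argument. The names build_bitmap, in_bitmap, map_phase and reduce_phase are hypothetical.

```python
from collections import defaultdict

def build_bitmap(keys, max_key):
    """Preprocessing: set one bit per join-key value found in the first table.
    Assumes keys are integers in the range 0..max_key."""
    bitmap = bytearray(max_key // 8 + 1)
    for k in keys:
        bitmap[k // 8] |= 1 << (k % 8)
    return bitmap

def in_bitmap(bitmap, key):
    """Return True if the bit for this key is set."""
    byte = key // 8
    return byte < len(bitmap) and bool(bitmap[byte] & (1 << (key % 8)))

def map_phase(left, right, bitmap):
    """Map stage: tag each row with its source table.
    Rows of the second table whose key is absent from the bitmap are
    dropped here, so they never reach the shuffle."""
    for key, value in left:
        yield key, ("L", value)
    for key, value in right:
        if in_bitmap(bitmap, key):
            yield key, ("R", value)

def reduce_phase(pairs):
    """Reduce stage: group by key and pair up rows from the two tables."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, value) in pairs:
        groups[key][tag].append(value)
    joined = []
    for key in sorted(groups):
        for l_value in groups[key]["L"]:
            for r_value in groups[key]["R"]:
                joined.append((key, l_value, r_value))
    return joined

# Toy tables of (join_key, payload) rows.
left = [(1, "a"), (2, "b")]
right = [(1, "x"), (3, "y"), (2, "z"), (4, "w")]

# In a real job the bitmap file would be shipped to every node;
# here it is simply built once and reused.
bitmap = build_bitmap([k for k, _ in left], max_key=7)
shuffled = list(map_phase(left, right, bitmap))
result = reduce_phase(shuffled)
```

In this toy run the rows with keys 3 and 4 are discarded at the Map stage, so only four of the six input rows are emitted for the shuffle, while the join result is the same as it would be without the filter.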