Design and Optimization of Join Algorithms Based on MapReduce

Posted on: 2016-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: L Hu
Full Text: PDF
GTID: 2308330479983313
Subject: Computer system architecture
Abstract/Summary:
The MapReduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing because of its high fault tolerance, usability, and scalability. MapReduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operations such as join, Cartesian product, and set operations are difficult to implement with MapReduce: it can process data streams easily but provides no direct support for handling multiple input data streams. As a result, the binary relational join operator has no efficient implementation in the MapReduce framework. Some implementations of the join operator exist for the Hadoop distribution of MapReduce, but they do not perform well. Improving the MapReduce join algorithm is therefore a pressing problem.

First, after analyzing the shortcomings of the general two-way reduce-side join, the thesis proposes an optimized partition join algorithm based on index files. The idea is to partition the two input files before they are joined, create an index file for the smaller input file, and store it in HDFS (Hadoop Distributed File System). When the two tables are joined on the map side, each map task takes part of the larger dataset as input and, before reading it, obtains the partition ID of its split. By looking up the index file, it finds the corresponding partition of the smaller dataset with the same partition ID, then fetches it and loads it into memory. In this way each map task loads only part of the smaller dataset into memory, while still making full use of the memory of the data nodes.

Second, after analyzing the shortcomings of the general multi-way join algorithm in the Hadoop MapReduce model, the thesis proposes an optimized partition policy. The main idea of the policy is that each key/value pair can be sent to multiple reducers. Under this policy, a reducer that receives all the data required by the join can join multiple tables itself, which reduces the number of MapReduce jobs needed to perform a multi-table join. In addition, the thesis proposes an optimization that creates bitmap files to reduce the amount of data before partitioning is executed. Together, these optimizations reduce the data transfer cost and improve the efficiency of multi-table join tasks.

Finally, the proposed optimizations are verified through a large number of experiments. The results show that the optimized MapReduce-based methods substantially reduce the cost of the shuffle stage and improve the efficiency with which the system performs join tasks, and thus overall system performance.
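
The index-based partition join summarized in the abstract can be illustrated with a small in-memory sketch. It is not the thesis implementation: the modulo partition function, the dictionary standing in for HDFS storage, the partition count, and all function names are assumptions made for illustration. The sketch only shows the core idea that each map task loads a single matching partition of the smaller dataset rather than the whole dataset.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for the example


def partition_id(key, num_partitions=NUM_PARTITIONS):
    """Assign an integer join key to a partition (assumed modulo scheme)."""
    return key % num_partitions


def partition_records(records, num_partitions=NUM_PARTITIONS):
    """Group (key, value) records by the partition ID of their join key."""
    parts = defaultdict(list)
    for key, value in records:
        parts[partition_id(key, num_partitions)].append((key, value))
    return parts


def build_index(small_parts):
    """Stand-in for the index file: partition ID -> location of the partition.

    The 'location' is a made-up file name and 'storage' is a plain dict;
    in the thesis both the index and the partitions are stored in HDFS.
    """
    storage = {f"small-part-{pid}": recs for pid, recs in small_parts.items()}
    index = {pid: f"small-part-{pid}" for pid in small_parts}
    return index, storage


def map_side_join(pid, large_split, index, storage):
    """One map task: load only the matching small partition, then hash-join."""
    location = index.get(pid)
    if location is None:
        return []  # no small-side records share this partition ID
    lookup = defaultdict(list)
    for key, value in storage[location]:
        lookup[key].append(value)
    return [
        (key, large_value, small_value)
        for key, large_value in large_split
        for small_value in lookup.get(key, [])
    ]


def demo():
    """Join a small and a large toy dataset partition by partition."""
    small = [(1, "a"), (2, "b"), (5, "c")]
    large = [(1, "x"), (5, "y"), (5, "z"), (7, "w")]
    index, storage = build_index(partition_records(small))
    result = []
    for pid, split in sorted(partition_records(large).items()):
        result.extend(map_side_join(pid, split, index, storage))
    return result
```

Calling `demo()` joins the two toy datasets on their integer keys. Keys 1 and 5 fall into the same partition, so the map task for that split loads only that one small-side partition, and the split containing key 7 finds no index entry and loads nothing.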
Keywords/Search Tags: HDFS, MapReduce, Join query and optimization, Partitioning optimization