Research And Optimization Of Join Algorithm Based On MapReduce

Posted on: 2017-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Liu
GTID: 2348330518970930
Subject: Engineering
Abstract/Summary:
Querying is a common operation in data processing. Because of the limitations of the MapReduce computing model, however, join operations suffer from problems such as excessive network throughput and data skew on the reduce side, which lower the efficiency of the whole cluster. It is therefore necessary to study MapReduce-based join algorithms and to improve their performance. First, to address the excessive network throughput of the MapReduce-based two-table equi-join, this thesis improves the equi-join algorithm based on BloomFilter. Improving on the original method, it filters both datasets with the BloomFilter, so that most tuples that do not belong to the final result set are filtered out and never emitted to the reduce side. Second, to address the reduce-side data skew caused by the default partitioning method of MapReduce, this thesis improves the equi-join algorithm based on hash virtual re-partition. In the phase where the reduce job obtains metadata from the map jobs, cluster sampling is used to select some of the map jobs and collect their metadata, instead of having the reduce job communicate with all of the map jobs. This reduces both the network throughput and the time needed to obtain the metadata. Finally, the proposed optimizations are verified through experiments. The experimental results show that, when both datasets are large but the final result set is small, the improved BloomFilter-based equi-join algorithm greatly reduces the network throughput between the map side and the reduce side, and the improved equi-join algorithm based on hash virtual re-partition reduces the time needed to obtain the metadata.
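The two-sided filtering idea described above can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the `BloomFilter` class, the `build_filter` and `filtered_join` helpers, the filter size, and the hash scheme are all hypothetical choices made for illustration, and the sketch assumes one filter is built per table and each table is then checked against the other table's filter before anything is sent to the reduce side.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (hypothetical helper, not from the thesis)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(rows):
    """Build a filter over the join keys of (key, value) rows."""
    bf = BloomFilter()
    for key, _ in rows:
        bf.add(key)
    return bf


def filtered_join(left, right):
    """Equi-join two lists of (key, value) rows, filtering both sides first."""
    left_bf = build_filter(left)
    right_bf = build_filter(right)

    # "Map side": keep only rows whose key may exist in the other table.
    left_kept = [row for row in left if right_bf.might_contain(row[0])]
    right_kept = [row for row in right if left_bf.might_contain(row[0])]

    # "Reduce side": an exact hash join removes any false positives.
    by_key = {}
    for key, value in left_kept:
        by_key.setdefault(key, []).append(value)
    joined = [(key, lv, rv) for key, rv in right_kept for lv in by_key.get(key, [])]
    return joined, len(left_kept), len(right_kept)
```

In this sketch the counts returned alongside the join result stand in for the number of tuples that would cross the network. Because a Bloom filter has no false negatives, no matching tuple is dropped; false positives only cost some extra traffic and are discarded by the exact join.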
Keywords/Search Tags: MapReduce, Equi-join, Network throughput, Data-skew, Sampling