
Join Algorithm Research Based On MapReduce

Posted on: 2017-07-23
Degree: Master
Type: Thesis
Country: China
Candidate: L M He
Full Text: PDF
GTID: 2348330485481334
Subject: Systems analysis and integration
Abstract/Summary:
In recent years, big data technology has been applied to public health, clinical care, the Internet of Things, social networking, social management, traditional retail, industrial manufacturing, and other industries. In the era of big data, data grows and accumulates exponentially, and big data mining and analysis have become a focus of both industry and academia. MapReduce is a distributed computing framework whose good scalability, fault tolerance, and high availability support distributed computation over massive data. It plays an irreplaceable role in big data mining and analysis, and it is an important technology platform for Google, Alibaba, and academic research and applications.

The join is the most common operation in data analysis applications over large-scale datasets. In the MapReduce framework, if the original dataset is unevenly distributed, the amount of data preprocessed by the mapper tasks is likely to be unbalanced, which causes map-side skew. When a skewed dataset is partitioned with the default hash function, some reducer tasks must process far more data than others, which causes reduce-side load skew. Because MapReduce itself cannot deal effectively with join operations on skewed data, this paper designs the MapReduce Frequency Classified Join algorithm. The specific research contents are as follows.

First, a histogram-based data classification method is designed. Hadoop itself cannot sense the distribution of the mapper tasks' output data, so the reducers' load is unbalanced, which reduces the efficiency of the join operation. In this paper, the intermediate results output by the mapper tasks are analyzed statistically using a histogram, and the entire dataset is divided into three categories according to the frequency of the join data. Based on this distribution, the partition function and the data distribution mechanism can be designed so that the load on each reducer is balanced, which improves the efficiency of the join query.

Second, a data distribution mechanism based on the data classification is designed. To avoid unbalanced load across tasks, data redistribution applies partitioning and broadcast algorithms to eliminate the impact of skewed data, and applies a hash algorithm to non-skewed data. Using a different distribution mechanism for each data type completes the redistribution effectively and reasonably, and so avoids the impact of data skew on join query performance.

Third, the MapReduce Frequency Classified Join algorithm is designed. For the redistributed data, the join can be completed within a single node, which avoids the cost of cross-node communication under MapReduce and balances the workload of each node effectively, thereby improving the efficiency of join operations on skewed data. The join algorithm is analyzed quantitatively in terms of disk I/O and network transmission.

Finally, the join algorithm is tested and analyzed through experiments in a Hadoop cluster environment. The experiments show that evaluating the data distribution with a histogram, and then designing a data redistribution strategy adapted to skewed datasets, effectively improves the efficiency of MapReduce-based join operations, with good scalability.
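The classification and redistribution steps described in the abstract can be illustrated with a small Python sketch. It is not the thesis's implementation: the abstract does not give the boundaries of its three frequency categories, so the sketch uses a single hypothetical `threshold` that separates skewed from non-skewed keys. The function names, the relation labels "R" and "S", and the round-robin spreading are likewise assumptions made for illustration.

```python
import zlib
from collections import Counter


def find_skewed_keys(keys, threshold):
    """Build a frequency histogram of join keys; return keys at or above threshold.

    threshold is a hypothetical cut-off, not a value taken from the thesis.
    """
    histogram = Counter(keys)
    return {key for key, count in histogram.items() if count >= threshold}


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() of str varies per process."""
    return zlib.crc32(str(key).encode("utf-8"))


def route(key, side, seq, skewed_keys, num_reducers):
    """Return the list of reducer ids that should receive one record.

    side is "R" (the relation whose keys were counted) or "S" (the other one);
    seq is the record's running index, used for round-robin spreading.
    """
    if key in skewed_keys:
        if side == "R":
            # Partition: spread the many records of a hot key over all reducers.
            return [seq % num_reducers]
        # Broadcast: every reducer needs the matching S records to join locally.
        return list(range(num_reducers))
    # Non-skewed keys: ordinary hash partitioning for both sides.
    return [stable_hash(key) % num_reducers]


# Toy input: key "a" dominates relation R.
r_keys = ["a"] * 8 + ["b", "c"]
skewed = find_skewed_keys(r_keys, threshold=4)

# Count how many R records each of the 4 reducers receives.
loads = Counter()
for seq, key in enumerate(r_keys):
    for reducer in route(key, "R", seq, skewed, num_reducers=4):
        loads[reducer] += 1
```

With plain hash partitioning, all eight "a" records would land on one reducer. Here they are spread evenly, at the cost of replicating the matching "S" records to every reducer, which is the usual trade-off of a partition-plus-broadcast scheme.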
Keywords/Search Tags: MapReduce, Data skew, Join algorithm, Load balancing