Font Size: a A A

Research On Complex Distance Measure Based MapReduce Similarity Join Techniques

Posted on:2015-09-23Degree:MasterType:Thesis
Country:ChinaCandidate:B LeiFull Text:PDF
GTID:2348330482956051Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Similarity join operation has been widely used for numerous applications, such as webpage duplicate detection, entity recognition, data cleaning and image retrieval. With the increasing size of data, processing similarity join on large-scale data with distributed parallel processing framework has attracted a growing number of researchers and commercial companies. Complex distance measures such as EMD and Bregman Divergence are attractive similarity measures for measuring the similarity between images or speeches, since they are more robuster and more consistenter with human vision. However, the existing similarity join algorithms based on such complex distance measures apply only to a centralized environment, which unable to meet the needs for large-scale data processing. Therefor, we study the problem that processing similarity join in large-scale data based on complex distance measure with distributed parallel processing framework MapReduce.Firstly, we present the basic EMD-BNLJ and Breg-BNLJ algorithms, which randomly and evenly divide data into different parts. The EMD-BNLJ algorithm processes Top-k self-similarity join operation in one dataset, and the Breg-BNLJ algorithm processes similarity join operation between two different datasets.Secondly, utilizing the character of dual problem of EMD, we map the original data to a one dimension space, and then a data locality preserved partition strategy is designed for reduce communication cost. Forthermore, a quantile based load balance strategy is generated by sampling and estimating computational cost, which can efficiently ensure the load balance of the algorithm. With the aforesaid techniques, we present the EMD-DLPJ algorithm, which is finally proved to be one fifth as little as EMD-BNLJ on communication cost at least, and 6.9 times as efficient as EMD-BNLJ at most by extensive experiments on real datasets.Finally, according to the character of computation on Bregman Divergence, we design a data partition strategy by dividing the original data space using VA-File, which is able to filter unnecessary distance computation, and then reduce the communication cost. Meanwhile, a prefix based grouping strategy by sampling and estimating computation cost can efficiently ensure the load balance. Then we present the Breg-VAFJ algorithm, which is finally proved to be one third as little as Breg-BNLJ on communication cost at least, and as 4.8 times as efficient as Breg-BNLJ at most by extensive experiments on multiple real datasets.In conclusion, this thesis studies the problem that processing similarity join in large-scale data based on EMD and Bregman Divergence complex distance measure with distributed parallel processing framework MapReduce. Accurately controllable data partition methods are designed to instead of the basic random even data partition method by utilizing the character of computation of complex distance, which are able to reduce the communication cost of MapReduce job. Meanwhile, corresponding load balance strategies are also presented by estimating the computational cost on sampled data set, which can efficiently ensure the load balance of the MapReduce job. Finally, extensive experiments demonstrate the efficience of our algorithms.
Keywords/Search Tags:similarity join, complex distance, MapReduce, data partition, load balance
PDF Full Text Request
Related items