
Design and Implementation of a Similarity Self-Join Algorithm for Massive Data Sets Based on MapReduce

Posted on: 2017-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: G H Bao
Full Text: PDF
GTID: 2278330485991393
Subject: Software engineering
Abstract/Summary:
Similarity Self-Join is an important operation in many applications and is widely used in data cleaning, document similarity analysis, and density-based clustering. For massive data sets, MapReduce provides an effective distributed computing framework on which Similarity Self-Join can be implemented. Several problems remain, however. Finely partitioning the regions occupied by clustered data is meant to balance the load, but it is not easy to implement. Existing algorithms cannot perform Similarity Self-Join on massive data sets efficiently: they carry out many unnecessary computations and do not adapt to high-dimensional data. In this paper, we propose two novel Similarity Self-Join algorithms based on the MapReduce framework. The work has two main aspects.

First, the existing MR-DSJ algorithm for Similarity Self-Join performs a large number of unnecessary calculations and can only be applied to low-dimensional data. To address this problem, we propose a coordinate-filtering technique based on MapReduce. The method first reduces the high-dimensional data to two dimensions and divides the space over which the data are distributed into a grid. It then uses a dynamic sliding window to further shrink the candidate sets. Finally, it applies coordinate filtering. This method effectively reduces the number of unnecessary distance computations and greatly reduces the number of candidate sets. Experimental results show that the method effectively addresses the high cost and large number of calculations of Similarity Self-Join on massive high-dimensional data, and improves the efficiency of Similarity Self-Join on high-dimensional data.

Second, because clustered data are allocated to the same Reducer task, the amount of work processed by each Reducer becomes imbalanced. To solve this load imbalance, we filter the data points that lie within the inscribed circle and process them accordingly. To concentrate the data inside the circle and further reduce computation, this paper also proposes a partition method based on a hexagonal grid. Experimental results demonstrate that the new methods are more efficient than other join algorithms. For clustered data points, the in-circle method improves efficiency by 50%, and the hexagon method improves efficiency by more than 80%. The in-circle method effectively solves the load-imbalance problem, in line with expectations.
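
The grid partitioning and coordinate filtering described above can be illustrated with a small single-machine sketch. This is not the thesis's implementation: the abstract does not specify the dimension-reduction step, the sliding window, or the MapReduce job layout, so the sketch assumes a plain projection onto the first two coordinates, a square grid with cell side `eps`, and an in-memory dictionary standing in for the shuffle between map and reduce. All function names (`cell_of`, `coordinate_filter`, `similarity_self_join`) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import floor


def cell_of(point, eps):
    """Assumed 2-D reduction: project onto the first two coordinates,
    then map the projection to a square grid cell of side eps."""
    return (floor(point[0] / eps), floor(point[1] / eps))


def coordinate_filter(p, q, eps):
    """Cheap pre-check: if any single coordinate differs by more than eps,
    the Euclidean distance must exceed eps, so the pair can be discarded."""
    return all(abs(a - b) <= eps for a, b in zip(p, q))


def within_eps(p, q, eps):
    """Exact check on the full-dimensional points (squared distances)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps


def similarity_self_join(points, eps):
    """Return all index pairs (i, j), i < j, with distance <= eps."""
    # "Map" step: key every point by its grid cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_of(p, eps)].append(idx)

    # Visiting only these four neighbours examines each pair of adjacent
    # cells exactly once, so no candidate pair is generated twice.
    forward = [(0, 1), (1, -1), (1, 0), (1, 1)]

    # "Reduce" step: one group per cell.
    result = set()
    for (cx, cy), members in cells.items():
        candidates = list(combinations(members, 2))
        for dx, dy in forward:
            # .get avoids inserting keys into the dict while iterating it.
            other = cells.get((cx + dx, cy + dy), [])
            candidates.extend((i, j) for i in members for j in other)
        for i, j in candidates:
            p, q = points[i], points[j]
            if coordinate_filter(p, q, eps) and within_eps(p, q, eps):
                result.add((min(i, j), max(i, j)))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.9, 0.0, 5.0),
           (3.0, 3.0, 3.0), (3.2, 3.1, 3.0)]
    print(sorted(similarity_self_join(pts, 1.0)))  # [(0, 1), (3, 4)]
```

Because projecting onto two coordinates can only shrink distances, any pair within `eps` in the full space falls in the same or an adjacent cell, so the grid loses no true matches. In the example, the third point shares a cell with the first two but is rejected by the coordinate filter on its third coordinate before any full distance is computed.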
Keywords/Search Tags: Massive dataset, Filter, Similarity Self-Join, Data Partition, Cluster Data, MapReduce