Design And Implementation Of Similarity Self-Join Algorithm For Massive Data Sets Based On MapReduce

Posted on:2017-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:G H Bao

Full Text:PDF

GTID:2278330485991393

Subject:Software engineering

Abstract/Summary:

Similarity Self-Join is an important operation in many applications. It is widely used in data cleaning, document similarity analysis, and density-based clustering. For massive data sets, MapReduce provides an effective distributed computing framework, and Similarity Self-Join can be implemented on top of it. Some problems remain, however. Fine-grained partitioning of regions that contain clustered data is meant to balance the load, but it is not easy to implement. Existing algorithms cannot perform Similarity Self-Join on massive data sets efficiently: they carry out a large amount of unnecessary computation and do not adapt to high-dimensional data. In this thesis, we propose two novel Similarity Self-Join algorithms based on the MapReduce framework. The two main contributions are as follows.

First, the existing MR-DSJ algorithm for Similarity Self-Join performs a large number of unnecessary calculations and can only be applied to low-dimensional data. To address this, we propose a coordinate-filtering technique based on MapReduce. The method first applies dimension reduction to map the high-dimensional data to two dimensions and partitions the space over which the data is distributed into a grid. It then uses a dynamic sliding window to further shrink the candidate sets, and finally applies coordinate filtering. This effectively reduces the number of unnecessary distance computations and greatly reduces the number of candidate sets. Experimental results show that the method effectively addresses the high cost and the large number of calculations of Similarity Self-Join on massive high-dimensional data, and improves the efficiency of Similarity Self-Join on high-dimensional data.

Second, because clustered data is allocated to the same Reducer task, the amount of work processed by the individual Reducers becomes imbalanced. To address this load imbalance, the data points that fall inside the inscribed circle are filtered out and computed appropriately. To concentrate more of the data inside the circle and further reduce the computation, this thesis also proposes a hexagonal grid partitioning method. Our experimental results demonstrate that the new methods are more efficient than other join algorithms. For clustered data points, the in-circle method improves efficiency by 50%, and the hexagon method improves efficiency by more than 80%. The in-circle method effectively solves the load imbalance problem, in line with expectations.
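The grid partitioning and coordinate filtering steps described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's MapReduce implementation. The function names (`coordinate_filter`, `within_eps`, `grid_self_join`), the square cells of side `eps`, and the use of Euclidean distance are all assumptions made for illustration, and the dimension-reduction and sliding-window steps are omitted. The sketch relies only on the fact that two points whose coordinates differ by more than `eps` in any single dimension cannot be within Euclidean distance `eps` of each other.

```python
from collections import defaultdict
from itertools import combinations
from math import floor


def coordinate_filter(p, q, eps):
    """Cheap pre-check: False if any single coordinate differs by more than eps."""
    return all(abs(a - b) <= eps for a, b in zip(p, q))


def within_eps(p, q, eps):
    """Exact check: squared Euclidean distance against eps squared."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) <= eps * eps


def grid_self_join(points, eps):
    """Return sorted index pairs (i, j), i < j, whose distance is <= eps.

    Assumption: points are already 2-D (the dimension reduction step
    mentioned in the abstract is not modelled here).
    """
    # Partition step: assign each point to a square cell of side eps.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(floor(x / eps), floor(y / eps))].append(idx)

    # Visit only half of the 8 neighbours so each cell pair is seen once.
    forward = [(1, -1), (1, 0), (1, 1), (0, 1)]
    result = []
    for (cx, cy), members in cells.items():
        # Candidates inside the same cell.
        candidates = list(combinations(members, 2))
        # Candidates spanning this cell and one forward neighbour.
        for dx, dy in forward:
            other = cells.get((cx + dx, cy + dy), [])
            candidates.extend((i, j) for i in members for j in other)
        for i, j in candidates:
            p, q = points[i], points[j]
            # Coordinate filter first, exact distance only for survivors.
            if coordinate_filter(p, q, eps) and within_eps(p, q, eps):
                result.append((min(i, j), max(i, j)))
    return sorted(result)


points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (0.9, 0.0), (1.1, 0.0)]
pairs = grid_self_join(points, 1.0)
```

In a MapReduce setting the cell index would plausibly serve as the shuffle key, so that each Reducer receives one cell together with its neighbours. That is also where the load imbalance described above arises when many points fall into the same cell.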

Keywords/Search Tags:

Massive dataset, Filter, Similarity Self-Join, Data Partition, Cluster Data, MapReduce

Related items

1	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
2	Research And Design Of KNN-join Algorithm Based On MapReduce
3	Join Method Research Based On MapReduce
4	Design And Implementation Of Data Integration System Based-on Similarity Join
5	Research On Improvement Of Similarity Join In MapReduce
6	Research Of Join Algorithm With Skew Data On Mapreduce
7	Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing
8	Research On String Similarity Join Method Based On Hadoop Platform
9	The Research And Implementation Of Comprehensive Mapreduce
10	Research Of Data Partition And Query Optimization Based On Database Cluster