
Research on a Distributed Fast Clustering Algorithm Based on MapReduce

Posted on: 2018-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Wu
Full Text: PDF
GTID: 2348330512481645
Subject: Engineering
Abstract/Summary:
With the rapid development of information technology, the scale of data is growing exponentially. This poses two major challenges for traditional clustering algorithms: 1) big data contains many outliers, high redundancy, and low value density, so clustering accuracy is low; 2) nearest-neighbor search is expensive on large datasets, so the execution efficiency of clustering algorithms cannot meet practical demand. To overcome these drawbacks, this thesis analyzes the characteristics of the data and proposes a distributed fast clustering algorithm based on MapReduce that achieves both high efficiency and high accuracy.

To address the high data redundancy and the large number of low-value data points, a distributed data reduction algorithm based on MapReduce is proposed. A new sampling algorithm computes the rectangular domain and the sampling domain of every data point and selects the sample points within the sampling domain. The sample points are then used to expand the sampling so as to reduce the original dataset. Finally, a representativeness verification algorithm is proposed to test the sample set. The sampling-based data reduction algorithm significantly reduces both I/O and network costs.

To address the high cost of nearest-neighbor search and the low efficiency of clustering, the Map task partitions the sample set into subsets of equal size, and the Reduce task clusters each data subset. For the Reduce task, an enhanced density clustering algorithm based on extended range queries is proposed. First, an extended range query algorithm based on fixed grids determines the nearest neighbors and reverse nearest neighbors of each point and establishes its influence space neighborhood. Then an outlierness function is computed to distinguish border points from noise points accurately. Each Reduce clustering task outputs its local clustering results.

To obtain the global clustering result for the whole dataset, a new local cluster merging algorithm is proposed. The distances between local clusters are calculated to determine how the clusters are distributed relative to one another and thus which of them should be combined. A connected subgraph discovery method then merges these local clusters into the global clustering result.

The experimental results indicate that the proposed algorithm performs well in reducing the size of large datasets while keeping the distribution of the sample points consistent with that of the original data points, and that data redundancy is reduced without loss of information. The proposed algorithm can also find clusters of arbitrary shape and density and performs better in terms of efficiency and accuracy.
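
The local cluster merging step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the distance measure or the merging criterion, so the minimum point-to-point Euclidean distance, the `threshold` parameter, and the function names `min_cluster_distance` and `merge_local_clusters` are all assumptions made for illustration. Local clusters are treated as graph nodes, an edge joins two clusters that lie close enough together, and each connected component of that graph becomes one global cluster.

```python
from itertools import combinations
from math import dist


def min_cluster_distance(a, b):
    """Smallest Euclidean distance between any point of a and any point of b.

    Assumed distance measure; the thesis may define it differently.
    """
    return min(dist(p, q) for p in a for q in b)


def merge_local_clusters(local_clusters, threshold):
    """Merge local clusters that are connected through short distances.

    Two clusters are linked when their minimum distance is <= threshold
    (a hypothetical criterion). Connected components of the resulting
    graph are found with a union-find structure.
    """
    parent = list(range(len(local_clusters)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of local clusters that lies within the threshold.
    for i, j in combinations(range(len(local_clusters)), 2):
        if min_cluster_distance(local_clusters[i], local_clusters[j]) <= threshold:
            parent[find(i)] = find(j)

    # Collect the points of each connected component into one global cluster.
    merged = {}
    for i, cluster in enumerate(local_clusters):
        merged.setdefault(find(i), []).extend(cluster)
    return list(merged.values())


if __name__ == "__main__":
    # Toy local clusters, as if produced by three separate Reduce tasks.
    local_clusters = [
        [(0.0, 0.0), (1.0, 0.0)],
        [(1.5, 0.0), (2.5, 0.0)],
        [(10.0, 10.0), (11.0, 10.0)],
    ]
    for cluster in merge_local_clusters(local_clusters, threshold=1.0):
        print(cluster)
```

In this toy input the first two local clusters lie within the threshold of each other and are merged, while the third stays separate. The pairwise comparison is quadratic in the number of local clusters, which is acceptable only when the number of partitions is small.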
Keywords/Search Tags: Big data, Cluster, MapReduce, Parallel computing, Data mining