Font Size: a A A

A MapReduce Based Adaptive Density Clustering Algorithm

Posted on:2015-12-28Degree:MasterType:Thesis
Country:ChinaCandidate:Y J YangFull Text:PDF
GTID:2298330452459586Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the explosive growth of data, single thread clustering algorithm can notmeet the massive data clustering in both processing power and storage capacity, soseeking parallel solution is necessary. Google’s MapReduce programming model fordistributed computing has brought new hope to parallel clustering. Therefore, ourpaper presents an adaptive density clustering algorithm based on MapReduce.Firstly, we propose an adaptive density based clustering algorithm which iscalled ADC, in order to overcome the shortcoming that DBSCAN can’t find clustersof varied densities and is sensitive to parameters. Our algorithm defines one point’sdensity as the distance from itself to its k-th nearest neighbor. It uses the change rateof density to find the boundaries between clusters with different densities. If and onlyif the number of a point’s nearest neighbors whose density change rate is less than theuser-specified threshold value is not less than k, it is treated as core, and the thresholdis adjusted automatically and dynamically at runtime. Secondly, we propose aMapReduce-based clustering algorithm called MR-ADC on the basis of ADC. Ourproposed algorithm includes five steps. First, it normalizes the data. Second thenormalized data is divided into several groups. Third, improved algorithm ADC isapplied in each divided blocks, and the points which are close to the demarcation willbe written to HDFS. Fourth, analyses the demarcation points to merge the localclusters as a global cluster. Fifth, it re-labels the local clustering results according tothe map from local clusters to the global clusters to find the final clustering result.Finally, we analysis the algorithms on a Hadoop cluster with four nodes. Ouranalysis includes the effects of clustering algorithms and time overhead. Experimentsshow that, ADC algorithm can find clusters of arbitrary shape, size and density, andsetting parameters is more easily with the adaptive. MR-ADC algorithm can get thesame effect as the ADC, and its time overhead is much less than a single threadedalgorithm. MR-ADC is suitable for handling massive data clustering.
Keywords/Search Tags:MapReduce, adaptive, varied density, KNN, DBSCAN
PDF Full Text Request
Related items