Font Size: a A A

Research On Adaptive Varied Density Clustering Algorithm Based On DBSCAN

Posted on:2018-12-21Degree:MasterType:Thesis
Country:ChinaCandidate:S M WangFull Text:PDF
GTID:2348330512476829Subject:Communication and Information System
Abstract/Summary:PDF Full Text Request
With the rapid development of information technology,the scale of data presents explosive growth.It has practical significance to mine valuable information from the complex data.As an important method of data mining,clustering algorithm is widely used in data analysis and data mining.DBSCAN algorithm is a typical clustering algorithm,which is based on the measurement of data density,can identify the noise and clusters with arbitrary shape in datasets.This paper takes the research on DBSCAN algorithm,and puts forward an improvement method to deal with the high sensitivity to the parameters and the bad effect when dealing with the datasets with varied density.In addition,in order to adapt to the large-scale datasets,a fusion algorithm is proposed in this paper.The work of the dissertation is partly supported by the National Natural Science Foundation of China(No.61172072,61271308),Beijing Natural Science Foundation(No.4112045),and Research Fund for the Doctoral Program of Higher Education of China(No.20100009110002).In this paper,the major researches are as follows:Firstly,aiming at the shortcoming of DBSCAN algorithm to find out clusters with varied densities,this paper proposes the Adaptive Varied-Density-Based Spatial Clustering of Applications with Noise algorithm(AV-DBSCAN).AV-DBSCAN algorithm changes the definition of Eps-neighborhood to MinPts-neighborhood,and chooses the MinPts nearest points as the MinPts-neighborhood.This algorithm can get the parameters Eps and MinPts based on the M-nearest Directed Graph of datasets,reducing the difficulty of selecting the parameters.In AV-DBSCAN algorithm,the cluster consists of two parts:the core cluster and the border cluster.The core cluster can be identified by the strongly connected components of MinPts-neighborhood Directed Graph;and the border cluster contains the residual points,which can weakly connect to the core cluster.With the same time complexity as DBSCAN algorithm,the AV-DBSCAN algorithm can not only improve the accuracy of clustering for datasets with varied density,but also reduce the sensitivity to parameters.Then,in order to adapt to the increasing scale of data,this paper proposes the Balanced Iterative Reducing-Adaptive Varied-Density Based Spatial Clustering of Applications with Noise algorithm(BIRAV-DBSCAN).The algorithm combines the advantages of BIRCH algorithm and AV-DBSCAN algorithm.It can use limited memory resources and less I/O consumption to efficiently cluster the datasets with varied density with less error,and can identify clusters with arbitrary shape and noise in the datasets.Because the algorithm has good scalability,it can be integrated with parallelization technology to increase the processing capacity of large-scale datasets.Finally,in order to evaluate the validity and reliability of AV-DBSCAN and BIRAV-DBSCAN algorithm,two synthetic sample datasets and one text datasets of Sina News are carried out for the simulation experiments.The experimental result shows that the AV-DBSCAN algorithm has higher accuracy to find out clusters with varied densities while it has similar running time with DBSCAN algorithm.BIRAV-DBSCAN algorithm can be more efficient than DBSCAN algorithm,and the running time can be linear with the increase of datasets size.So BIRAV-DBSCAN algorithm is suitable for large-scale datasets.
Keywords/Search Tags:DBSCAN Algorithm, Datasets with Varied Density, Strongly Connected Components, Large-scale Datasets
PDF Full Text Request
Related items