
Density-based Statistical Merging Clustering Algorithm

Posted on: 2017-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: B B Liu
GTID: 2348330503495645
Subject: Applied Mathematics
Abstract/Summary:
In recent years, with the rapid development of the national economy and the wide application of network technology, data sources have kept expanding, data sets have grown in size, and data structures have become increasingly complex. How to extract useful information from large-scale data with complex structure has become a current research focus.
As an important data analysis technique in the field of data mining, cluster analysis has a wide range of applications in pattern recognition, information processing, machine learning, and other areas. Because initial conditions and clustering criteria differ from one method to another, a variety of clustering algorithms have emerged. However, when faced with large data sets that exhibit inter-class similarity, intra-class difference, noise, and overlap, the limitations of existing clustering algorithms become more and more apparent.
Because traditional clustering algorithms handle noise and overlap poorly, this thesis proposes a density-based statistical merging clustering algorithm (DSM) from a statistical point of view. The algorithm treats each feature of a data point as a set of independent random variables and derives a statistical merging criterion from the independent bounded difference inequality. Combining this criterion with the density information of the data points, DSM uses the descending order of density as the merging order during the agglomeration process and achieves the statistical merging of data points belonging to different types. Experimental results on artificial and real data sets show that DSM not only handles convex data sets but also clusters well on data sets that are non-convex, overlapping, or noisy. This demonstrates the applicability and validity of the algorithm.
To address the failure of traditional clustering algorithms on large-scale data, the thesis also proposes a density-based statistical merging algorithm for large data sets (DSML) from the point of view of data sampling. This algorithm extends DSM to a broader application area. First, DSML obtains a new sampling algorithm (the Statistical Leaders algorithm) by improving the Leaders algorithm with the statistical merging criterion. Second, by combining the Statistical Leaders algorithm with DSM, DSML completes the clustering of the whole data set. Theoretical analysis and experimental results show that DSML obtains a more representative sample set, has nearly linear time complexity, can handle arbitrary data sets, and is insensitive to noisy data, all of which help in dealing with large-scale data sets.
Keywords/Search Tags:clustering, density, random variable, statistical merging, sampling, leader
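For readers unfamiliar with the Leaders algorithm that the abstract takes as its starting point for sampling, the following is a minimal sketch of the classical single-pass version only. It is not the thesis's Statistical Leaders algorithm, whose merging criterion is not given in this abstract. The function name `leaders`, the threshold name `tau`, and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def leaders(points, tau):
    """Classical single-pass Leaders clustering (illustrative sketch).

    Each point joins the first existing leader within distance `tau`;
    if no leader is close enough, the point becomes a new leader.
    `tau` and the Euclidean metric are assumptions, not taken from the thesis.
    """
    leader_points = []  # representative points found so far
    followers = []      # followers[k] holds indices of points assigned to leader k

    for idx, p in enumerate(points):
        for k, leader in enumerate(leader_points):
            if math.dist(p, leader) <= tau:
                followers[k].append(idx)
                break
        else:
            # No leader within tau: this point starts a new group.
            leader_points.append(p)
            followers.append([idx])

    return leader_points, followers


# Small usage example with two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
reps, groups = leaders(points, tau=1.0)
print(reps)    # leaders, usable as a reduced sample of the data
print(groups)  # indices of the points represented by each leader
```

The classical procedure scans the data once, which is why leader-based sampling is attractive for large data sets, but its result depends on the scan order and on the fixed threshold. According to the abstract, the thesis replaces the plain distance test with its statistical merging criterion; that step is not reproduced here.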