Font Size: a A A

Research Of Clustering Algorithm For Complex Data

Posted on:2017-05-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:M ChenFull Text:PDF
GTID:1318330533951481Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Nowadays,due to the sharply increasing amounts of new-generated data and the growing kinds of disparate data sources,it becomes more and more difficult to get useful information from these complex data.To understand the complex data and unlock its inner pattern is the key to staying ahead of the competition of data utilization.Clustering analysis can discover underlying patterns of data points according to their similarities.But many advanced clustering algorithms have difficulty dealing with variable complex data.Therefore,more effort is needed to improve the clustering performance interms of accuracy and executing speed.Aiming at detecting clusters from complex data more effectivly and efficiently,this thesis presents four clustering algorithms for three kinds of complex data,namely EPC,MulSim,CLUB and SUM.EPC is an easy-to-use,highdimensional algorithm for clustering enviromental pollution sampling data according to the similarity in pollution characteristics,thus it can improve the accuracy of pollution receptor-oriented models including CMB,PMF,UNMIX and PCA.MulSim and CLUB focus on finding clusters with various densities,shapes and sizes.MulSim groups data points based on a clustering strategy that a data point is similar with mutiple data points simultaneously.CLUB detects clusters by finding their density-backbones.SUM is a graph clustering algorithm,which generates clusters by suspecting the connection of the max degree vertices to their neighbors within a cluster.(1)EPC By selecting a first unlabelled point as a cluster centre,EPC assigns each data point in the sampling dataset to its most similar centre of cluster according to both the user-defined threshold and the value of similarity function in each iteration.Then,it modifies the clusters by using a method similar to k-Means.The validity and accuracy of the algorithm are tested on both real and synthetic datasets,which shows that the EPC algorithm can organize all of the sample data into different groups according to the similarities in pollution characteristics as well as detect outliers simultaneously.(2)MulSim MulSim defines a novel distance between pairs of data points based on k-NN,which can automatically adapt different densities when clustering.MulSim groups two data points together if and only if a point is similar with another and the another's multiple nearest neighbors.Experiments show that MulSim performs better than the three classical and the three state-of-the-art methods on a variety of datasets,including various density dataset,uniform density dataset,dataset with multi-centre cluster,dataset with spiral shaped cluster,dataset with convex shaped cluster and multidimensional dataset.(3)CLUB CLUB first finds initial clusters based on mutual k nearest neighbors.Next,taking the initial clusters as input,it identifies cluster density-backbones based on k nearest neighbors.Then,it yields final clusters by assigning each unlabelled point to the cluster to which its nearest neighbor of higher density belongs.Finally,it detects outliers within the final clusters.To comprehensively demonstrate the performance of CLUB,CLUB is compared with six baselines including three classical and three state-ofthe-art methods,on nine two-dimensional various-sized datasets containing clusters with various shapes and densities,as well as seven widely-used multi-dimensional datasets.In addition,Olivetti Face dataset is also used to illustrate the effectiveness of CLUB on face recognition.Experimental results indicate that CLUB outperforms the six compared algorithms in most cases.(4)SUM SUM elegantly defines a similarity function by using the number of common neighbors of two adjacent vertices and the degree of the lower degree vertex.After putting similar vertices into the same clusters,SUM generates initial clusters by suspecting the connection of the max degree vertices within a cluster as follows: it first breaks the connection of the max degree vertices with their neighbors,and then reassigns those max degree vertices.Next,SUM assigns vertices without labels to the initial clusters,and adjusts owners of the border vertices.In experiments,SUM is compared with six baselines including four classical and two state-of-the-art methods,on four graphs with ground-truth and four graphs without ground-truth.Experimental results indicate that SUM can effectivly detect clusters in different graphs,and outperforms the six compared algorithms.The time complexity of each of the four algorithms is approximately linear.Therefore,the four clustering algorithms are all effective and efficient on their corresponding datasets.
Keywords/Search Tags:Clustering, Environmental pollution sampling data, Cluster Densitybackbone, k nearest neighbors, Graph clustering
PDF Full Text Request
Related items