
Study On Residual Error-Based Clustering Algorithms

Posted on: 2020-11-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Milan Deepak Parmar
Full Text: PDF
GTID: 1368330602455531
Subject: Computer application technology

Abstract/Summary:
Clustering is an unsupervised machine learning approach that analyzes data by discovering their underlying structure and classifying them into different categories based on certain characteristic measures, such as internal homogeneity and external separation. Clustering is essential for uncovering the inherent, potential, and unknown knowledge, principles, or protocols in the real world, and has been widely applied in scientific and engineering applications. Over the last three decades, several strategies have been proposed for clustering, such as partitioning-based, hierarchical, grid-based, model-based, and density-based methods. However, they may differ significantly in their definition of a cluster and in how efficiently clusters are identified. Hence, the clustering results of different algorithms may vary even on the same dataset.

Density-based clustering is one of the fundamental approaches to obtaining clusters of arbitrary shapes in the presence of noise in a large problem space. Basically, a region of high density, or a set of densely connected data points, is treated as a cluster, and clusters are separated by low-density regions. Among the many density-based spatial clustering algorithms that deal with noise, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is probably the most well-known, engendered from the basic notion of local density. DBSCAN has two major advantages in identifying arbitrary-shaped clusters with outlier detection, namely the formation of a chain structure of high-density data points (i.e., core points) and the identification of outliers as low-density data points. Nevertheless, DBSCAN has distinct limitations: i) the clustering performance depends highly on user-defined parameter values, and it is sometimes difficult to estimate appropriate values for various datasets without prior knowledge; ii) in the cluster expansion stage, the computational time taken to search for nearest neighbors is high, as it requires computing all pairwise proximities; and iii) adjacent clusters of different densities cannot be properly identified, possibly due to the use of global density parameters.

Recently, Rodriguez and Laio proposed a density-based clustering algorithm called density peak clustering (DPC), which adopts the idea of local density maxima from the mean-shift method and, from K-Medoids, the concept of using only a single parameter based on the distance between data points. DPC has distinctive features: i) it can detect non-spherical clusters based on the generated decision graph, ii) it has few control parameters, and iii) it has relatively low computational complexity. DPC forms clusters in two stages. In the first stage, DPC finds cluster centroids with density maxima, or density peaks, using a well-designed decision-graph approach. In the second stage, each remaining data point is assigned to the same cluster as its nearest neighbor of higher density. Although DPC performs well on many datasets, it still has many deficiencies that might limit its applications.

First, DPC takes the entire data space into consideration when computing the local density, which is then used to generate a decision graph for the identification of cluster centroids; consequently, DPC may face difficulty in differentiating overlapping clusters and in dealing with low-density data points.

Second, the local density of a data point is highly affected by the cutoff distance parameter Cd, and is computed in different ways depending on the size of the dataset, which can influence the clustering performance. The performance of DPC depends greatly on the cutoff distance parameter Cd for estimating an accurate density in terms of the distance between data points. In effect, Cd acts like the radius of a point: the density of a data point is the number of points within its cutoff distance. In general, the selection of Cd is based on the heuristic that the average number of neighbors in a dataset should only be
1-2% of the entire data set. In some cases this approach works well; however, it is not a generalized solution. Concretely speaking, DPC uses different methods to estimate the local density of points in datasets of different sizes. According to the guidelines, the cutoff kernel (i.e., counting the points that fall within the cutoff distance Cd) is used to compute the local density of a data point when the dataset is composed of a large number of data points; otherwise, the Gaussian kernel function is used instead, with the pre-specified cutoff distance Cd. Hence, the density metric differs for large and small datasets, and an arbitrary selection of the cutoff distance Cd can greatly influence the clustering performance. Moreover, there is no unified metric for deciding whether a dataset is small or large, and this choice can produce very different clustering results. Robust methods for calculating accurate densities are not available, and different methods are required to estimate density based on the nature of the dataset.

Third, the selection of cluster centroids: DPC relies on the heuristic approach of the decision graph, where users are asked to identify cluster centroids. Unlike DBSCAN, DPC finds the cluster centers before data points are assigned. Determining the cluster centroids is of vital importance to guarantee good clustering results, because it determines the number of clusters and indirectly affects the assignment; obviously, the key to the DPC algorithm is how to determine the cluster centroids. Once a cluster centroid is wrongly assigned, several data points with lower density than the selected centroid may be assigned to incorrect clusters, producing a poor clustering result. In complex datasets, the user cannot identify the exact number of cluster centroids without domain knowledge, because the decision graph is unable to express natural cluster centroids in the presence of a uniform density distribution. Manual selection of cluster centroids is a big limitation of DPC in intelligent data analysis. A better choice of cluster centroids is related to the user's observation of the nature of the dataset. As such, the performance of DPC is sometimes limited by the manual identification of cluster centroids.

Fourth, DPC lacks robustness when dealing with anomalies, whose detection is a beneficial function of clustering algorithms. DPC does not handle uneven cluster distributions well: anomalies near an identified cluster are always considered part of that cluster regardless of the Cd value in use, because there is no "noise-signal cutoff" in DPC. In such cases, DPC faces difficulty in identifying outliers even with varying Cd values, and it may not be able to find clusters of small sizes or clusters consisting (relatively speaking) only of outliers.

Fifth, the assignment strategy for the remaining data points: for complex datasets with path-based connectivity or spiral-based clusters, once the cluster centroids have been identified, the rest of the data points are assigned to the nearest neighbor of higher density, so DPC behaves like the K-Means or K-Medoids clustering method. This is due to the fact that DPC does not perform well when the local densities of data points in some or all clusters are randomly distributed, which may result in more than one density peak per cluster, which is not reasonable. In such cases, it is challenging for DPC to identify the reasonable number and locations of cluster centroids. Moreover, even if the exact number of centroids is identified in the decision graph, DPC may split a natural cluster into many clusters.

To overcome the aforementioned issues, this thesis proposes a strategy based on residual error theory for sample local features, along with four different strategies to overcome the limitations of DPC and several new clustering methods, as follows.

The first major part of this research focused on generating a decision graph better suited for efficient cluster centroid identification by adopting
more effective density measurement, and on better detecting anomalies for comprehensive clustering results by further examining the borderline data points; to this end, a residual error-based method was developed. A novel Residual Error-based Clustering (REC) method is proposed to better deal with overlapping clusters and low-density data points. In particular, REC learns from DPC in using decision graphs to identify cluster centroids, from DBSCAN in determining density connectivity within a neighborhood, and from residual error theory in measuring density. Specifically, the REC method adopts residual error computation to measure the local density of each data point within a neighborhood region N. Moreover, the method treats low-density data points as halo points after obtaining intermediate clustering results and processes them further, so that anomalies and borderline data points are better distinguished.

The overall process of the REC method consists of the following four stages. First, the residual errors of individual data points are computed as the local density measurement. By adopting residual error computation for the local density, the residual-based method takes only the data points within the neighborhood into consideration, whereas DPC considers all the data points in the entire dataset. By considering only the local regional density, the residual-based method is capable of generating better decision graphs for cluster centroid identification. Second, the decision graph is generated, cluster centroids are identified, and data points are intermediately assigned to their respective clusters. As a rule of thumb, data points with lower residual error values and higher distance values are selected as cluster centroids. Third, halo points (consisting of both borderline points and anomalies) are identified. Finally, anomalies are isolated from the halo points and processed further before presenting the final clustering result. During anomaly detection, a halo point with a high residual error and a low distance value is recognized as an anomaly, and the final clustering results are presented. Due to the further analysis applied to halo points, the proposed method is shown to be capable of better identifying and handling various types of anomalies manifested in different patterns in different datasets.

To evaluate the performance of REC, we apply it to ten UCI datasets, seven widely used synthetic datasets, and three self-defined datasets, and compare its performance with that of other benchmarking algorithms on the same datasets. The F-Score is used to measure the accuracy of the clustering results. It is encouraging to find that REC achieves the highest F-Score on nineteen out of twenty datasets. In addition, the computational time spent by each algorithm (averaged over 20 runs for all models) was also reported. Furthermore, the performance of REC was compared in various aspects, such as identifying cluster centroids, identifying clusters with anomalies, identifying clusters of varying sizes, identifying clusters of irregular shapes, and identifying clusters of different densities. Experimental results on both real-world and synthetic datasets show that the proposed REC method performs better than the DPC, DBSCAN, AP, and K-Means clustering methods.

The second part of this research focused on developing a semi-automatic cluster identification method to eliminate the iterative process of manual cluster centroid selection and the dependency on the decision graph. A Residual Error-based Clustering method with a Fragment merging strategy (FREC) is proposed for better identification of cluster centroids and efficient detection of anomalies without using the heuristic decision-graph approach. The proposed FREC method inherits the strength of density estimation from DPC, density connectivity within a neighborhood from DBSCAN, the structural similarity measure from SCAN, and the density measure from residual error theory. Similar to REC, FREC also adopts residual error computation
to better estimate the local density of a dataset, such that the generated set of residual errors is used to form residual fragments, which are processed further to identify cluster centroids without using the heuristic decision-graph approach. Specifically, unlike DPC, FREC takes the reverse approach to cluster formation. Initially, FREC forms clusters by merging residual fragments with higher similarities. Then, FREC identifies the cluster centroids as the data points with relatively lower residual error, which eventually eliminates the need for a decision graph. Due to the further analysis applied to residual fragments, FREC is capable of better identifying and handling various types of anomalies manifested in different patterns in different datasets.

The overall process of the FREC method consists of the following four stages. First, the residual error of each data point is computed as the local density measurement. Second, residual fragments are generated based on the identification of the adjoining points within the predefined cutoff threshold Cd of each data point, along with its link structure and their respective neighborhood points. Specifically, a single residual fragment is a structural network composed of the link structure of a data point, its identified adjoining points, and their respective neighborhood points. Third, based on the principle of structural similarity, the structural similarity index (SSI) between each pair of residual fragments is computed. The higher the structural similarity value, the higher the probability of aggregation between fragments, and clusters are formed from fragments of high structural similarity. Moreover, the cluster centroid of each generated cluster is identified as the data point with the lowest residual error. Finally, anomalies are isolated as the fragments with the least structural similarity to other residual fragments, and the final clustering results are presented.

To evaluate the robustness and test the feasibility of the proposed FREC, we compare its performance on twelve UCI datasets, four widely used synthetic datasets, and three self-defined datasets against all benchmarking models. To demonstrate the effectiveness of the FREC algorithm, we use the F-Score to assess the quality of the clustering results. It is encouraging to find that FREC achieves the highest F-Score on eighteen out of nineteen datasets. Moreover, we also compare the computational time taken by each algorithm (averaged over 20 runs for each method). Furthermore, the performance of FREC was compared in various aspects, such as detecting cluster centroids, detecting clusters with anomalies, detecting clusters of arbitrary shapes, detecting clusters of varying sizes, and detecting clusters of different densities. Experimental results on standard benchmark clustering datasets validate the effectiveness of the proposed FREC method over DPC and other benchmarking clustering methods.

The third part of this research deals with techniques to identify low-density clusters having path-based connectivity, and to minimize the misclassification rate of the REC-based method by adopting a more efficient assignment strategy for low-density data points. The REC method works very well in correctly identifying anomalous data points and the number of clusters of different densities. However, it still has some limitations when processing datasets with more complex, low-density-connected arc-shaped clusters. REC is able to detect low-density arc-shaped clusters as halo points but is unable to aggregate the halo points correctly under varying values of the input parameters. Specifically, REC always splits an arc-shaped cluster and mistakenly assigns part of it to other clusters. Clearly, the special distribution of low-density data points contributes to the failure of cluster aggregation, which is caused by inappropriate processing of halo points. Moreover, since the
cluster formation of the REC method is similar to that of DPC, REC assigns data points based on the identified cluster centroids, so REC might divide a natural cluster even if the correct cluster centroid is identified within it. In particular, when a dataset consists of an arc-shaped cluster enclosing Gaussian-distributed clusters, REC cannot identify the number of clusters correctly even when the correct number of centroids is identified in the decision graph.

In order to perform better analysis of halo points for detecting arc-shaped clusters and to improve the clustering results, we propose an enhanced REC method named Path-following Residual Error-based Clustering (PREC). PREC adopts the halo network method to unravel low-density data points before they undergo further processing, where the halo points are categorized into borderline points, anomalies, hub points (intermediate data points lying between two or more clusters), and new cluster points based on their characteristics. Due to the implementation of the halo network, PREC is shown to be competent at better identifying and handling low-density arc-shaped clusters manifested in different datasets. PREC introduces two main concepts: "Halo Network Generation" and "Halo Refinement". After halo point identification, PREC generates a structural network of the halo points and then implements a halo refinement procedure for further classification of the halo points.

As a rectification of REC, PREC consists of the following five stages. First, compute the residual error of each data point as the density measurement, together with its distance to the nearest data point with a lower residual error. Second, generate the decision graph, identify cluster centroids, and assign data points their respective cluster labels. Third, identify halo points (composed of anomalies, borderline points, hub points, and new cluster points). Fourth, create a structural network of the halo points based on the distance between halo points. After generating the halo network, we identify the core points of each halo point in every halo network; a core point is defined as a neighborhood data point of a halo point, within the predefined cutoff threshold Cd, in a halo network. After the core point identification process, we find the cluster label of each core point and thereby identify the total number of neighborhood clusters for each halo network. In a nutshell, a single halo network is a structural network recording the total number of core points in it and the total number of neighborhood clusters. Finally, detect and identify the types of halo points based on the halo network characteristics and their relationship with the neighborhood clusters, and output the final clustering results (with threadlike clusters, anomalies, and hub points identified).

To demonstrate the effectiveness and validate the robustness of the proposed PREC method, we test its performance on eight widely used UCI datasets, seven standard synthetic datasets, and three self-defined datasets. The F-Score is used to measure the clustering results of all the benchmarking clustering algorithms, and the clustering performance of all benchmarking models is compared. It is evident that the clustering results obtained by PREC are the best on seventeen out of eighteen datasets. Furthermore, the computational time taken by each algorithm (averaged over 20 runs for each algorithm) was also reported. Experimental results on classical real-world and synthetic datasets demonstrate that the proposed PREC method achieves superior performance to REC, DPC, and other benchmarking clustering methods.

The final part of this research is focused on the identification of low-density ring-shaped clusters having path-based connectivity, or spiral-based clusters, and on eliminating the performance dependency on the underlying density estimator, i.e., the cutoff
distance parameter Cd. A new hybrid method that combines the K-Means method with the REC method (KREC) is proposed for better identification of ring-shaped clusters and improved clustering results, with anomaly detection, on large-scale datasets without using the cutoff distance parameter Cd. The proposed KREC method inherits the strength of partitioning data into smaller groups from K-Means, and residual error computation as the density measurement together with data point assignment based on identified cluster centroids from the REC method. KREC is a two-phase approach: it uses K-Means to cluster the dataset into relatively small micro-clusters in the partition phase, and then applies the residual error concept to aggregate and further analyze these micro-clusters to find the natural clusters and anomalies in the merging phase.

The KREC method consists of four consecutive stages. First, partition the dataset into a set of micro-clusters using the K-Means method. Second, compute the residual error of each micro-cluster as a measurement of local density, and then compute the distances. Third, generate the decision graph, select cluster centroids, and generate clusters by assigning the remaining micro-clusters to the respective cluster centroids. Finally, anomalies are isolated as the micro-clusters with the highest residual error, and the final clustering results are presented.

To test the effectiveness and scalability and to exhibit the robustness of the proposed KREC, we compare its clustering performance on seven UCI datasets, six synthetic datasets, and nine self-defined datasets with that of K-Means, AP, DBSCAN, DPC, and REC on the same datasets. We adopt the F-Score to measure the accuracy of the clustering results. We report the clustering performance of all the benchmarking algorithms (the average of 20 independent runs is reported to minimize differences caused by randomness). It is observed that KREC performs best on twenty-one out of twenty-two datasets. Extensive experiments performed on several real-life and synthetic complex datasets demonstrate the effectiveness of the proposed method over DPC and other benchmarking clustering algorithms.

To summarize, density-based methods cannot express the compact relationship of densities at the border points of clusters, and the clusters created by residual error-based methods are more accurate. The residual error-based methods express the true potential of each data point in the dataset and provide a strong foundation for identifying the number of expected clusters and the border points. The extensive experimental results and analysis of all the proposed residual error-based methods demonstrate their robustness and effectiveness on complex datasets, and show that residual error is a plausible and effective density measurement compared with existing density-based approaches. The thesis provides strong theoretical justification and experimental evidence to support the use of residual error-based methods instead of density-based models.
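As an illustration of the decision-graph procedure this work builds on, the following is a minimal sketch of plain DPC, the baseline the abstract describes: local density via the cutoff kernel, the distance to the nearest higher-density point, and assignment of remaining points densest-first. This is not the thesis's REC method; the automatic centroid choice by the largest rho-times-delta products is an assumption made here for a runnable example, since the original DPC decision graph is inspected manually.

```python
import numpy as np

def density_peak_cluster(X, cutoff, n_clusters):
    """Two-stage DPC sketch: find density peaks, then assign the rest."""
    n = len(X)
    # Pairwise Euclidean distance matrix.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density (cutoff kernel): neighbours within Cd, excluding self.
    rho = (d < cutoff).sum(axis=1) - 1
    # Process points from densest to sparsest (stable sort for reproducibility).
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)              # distance to nearest higher-density point
    nearest_higher = np.full(n, -1)
    delta[order[0]] = d[order[0]].max()  # usual convention for the global peak
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(d[i, higher])]
        nearest_higher[i] = j
        delta[i] = d[i, j]
    # Hypothetical automatic centroid selection: largest rho*delta products
    # (in DPC proper, users read centroids off the decision graph).
    centroids = np.argsort(-(rho * delta), kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centroids] = np.arange(n_clusters)
    # Stage two: each remaining point, densest first, joins the cluster of
    # its nearest neighbour of higher density.
    for i in order:
        if labels[i] < 0:
            labels[i] = labels[nearest_higher[i]]
    return labels, centroids

# Two well-separated square blobs; each should come back as one cluster.
X = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5],
              [10, 10], [10.5, 10], [10, 10.5], [10.5, 10.5]])
labels, centroids = density_peak_cluster(X, cutoff=1.0, n_clusters=2)
```

On this toy input the first four points share one label and the last four the other, which matches the behavior the abstract attributes to DPC on simple datasets; its documented weaknesses (overlapping clusters, uneven densities, manual centroid selection) are precisely what the residual error-based methods target.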
Keywords/Search Tags: clustering, residual error-based clustering, density peak clustering, halo points, anomaly detection, residual fragment, halo network, low-density data points, complex datasets