
GMM Trees And Forests: Hierarchical Algorithms For Estimating The Number Of Clusters In High Dimensional Complex Data

Posted on: 2021-06-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Muhammad Azhar
Full Text: PDF
GTID: 1488306110987359
Subject: Information and Communication Engineering
Abstract/Summary:
Clustering is an important unsupervised data analysis technique that partitions data objects into a set of clusters such that objects within a cluster are more similar to each other than to objects in other clusters. A well-known problem in data clustering is estimating, in advance, the number of clusters inherent in the data, because many state-of-the-art clustering techniques require the number of clusters as an input parameter. The problem becomes more challenging for high dimensional data containing a large number of clusters, because a poorly chosen number of clusters can adversely affect the clustering results. In this thesis, we propose GMM-Tree, short for Gamma Mixture Model Tree, to estimate the number of clusters and the initial cluster centers. In addition, we propose several methods that use GMM-Tree to solve clustering and classification problems on high dimensional complex datasets with a large number of clusters/classes. The major contributions of this thesis are given below.

First, we propose a new method, named Gamma Mixture Model Tree (GMM-Tree for short), for estimating the true number of clusters and the initial cluster centers in a dataset with many clusters. In this method, observation points are placed in the data space, and the clusters are observed through the distributions of the distances between the observation points and the objects in the dataset. A Gamma Mixture Model (GMM) is built from a distance distribution to partition the dataset into subsets, and a GMM tree is obtained by recursively partitioning the dataset. From the leaves of the GMM tree, a set of initial cluster centers is identified and the true number of clusters is estimated.

Second, two GMM-Tree based forest algorithms are proposed that ensemble multiple GMM trees to handle high dimensional data with many clusters. The GMM-P-Forest algorithm builds GMM trees in parallel, whereas the GMM-S-Forest algorithm builds a GMM forest sequentially. The experimental results demonstrate that these algorithms are able to estimate cluster counts in the hundreds in complex high dimensional datasets.

Third, an ensemble method (named the SSS-GMM Forest method) is proposed to estimate the number of clusters in a very high dimensional, noisy and sparse dataset with a large number of clusters. The SSS-GMM Forest method first uses the GMM-Tree and k-means algorithms to divide the features of the dataset into feature strata. Then, the stratified subspace sampling method is used to sample subspace features from the feature strata and generate a set of subspace datasets from the high dimensional dataset. After that, the GMM-Tree algorithm is used again to identify the number of clusters and the initial cluster centers in each subspace dataset, which the k-means algorithm then uses to cluster that subspace dataset. Finally, the link-based method is used to integrate the subspace clustering results into an object-cluster association matrix, from which the final ensemble clustering result is generated by the k-means algorithm with the number of clusters identified by the GMM-Tree algorithm.

Fourth, we propose a new hierarchical Gamma Mixture Model-based supervised method for classifying high dimensional data with a large number of classes. This method first uses the SSS-GMM Forest method to produce ensemble clusters. Then, the dominant class label is assigned to each cluster in the ensemble clustering result. A new object is classified by computing the distance between it and the center of each cluster in the classifier, and the class label of the nearest cluster is assigned to the new object. In this way, the SSS-GMM Forest based method is used to classify high dimensional datasets with a large number of classes. In the experiments, this method outperforms other state-of-the-art methods.

Comprehensive experiments were conducted on both synthetic and real-world datasets of diverse nature with different numbers of clusters/classes, features and objects. The experimental results show that the proposed approaches outperform several state-of-the-art methods in finding the number of clusters and the initial cluster centers in complex high dimensional datasets. In addition, the experimental results validate the performance of the GMM-Tree based SSS-GMM Supervised method over other state-of-the-art techniques in classifying high dimensional datasets with a large number of classes. These results suggest that the new algorithms can serve as analytical tools for complex high dimensional data.
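
The classification step described in the abstract (label each cluster with its dominant class, then give a new object the label of its nearest cluster center) can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the function names `dominant_labels` and `classify`, the use of Euclidean distance, and the toy data are assumptions, and the clusters and centers are taken as already produced by some earlier clustering step.

```python
import numpy as np


def dominant_labels(cluster_ids, class_labels, n_clusters):
    """Return the most frequent class label within each cluster.

    Assumes every cluster 0..n_clusters-1 contains at least one object.
    """
    labels = np.empty(n_clusters, dtype=class_labels.dtype)
    for k in range(n_clusters):
        values, counts = np.unique(class_labels[cluster_ids == k],
                                   return_counts=True)
        labels[k] = values[np.argmax(counts)]
    return labels


def classify(x_new, centers, cluster_labels):
    """Give each new object the label of its nearest cluster center.

    Euclidean distance is an assumption; the abstract does not name the metric.
    """
    # Pairwise distances with shape (n_new_objects, n_clusters).
    dists = np.linalg.norm(x_new[:, None, :] - centers[None, :, :], axis=2)
    return cluster_labels[np.argmin(dists, axis=1)]


# Toy data (hypothetical): two clusters with known centers and class labels.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
cluster_ids = np.array([0, 0, 0, 1, 1])
class_labels = np.array([2, 2, 7, 7, 7])

cluster_labels = dominant_labels(cluster_ids, class_labels, n_clusters=2)
pred = classify(np.array([[0.5, -0.2], [4.0, 6.0]]), centers, cluster_labels)
```

With this toy input, cluster 0 is labeled with class 2 and cluster 1 with class 7, so the two new objects receive labels 2 and 7 respectively.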
Keywords/Search Tags: Number of clusters, Initial cluster centers, Ensemble clustering, High dimensional data, Stratified sampling, Unsupervised classification, Decision cluster