
On Robust And High-Dimensional Data Clustering

Posted on: 2012-02-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Q Peng
Full Text: PDF
GTID: 1118330338950122
Subject: Computer application technology
Abstract/Summary:
With the development of science and technology, the clustering of high-dimensional, noise-intensive data has become increasingly important in the field of data mining. As one of the main tasks of data mining, clustering analysis has been studied extensively over the past decades, and a wealth of theories and methods have been put forward. Many problems remain, however, especially as data mining technologies spread across industries: datasets have grown complicated, and many high-dimensional datasets, some heavily contaminated by noise, have appeared. These pose new challenges to clustering analysis, including the curse of dimensionality and the high sensitivity of results to noise contamination. This thesis focuses on high-dimensional data clustering and robust clustering, with the main contributions outlined as follows:

1. A new clustering algorithm based on the Gaussian mixture model (GMM) is proposed for discovering clusters with various shape volumes in subspaces. In high-dimensional data, clusters of objects usually exist in subspaces; moreover, different clusters may have different shape volumes. Most existing methods for high-dimensional data clustering, however, consider only the former factor and ignore the latter by assuming the same shape volume for every cluster. We extend the GMM clustering method to compute a local weight vector as well as a local variance within each cluster, and use these weight and variance values to capture the main properties that discriminate clusters, namely their subsets of relevant dimensions and their shape volumes. This is achieved by introducing the negative entropy of the weight vectors, along with adaptively chosen coefficients, into the objective function of the extended GMM. Experimental results on both synthetic and real datasets show that the proposed algorithm outperforms its competitors, especially when applied to high-dimensional datasets.
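The abstract does not give the algorithm's equations. As a rough illustration of the general idea of per-cluster dimension weights obtained through entropy regularization, here is a minimal hard-clustering sketch in the style of entropy-weighted k-means; it is an analogue of the idea, not the author's GMM formulation, and the function names, the seeding scheme, and the parameter gamma are our own assumptions.

```python
import math

def farthest_first(X, k):
    """Deterministic seeding: start from X[0], repeatedly add the point
    farthest from the current centers (illustrative choice only)."""
    centers = [list(X[0])]
    while len(centers) < k:
        far = max(X, key=lambda p: min(
            sum((p[d] - c[d]) ** 2 for d in range(len(p))) for c in centers))
        centers.append(list(far))
    return centers

def entropy_weighted_kmeans(X, k, gamma=1.0, iters=20):
    """Hard-clustering sketch of per-cluster dimension weighting.

    Each cluster keeps a weight vector over dimensions. The weights are the
    closed-form minimizer of weighted dispersion plus gamma times the
    negative entropy of the weights, i.e. a softmax of -dispersion / gamma:
    dimensions along which a cluster is tight get large weights.
    """
    dim = len(X[0])
    centers = farthest_first(X, k)
    weights = [[1.0 / dim] * dim for _ in range(k)]
    labels = [0] * len(X)
    for _ in range(iters):
        # Assignment step: weighted squared Euclidean distance.
        for i, p in enumerate(X):
            labels[i] = min(range(k), key=lambda c: sum(
                weights[c][d] * (p[d] - centers[c][d]) ** 2 for d in range(dim)))
        # Update step: centers, then entropy-regularized dimension weights.
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            if not members:
                continue
            centers[c] = [sum(p[d] for p in members) / len(members)
                          for d in range(dim)]
            disp = [sum((p[d] - centers[c][d]) ** 2 for p in members)
                    for d in range(dim)]
            expo = [math.exp(-disp[d] / gamma) for d in range(dim)]
            s = sum(expo)
            weights[c] = [e / s for e in expo]
    return labels, centers, weights
```

On data where two clusters separate only along the first dimension and the second dimension is uninformative, the learned weight vectors concentrate on the first dimension, which is the subspace-identification behavior the contribution describes.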
2. A new robust algorithm is presented to address the sensitivity to outliers that arises when clustering high-dimensional datasets. Following the existing idea of sample weighting, the algorithm assigns a scalar weight to each sample to separate the role of outliers from that of normal samples during the clustering process, which is intended to ensure the robustness of the algorithm. Subspaces for identifying the clusters are then obtained from the scalar-weighted samples. To our knowledge, this is the first application of the sample weighting technique to a hard-type clustering method. Experimental results show that the proposed algorithm attains high clustering accuracy on datasets of various dimensionalities and with various added noise ratios.

3. From the viewpoint of feature selection, different clusters usually have different subsets of relevant features. A constraint weighting mixture model (CWMM) is proposed for data clustering and local feature selection. We extend the existing mixture model so that the distribution of a cluster on any dimension is the weighted sum of two distributions, i.e., the intrinsic distribution of the cluster and the common distribution of all clusters, which capture the specificity of the cluster and the commonness among clusters, respectively. The relative size of the two weights represents the relevance of the dimension for identifying the cluster. This is achieved by introducing constraints on the per-cluster dimension weights into the log-likelihood function of the model. As a result, the dimensions whose intrinsic-distribution weights are higher constitute the localized feature subset of a cluster. Experiments on both synthetic and real datasets show that the proposed algorithm is superior to most existing subspace clustering methods on high-dimensional datasets with overlapping clusters.
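The sample weighting idea in the robust algorithm above can be sketched as follows. The exponential decay of the weight with distance to the nearest center, and the scale parameter beta, are illustrative assumptions of ours, not the weighting scheme of the thesis; the sketch only shows how per-sample scalar weights keep outliers from dragging the centers.

```python
import math

def sample_weighted_kmeans(X, centers, beta=10.0, iters=10):
    """Hard clustering with per-sample scalar weights.

    Each sample gets a weight that decays with its squared distance to the
    nearest center, so outliers contribute almost nothing to the weighted
    center updates. The exp(-d/beta) form is an illustrative choice.
    """
    dim = len(X[0])
    centers = [list(c) for c in centers]
    labels = [0] * len(X)
    sw = [1.0] * len(X)
    for _ in range(iters):
        for i, p in enumerate(X):
            dists = [sum((p[d] - c[d]) ** 2 for d in range(dim))
                     for c in centers]
            labels[i] = dists.index(min(dists))
            sw[i] = math.exp(-dists[labels[i]] / beta)  # outliers -> ~0
        for c in range(len(centers)):
            idx = [i for i in range(len(X)) if labels[i] == c]
            tot = sum(sw[i] for i in idx)
            if tot > 0:
                centers[c] = [sum(sw[i] * X[i][d] for i in idx) / tot
                              for d in range(dim)]
    return labels, sw, centers
```

With an outlier at 1000 added to two tight one-dimensional clusters, the outlier's weight underflows to essentially zero and the centers stay at the clusters, whereas an unweighted mean update would be pulled far off.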
4. To improve the clustering quality of iterative robust algorithms on outlier-contaminated datasets, we propose a new robust initialization algorithm based on the uniform effect of K-Means. The uniform effect guarantees certain phenomena in the resulting clusters, especially when many more clusters than necessary are requested: clusters lying in sparse sample areas are large, clusters lying in dense sample areas are small, and neighboring clusters in dense areas have comparable sizes. By first running K-Means with an excessive cluster number, we readily obtain these size relationships between clusters. Then, by merging neighboring small clusters, the algorithm recovers the dense sample areas of the data space, which serve as the initial clusters. Outliers, by contrast, are distributed very sparsely; most of them fall into large clusters and therefore barely affect the initialization. Theoretical analysis and extensive experiments demonstrate the effectiveness of the proposed algorithm.

5. An adaptive algorithm based on the multi-metric Lq norm is proposed for robust clustering on datasets contaminated by severe outliers. The core idea rests on the robustness of Lq-norm center estimation in the presence of outliers. The algorithm adopts the Lq norm for measuring distances within clusters, and uses a different parameter qk (qk ∈ (1,2]) for each cluster to adapt to that cluster's unknown noise level. By applying a nonlinear transformation related to qk within each cluster, an adaptive procedure for solving the qk's is obtained and robust cluster centers are estimated. In addition, an outlier detection procedure is developed from the obtained centers. Experiments on both synthetic and real datasets show that the proposed method is superior to many existing methods in both clustering and outlier detection performance.
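The robustness claim behind the Lq-norm method, that the Lq center estimate with q < 2 resists outliers, can be sketched in one dimension. This uses the standard iteratively reweighted least squares (IRLS) fixed point for the Lq minimizer, not the thesis's nonlinear transformation or its per-cluster adaptation of qk; the iteration count and eps guard are our own assumptions.

```python
def lq_center(xs, q=1.2, iters=100, eps=1e-9):
    """IRLS fixed point for argmin_c sum_i |x_i - c| ** q, q in (1, 2].

    Setting the derivative to zero gives c = sum(w_i * x_i) / sum(w_i)
    with w_i = |x_i - c| ** (q - 2); for q < 2 distant points (outliers)
    receive small weights. eps guards the singularity at x_i == c.
    """
    c = sum(xs) / len(xs)  # start from the (non-robust) mean
    for _ in range(iters):
        w = [max(abs(x - c), eps) ** (q - 2) for x in xs]
        c = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return c
```

On five points near 2 plus one gross outlier at 1000, the mean lands near 168 while the Lq center stays with the bulk of the data, which is the robustness property the center estimation relies on.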
Keywords/Search Tags: high-dimensional data, subspace clustering, robust clustering, Gaussian mixture models, local feature relevance, initialization