
Semi-Supervised Clustering Analysis And Its Extended Research

Posted on: 2010-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X S Yin
Full Text: PDF
GTID: 1118330338495767
Subject: Computer application technology

Abstract/Summary:
In the machine learning and data mining communities, one is often confronted with large amounts of unlabeled data. In many real applications, labeled instances are difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators. Consequently, prior knowledge about instances is used to partition the data, and this has become a focus of machine learning research. Semi-supervised clustering (SSC) addresses this problem by using a large amount of unlabeled data, together with prior information about a few instances, to achieve higher accuracy; it also extends naturally to unsupervised clustering to improve clustering performance. SSC has thus become an important topic in the machine learning and data mining communities. In this dissertation, we improve SSC algorithms along the two core lines of SSC research: the learning algorithm and metric learning. Using discriminant analysis and the kernel trick, both popular in machine learning, we propose correspondingly improved SSC models and generalize them to unsupervised learning. The main contributions of this dissertation are summarized as follows:

(1) This dissertation presents a discriminative semi-supervised clustering analysis algorithm with pairwise constraints (DSCA). The proposed algorithm uses supervised information together with abundant unlabeled instances to perform clustering and dimensionality reduction simultaneously via linear discriminant analysis (LDA). Most existing SSC algorithms either focus on supervised information that guides clustering and are not designed for high-dimensional data, or ignore the relationship between clustering and dimensionality reduction. The key idea in DSCA is to integrate dimensionality reduction and clustering in a joint framework so that both problems are mitigated. Meanwhile, the pairwise-constraint-based K-means algorithm (PCBKM) introduced within DSCA both reduces the computational complexity of constraint-based semi-supervised algorithms and resolves the constraint violations that arise in existing semi-supervised clustering algorithms (a sketch of constraint-respecting assignment appears after this summary). Experimental results show that DSCA handles high-dimensional data effectively and delivers appealing clustering performance compared with state-of-the-art SSC algorithms.

(2) Building on the widely used kernel method, this dissertation presents an adaptive semi-supervised clustering kernel method based on metric learning (SCKMM). Its main characteristics can be summarized as follows: i) SCKMM introduces metric learning into nonlinear SSC to improve the separability of the data for clustering; ii) SCKMM makes effective use of cluster membership as the bridge connecting metric learning and SSC; with this connection, clusters are discovered in the transformed space while the transformed space is adaptively re-adjusted toward global optimality; iii) we construct an objective function from pairwise constraints to automatically optimize the parameter of the Gaussian kernel, addressing the problem of manually tuning kernel parameters; this design helps explain how hyperparameters affect the performance of the algorithm and offers a novel way of setting similar hyperparameters (a kernel-width selection sketch follows below); iv) the influence of noisy pairwise constraints on clustering performance, ignored by most semi-supervised clustering algorithms, is preliminarily investigated: we randomly flip a portion of the pairwise constraints to their opposites to form noisy constraints, and experiments on these noisy constraints show that SCKMM is more robust than comparable methods.
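The constraint-respecting assignment referenced in contribution (1) can be illustrated with a short sketch. The following Python routine (NumPy only) is a minimal COP-KMeans-style assignment loop, not the dissertation's actual PCBKM: each point takes the nearest center that violates no must-link or cannot-link constraint against points already assigned in the current pass.

    import numpy as np

    def _mates(pairs, i):
        # All points paired with i, treating constraint pairs as symmetric.
        return [b for a, b in pairs if a == i] + [a for a, b in pairs if b == i]

    def constrained_kmeans(X, k, must_link, cannot_link, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        for _ in range(n_iter):
            new = np.full(len(X), -1)
            for i, x in enumerate(X):
                # Try centers from nearest to farthest; keep the first one
                # that violates no constraint against points assigned so far.
                for c in np.argsort(((centers - x) ** 2).sum(axis=1)):
                    if all(new[j] in (-1, c) for j in _mates(must_link, i)) and \
                       all(new[j] != c for j in _mates(cannot_link, i)):
                        new[i] = c
                        break
                # If no center is feasible, the point stays unassigned (-1);
                # full COP-KMeans would abort here instead.
            if np.array_equal(new, labels):
                break
            labels = new
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return labels

    # Toy usage: two Gaussian blobs with one constraint of each kind.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
    print(constrained_kmeans(X, 2, must_link=[(0, 1)], cannot_link=[(0, 39)]))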
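Contribution (2)'s automatic kernel-parameter tuning can be sketched in a similar spirit. The snippet below is a hypothetical grid-search stand-in for SCKMM's constraint-derived objective (the dissertation optimizes an objective function directly, not a grid). It assumes a Gaussian kernel, for which the squared feature-space distance is 2 - 2k(x, y), and picks the width that keeps must-link pairs close and cannot-link pairs far.

    import numpy as np

    def rkhs_dist2(x, y, sigma):
        # Squared feature-space distance under a Gaussian kernel:
        # ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2k(x,y) = 2 - 2k(x,y).
        return 2.0 - 2.0 * np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

    def select_kernel_width(X, must_link, cannot_link, grid=np.logspace(-2, 2, 41)):
        # Score each candidate width by how well it separates the two
        # constraint sets in feature space, then keep the best one.
        def score(sigma):
            ml = np.mean([rkhs_dist2(X[a], X[b], sigma) for a, b in must_link])
            cl = np.mean([rkhs_dist2(X[a], X[b], sigma) for a, b in cannot_link])
            return cl - ml
        return max(grid, key=score)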
(3) This dissertation develops a more general and effective discriminant clustering learning framework: i) it introduces a method that effectively integrates generalized linear discriminant analysis (GLDA) and regularized soft K-means (RSKM); under this framework, the membership degree of the data is relaxed from the hard binary values {0, 1} to the soft interval [0, 1], so that adaptive dimension reduction using discriminant analysis and K-means clustering (LDA-Km) becomes a special case; ii) with maximum entropy introduced as a regularization term, the proposed entropy-regularized soft K-means algorithm for discriminant analysis (ResKmeans) obtains more effective partitions with the help of GLDA than comparable methods, yielding new insights into the description of real-world data (a soft-assignment sketch appears below); iii) we show that GLDA and soft K-means clustering optimize the same objective function, achieving both minimization of the within-cluster scatter and maximization of the between-cluster scatter; iv) the GLDA formulation generalizes the well-known LDA; v) ResKmeans naturally generalizes existing related work and can also accommodate other off-the-shelf soft clustering algorithms such as Gaussian mixtures, fuzzy c-means, and so on.
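The maximum-entropy relaxation in contribution (3) admits a compact illustration. Minimizing \(\sum_{i,c} u_{ic}\,\lVert x_i - m_c\rVert^2 + \beta^{-1}\sum_{i,c} u_{ic}\log u_{ic}\) over memberships that sum to one per point yields softmax (Boltzmann) assignments \(u_{ic} \propto \exp(-\beta\,\lVert x_i - m_c\rVert^2)\). The following is a minimal sketch of that RSKM idea on its own, without the GLDA coupling, so it is an illustration rather than ResKmeans itself.

    import numpy as np

    def entropy_soft_kmeans(X, k, beta=2.0, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(n_iter):
            # Squared distances from every point to every center: shape (n, k).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            U = np.exp(logits)
            U /= U.sum(axis=1, keepdims=True)            # memberships in [0, 1]
            centers = (U.T @ X) / U.sum(axis=0)[:, None] # membership-weighted means
        return U, centers

As beta grows, the memberships approach the hard {0, 1} assignments of ordinary K-means, which is how the hard case arises as a limit of the soft relaxation.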
Keywords/Search Tags: Semi-supervised Clustering, Pairwise Constraint, Clustering Analysis, Closure Centroid, Metric Learning, Discriminant Analysis, Soft Clustering, Dimensionality Reduction