
Some contributions to semi-supervised learning

Posted on: 2011-01-10
Degree: Ph.D.
Type: Thesis
University: Michigan State University
Candidate: Mallapragada, Pavan Kumar
Full Text: PDF
GTID: 2448390002461796
Subject: Computer Science
Abstract/Summary:
Semi-supervised learning methods attempt to improve the performance of a supervised or an unsupervised learner in the presence of "side information". This side information can take the form of unlabeled samples in the supervised case or pair-wise constraints in the unsupervised case. Most existing semi-supervised learning approaches design a new objective function, which in turn leads to a new algorithm, rather than improving the performance of an already available learner. In this thesis, three classical problems in pattern recognition and machine learning, namely classification, clustering, and unsupervised feature selection, are extended to their semi-supervised counterparts.

Our first contribution is an algorithm that utilizes unlabeled data along with labeled data while training classifiers. Unlike previous approaches that design specialized algorithms to exploit the labeled and unlabeled data, we design a meta-semi-supervised learning algorithm called SemiBoost, which wraps around the underlying supervised algorithm and improves its performance using the unlabeled data and a similarity function. Empirical evaluation on several standard datasets shows a significant improvement in the performance of well-known classifiers (decision stump, decision tree, and SVM).

In the second part of this thesis, we address the problem of designing a mixture model for data clustering that can effectively utilize side-information in the form of pair-wise constraints. Popular mixture models and related algorithms (K-means, Gaussian mixture models) make strong model assumptions and are therefore too rigid to produce different cluster partitions when side-information is supplied. We propose a non-parametric mixture model for data clustering that is flexible enough to detect arbitrarily shaped clusters: kernel density estimates are used to fit the density of each cluster, and the pair-wise constraints ("must-link" and "cannot-link" constraints) are used to select key parameters of the algorithm. The clustering algorithm was tested on a text clustering application, where its performance was superior to that of popular clustering algorithms.

The third part of this thesis focuses on feature selection from unlabeled data using instance-level pair-wise constraints (i.e., pairs of examples labeled as must-link or cannot-link). Using the dual gradient descent method, we designed an efficient online algorithm. Pair-wise constraints are incorporated at the feature selection stage, leaving the user free to apply unsupervised or semi-supervised algorithms at a later stage. The approach was evaluated on the task of image clustering: using pair-wise constraints, the number of features was reduced by around 80%, usually with little or no degradation in clustering performance on all the datasets, and with substantial improvement on some.
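To make the SemiBoost idea concrete, the following is a minimal sketch of a wrapper of that flavor: each round, the unlabeled points whose similarity-weighted vote from the current labeled set is most confident are pseudo-labeled and added to the training set, and the base learner is retrained. The greedy selection rule, the RBF similarity, and all parameter choices are illustrative assumptions, not the thesis's actual boosting formulation.

```python
# Illustrative sketch only: a similarity-driven wrapper in the spirit of
# SemiBoost, not the thesis's actual objective or weighting scheme.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics.pairwise import rbf_kernel

def semiboost_sketch(X_l, y_l, X_u, n_rounds=10, per_round=5):
    """Wrap a supervised base learner (here a decision stump) and grow its
    training set with confidently pseudo-labeled unlabeled points.
    Labels are assumed to be in {-1, +1}."""
    clf = DecisionTreeClassifier(max_depth=1).fit(X_l, y_l)
    for _ in range(n_rounds):
        if len(X_u) == 0:
            break
        S = rbf_kernel(X_u, X_l)          # similarity function (assumed RBF)
        conf = S @ y_l                    # signed, similarity-weighted vote
        pick = np.argsort(-np.abs(conf))[:per_round]   # most confident points
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, np.where(conf[pick] >= 0, 1.0, -1.0)])
        X_u = np.delete(X_u, pick, axis=0)
        clf = DecisionTreeClassifier(max_depth=1).fit(X_l, y_l)
    return clf
```

Because the wrapper only calls fit and predict on the base learner, any of the classifiers evaluated in the thesis (decision stump, decision tree, SVM) could be swapped in.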
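The non-parametric mixture model fits each cluster's density with a kernel density estimate. A minimal sketch of that fit-and-reassign loop follows; the fixed bandwidth, the K-means initialization, and the function names are assumptions made for illustration, whereas the thesis selects key parameters such as these from the pair-wise constraints.

```python
# Illustrative sketch only: per-cluster kernel density estimates with a
# hard reassignment step; bandwidth and initialization are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

def kde_mixture_sketch(X, n_clusters=3, bandwidth=0.5, n_iters=10):
    """Alternate between fitting a KDE to each cluster and reassigning
    every point to the cluster under whose density it is most likely."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    for _ in range(n_iters):
        log_dens = np.full((len(X), n_clusters), -np.inf)
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) == 0:
                continue                  # empty cluster keeps -inf score
            kde = KernelDensity(bandwidth=bandwidth).fit(members)
            # mixing weight = cluster size fraction; density from the KDE
            log_dens[:, k] = np.log(len(members) / len(X)) + kde.score_samples(X)
        labels = log_dens.argmax(axis=1)
    return labels
```

Because the per-cluster density is a kernel density estimate rather than a single Gaussian, clusters of arbitrary shape can be captured, which is the flexibility the thesis argues parametric mixtures lack.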
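For the third contribution, the sketch below shows one way an online algorithm can turn a stream of pair-wise constraints into a sparse feature weight vector: a projected (sub)gradient step shifts weight off features on which must-link pairs disagree and onto features on which cannot-link pairs differ. The update rule and the simplex projection are illustrative stand-ins, not the dual gradient descent method of the thesis; the features whose weight reaches zero are the ones dropped.

```python
# Illustrative sketch only: online feature weighting from pair-wise
# constraints via projected gradient steps on the probability simplex.
import numpy as np

def project_simplex(w):
    """Euclidean projection onto the probability simplex {w >= 0, sum w = 1}."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(w - theta, 0)

def online_feature_weights(pairs, d, lr=0.1):
    """pairs: iterable of (x_i, x_j, c) with c = +1 for must-link and
    c = -1 for cannot-link; d is the number of features."""
    w = np.full(d, 1.0 / d)
    for x_i, x_j, c in pairs:
        diff2 = (x_i - x_j) ** 2     # per-feature squared disagreement
        # must-link: penalize features where the pair disagrees;
        # cannot-link: reward them
        w = project_simplex(w - lr * c * diff2)
    return w
```

Since the projection zeroes out low-weight coordinates, a result in the spirit of the thesis's roughly 80% feature reduction would correspond to most entries of the returned weight vector being exactly zero.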
Keywords/Search Tags: Performance, Semi-supervised, Clustering, Data, Pair-wise constraints