Machine learning has become one of the most important research topics in artificial intelligence and has been widely applied in fields such as natural language processing, biometrics, computer vision, and handwritten digit recognition. Traditional machine learning algorithms can be divided into two categories: supervised and unsupervised learning. In supervised learning, training a good classifier usually requires a large number of labeled samples, after which unseen samples can be classified. However, when only a small number of labeled samples are available, the generalization ability of the trained classifier decreases, and the labeling process is time-consuming and requires extensive expert effort. Unsupervised learning, in contrast, makes no use of labeled samples at all, which leaves the learning process blind and often prevents it from obtaining the desired results. Consequently, semi-supervised learning, which makes full use of both labeled and unlabeled samples to train a classifier, has become a topic of great interest in the machine learning field.

Semi-supervised learning includes semi-supervised clustering, semi-supervised classification, and semi-supervised regression. Based on an analysis of existing work and open problems in semi-supervised learning, this thesis mainly investigates semi-supervised clustering and classification methods.

Firstly, a semi-supervised Gaussian mixture model based on manifold structure is proposed to embed the manifold assumption into a semi-supervised clustering framework. Based on the local consistency of labeled and unlabeled samples, the method employs the Kullback-Leibler divergence to construct a p-nearest-neighbor graph, which is used to exploit the underlying manifold structure. A structure-based graph regularization term and prior knowledge are then incorporated into the objective function of the Gaussian mixture model, and the optimal parameters are finally obtained by the EM algorithm. Clustering results on several synthetic and real-world datasets show the effectiveness of the method, and segmentation results on natural images further demonstrate its practical applicability.

Secondly, since the performance of kernel minimum squared error depends on the number of labeled samples, we introduce manifold regularized kernel minimum squared error, which employs the manifold assumption to make use of both labeled and unlabeled samples. This method constructs a p-nearest-neighbor graph to exploit the underlying manifold structure and incorporates a Laplacian regularization term, based on the graph Laplacian, into the objective function of kernel minimum squared error. The experimental results show that the method can effectively deal with the setting where only a small number of labeled samples are available.

Thirdly, we propose a semi-supervised classification method that uses clustering analysis to improve Self-training, integrating a semi-supervised clustering process into the Self-training loop (a schematic sketch is given below). The basic idea is to use a semi-supervised clustering method to reveal the actual structure of the data space and to select the samples with high certainty according to this structure information. From these pre-selected samples, the unlabeled samples most confidently classified by a discriminative classifier are then added to the labeled set. Compared with standard Self-training, the algorithm thus compensates to some extent for the scarcity of labeled samples by means of the semi-supervised clustering technique.
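To make the loop concrete, the following is a minimal sketch in Python. It is not the thesis implementation: as stand-ins, KMeans seeded with the labeled class means approximates the semi-supervised clustering step, distance to the assigned cluster center approximates the structural "certainty" criterion, and an SVM plays the discriminative classifier; all names and parameters here are illustrative assumptions.

    # Sketch: Self-training guided by a clustering-based pre-selection step.
    # Stand-ins (not from the thesis): seeded KMeans ~ semi-supervised
    # clustering; distance-to-center ~ structural certainty; SVM ~ the
    # discriminative classifier.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def cluster_guided_self_training(X_l, y_l, X_u, n_iter=10, per_iter=5):
        classes = np.unique(y_l)
        X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
        for _ in range(n_iter):
            if len(X_u) == 0:
                break
            # Seed one center per class with the labeled means so the
            # clusters align with the known label structure.
            seeds = np.vstack([X_l[y_l == c].mean(axis=0) for c in classes])
            km = KMeans(n_clusters=len(classes), init=seeds, n_init=1).fit(
                np.vstack([X_l, X_u]))
            # Structural certainty: unlabeled points closest to a center.
            d = np.min(km.transform(X_u), axis=1)
            candidates = np.argsort(d)[:3 * per_iter]
            # Discriminative step: among the pre-selected candidates, move
            # the most confidently classified ones into the labeled set.
            clf = SVC(probability=True).fit(X_l, y_l)
            proba = clf.predict_proba(X_u[candidates])
            best = candidates[np.argsort(proba.max(axis=1))[-per_iter:]]
            X_l = np.vstack([X_l, X_u[best]])
            y_l = np.concatenate([y_l, clf.predict(X_u[best])])
            X_u = np.delete(X_u, best, axis=0)
        return SVC().fit(X_l, y_l)

The point the sketch illustrates is the ordering: the clustering stage filters candidates by data-space structure before the classifier's confidence is consulted, which makes mislabeled additions less likely when the labeled samples cover the space poorly.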
On a synthetic dataset where the data space covered by the labeled samples is not consistent with the real data space, the results demonstrate that our algorithm achieves good generalization ability, and the results on real-world datasets further show its effectiveness and robustness.

Finally, we introduce a Self-training algorithm that uses a semi-supervised dimensionality reduction technique together with affinity propagation, embedding the dimensionality reduction method into the Self-training process. The advantages are twofold: on the one hand, the semi-supervised dimensionality reduction technique can overcome the curse of dimensionality when only a small number of labeled samples are available; on the other hand, compared with templates computed as sample means or by k-means, the templates obtained by affinity propagation are actual samples rather than virtual ones, which yields better results with a 1-nearest-neighbor classifier when the samples are generated from a non-Gaussian distribution. We apply the algorithm to face recognition, and the classification results show that it handles high-dimensional data better and achieves higher recognition accuracy than other Self-training methods.
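The role of affinity propagation in this last method can be illustrated with a short sketch. This is only a sketch under stated assumptions, not the thesis pipeline: plain PCA stands in for the semi-supervised dimensionality reduction step, and scikit-learn's AffinityPropagation supplies per-class exemplars that serve as 1-nearest-neighbor templates.

    # Sketch: affinity-propagation exemplars as 1-NN templates.
    # Stand-ins (not the thesis method): PCA replaces the semi-supervised
    # dimensionality reduction; exemplars come from AffinityPropagation
    # run separately on each labeled class.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import AffinityPropagation

    def ap_templates_1nn(X_l, y_l, X_test, n_components=30):
        pca = PCA(n_components=n_components).fit(X_l)
        Z_l, Z_test = pca.transform(X_l), pca.transform(X_test)
        templates, labels = [], []
        for c in np.unique(y_l):
            Zc = Z_l[y_l == c]
            ap = AffinityPropagation(random_state=0).fit(Zc)
            # Exemplars are real samples of the class, not virtual means,
            # so they stay faithful to non-Gaussian class shapes.
            templates.append(Zc[ap.cluster_centers_indices_])
            labels.extend([c] * len(ap.cluster_centers_indices_))
        T, labels = np.vstack(templates), np.asarray(labels)
        # 1-nearest-neighbor assignment against the exemplar templates.
        d = ((Z_test[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
        return labels[np.argmin(d, axis=1)]

The design choice the abstract emphasizes is visible in the last lines: each template row of T is an actual training sample (in face recognition, a real face image), so 1-NN distances are measured to real samples rather than to averaged prototypes.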