How to learn from limited observational data is a classic problem in machine learning. Traditional supervised learning needs a large number of labeled examples to construct a model that predicts the labels of unobserved data. When the amount of labeled data is small, the trained classifier often fails to achieve good generalization performance. In unsupervised learning, because there is no labeled data, the direction of the clustering algorithm is uncertain. How to exploit a small amount of labeled data together with a large amount of unlabeled data to improve learning performance has therefore become one of the hottest topics in machine learning research.

Semi-supervised learning tries to use a large amount of unlabeled data to improve learning performance when labeled data are scarce, and it has a wide range of applications. Based on an analysis of the research state and the existing problems of several kinds of semi-supervised learning algorithms, this thesis mainly investigates the theory, algorithms, and applications of semi-supervised learning in clustering, classification, and dimensionality reduction. The main research achievements are as follows:

1. A semi-supervised clustering algorithm based on graph contraction is proposed. First, the whole sample space is represented as an edge-weighted graph. Then the graph is modified by contraction according to the must-link constraints and graph theory, which strengthens the effect of the must-link constraints. On this basis, the sample space is projected into a subspace by combining the graph Laplacian with the cannot-link constraints, and clustering is conducted over the modified graph. Although the proposed approach utilizes the graph Laplacian, it applies it to the modified graph. Experimental results show that the method indeed reaches its goal on complex datasets, and its performance remains acceptable when only a small number of pairwise constraints are available.

2. A dimensionality reduction algorithm is proposed by combining graph-based semi-supervised learning and active learning.
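The contraction step described above can be sketched as follows. This is a minimal illustrative implementation, not the thesis's actual code: it assumes the graph is given as a dense weight matrix, groups transitively must-linked samples with union-find, and merges each group into one super-node whose edge weights are the sums of the original weights (the function name `contract_must_links` is hypothetical).

```python
import numpy as np

def contract_must_links(W, must_links):
    """Contract must-linked nodes of an edge-weighted graph W into
    super-nodes, summing edge weights to all other nodes.
    Returns the contracted weight matrix and the node-to-supernode map."""
    n = W.shape[0]
    # Union-find groups samples that are transitively must-linked.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in must_links:
        parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    index = {r: k for k, r in enumerate(roots)}
    # Accumulate weights between distinct super-nodes.
    Wc = np.zeros((len(roots), len(roots)))
    for i in range(n):
        for j in range(n):
            a, b = index[find(i)], index[find(j)]
            if a != b:
                Wc[a, b] += W[i, j]
    return Wc, [index[find(i)] for i in range(n)]
```

Spectral clustering with the graph Laplacian can then be run on the contracted matrix `Wc`, so that must-linked samples are guaranteed to end up in the same cluster.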
The algorithm integrates a graph, semi-supervised learning, and support vector machine-based active learning to reduce the dimensionality of high-dimensional data. First, the high-dimensional observational data set is embedded in a weighted graph, and Laplacian eigenmaps are used to obtain the initial low-dimensional embedding space. Second, support vector machine-based active learning is used to find the ambiguous samples. An updated version of the graph embedding is then computed, in which the weighting only takes into account samples of the same class, using a gravitation constant to attract same-class samples closer together. This process is repeated until the stopping conditions are met. Experiments show that the algorithm performs well for dimensionality reduction of high-dimensional biomedical image and gene expression data.

3. An image annotation algorithm is proposed, which combines image segmentation and kNN-graph-based semi-supervised learning. First, each image is segmented into several local regions, and the kNN graph of all local regions is constructed from a locality-sensitive hashing table. Then the kNN graph of the images is constructed from the kNN graph of the local regions. Finally, a graph-based semi-supervised label propagation algorithm is used to annotate the unlabeled images. Experiments show that the algorithm achieves both high speed and high annotation accuracy, so it can be applied to large image data sets.

4. A method is proposed to calculate the similarity between two images through their local regions. Suppose two images I_{m} and I_{n} have been segmented into p^{m} and p^{n} regions, respectively. Denote the region set of image I_{m} as {r_{1}^{m},...,r_{p^{m}}^{m}}, and similarly the region set of image I_{n} as {r_{1}^{n},...,r_{p^{n}}^{n}}. Obviously, the similarity between images I_{m} and I_{n} depends only on the relationships between the pairwise regions from these two images.
Accordingly, a method is given to calculate the similarity between two images according to the kNN graph of their local regions.

5. A fast method is proposed to find the k nearest neighbors of a sample. The method finds the k nearest neighbors through a hash table and a kNN graph constructed from the samples' local properties. Experiments show that this method finds the k nearest neighbors considerably faster than a linear scan.
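The idea of hash-based neighbor search can be sketched as follows. This is an illustrative toy, not the thesis's implementation (the `LSHIndex` class and its parameters are hypothetical): random-hyperplane locality-sensitive hashing places samples into buckets, and a query is compared only against its own bucket, falling back to a linear scan when the bucket is too small.

```python
import numpy as np

class LSHIndex:
    """Toy random-hyperplane LSH index for approximate kNN search."""

    def __init__(self, data, n_planes=8, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=float)
        # Each random hyperplane contributes one bit of the hash key.
        self.planes = rng.standard_normal((n_planes, self.data.shape[1]))
        self.buckets = {}
        for i, key in enumerate(self._keys(self.data)):
            self.buckets.setdefault(key, []).append(i)

    def _keys(self, X):
        bits = (np.atleast_2d(X) @ self.planes.T) > 0
        return [tuple(row) for row in bits]

    def query(self, q, k=3):
        """Return indices of (approximately) the k nearest samples to q."""
        key = self._keys(q)[0]
        cand = self.buckets.get(key, [])
        if len(cand) < k:  # bucket too small: fall back to a linear scan
            cand = range(len(self.data))
        cand = list(cand)
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return [cand[i] for i in np.argsort(d)[:k]]
```

Because only one bucket is scanned in the common case, query time depends on the bucket size rather than the data set size, which is what makes this faster than a linear scan; the thesis additionally exploits a kNN graph over the samples, which this sketch omits.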