
Research On Semi-supervised Self-training Method

Posted on: 2019-11-29
Degree: Master
Type: Thesis
Country: China
Candidate: J N Li
Full Text: PDF
GTID: 2428330545472498
Subject: Software engineering
Abstract/Summary:
Traditional machine learning relies on large numbers of labeled samples for training, but labeled samples are difficult to obtain in practical applications, whereas unlabeled samples are cheap and easy to get. Semi-supervised learning, which makes full use of a large number of unlabeled samples together with a small number of labeled samples, has therefore attracted growing attention. Among semi-supervised methods, self-training is widely used because it is simple, effective, and requires no specific model assumptions. Self-training nevertheless has several open problems: first, how to handle the unlabeled samples it mislabels; second, how to make better use of low-confidence samples, which it currently underuses; third, how to select the labeled samples that initialize the classifier; and finally, how to select the unlabeled samples to be learned in each iteration so as to improve generalization. To address these problems, this thesis studies the self-training method. The main contributions are as follows.

An ensemble self-training algorithm based on active learning and confidence voting is proposed. Combining confidence with voting addresses two complementary weaknesses: the voting strategy of ensemble self-training tends to mislabel samples near the decision boundary, while the confidence strategy tends to mislabel samples on which the heterogeneous ensemble classifiers predict inconsistent class labels. An active-learning strategy is used to improve the utilization of low-confidence samples. Experiments on UCI data sets show that the algorithm outperforms the comparison algorithms.

A self-training method combining nearest-neighbor density initialization and improved data editing is proposed. The initial labeled samples are selected by a nearest-neighbor density method: whenever a sample is chosen, its k nearest neighbors are removed from the candidate set, so the initial samples are dispersed as widely as possible and the initial labeled set carries more information. To improve data editing, semi-supervised KNN replaces WKNN, overcoming the drawback that WKNN-based editing considers only the influence of labeled samples on the sample under test and ignores the unlabeled samples around it; the improved editing therefore better corrects the mislabeling problem of self-training. The effectiveness of the algorithm is verified by comparative experiments on UCI data sets.

A self-training method combining semi-supervised clustering and data editing is proposed. At every iteration of self-training, it performs semi-supervised clustering on the small labeled set and the large unlabeled set, and selects unlabeled samples with high cluster-membership degree to be classified by Naive Bayes (NB); samples selected in this way are more representative than randomly selected ones. Semi-supervised data editing is then used to filter out samples that have high cluster-membership degree but are mislabeled by NB. The validity of the algorithm is demonstrated on UCI data sets.

Minimal illustrative sketches of the three methods appear below.
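The abstract gives no pseudocode, so the following is a minimal Python sketch of the first method's core loop, under stated assumptions: the heterogeneous ensemble is an arbitrary trio of scikit-learn classifiers, the confidence threshold and query budget are illustrative, and oracle is a hypothetical callable standing in for the active-learning annotator. It is a sketch of the idea, not the thesis's exact algorithm.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def ensemble_self_train(X_l, y_l, X_u, oracle, conf_thresh=0.9,
                            n_query=3, max_iter=10):
        # Heterogeneous ensemble (illustrative choice of base classifiers).
        models = [GaussianNB(), KNeighborsClassifier(n_neighbors=3),
                  DecisionTreeClassifier()]
        X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
        for _ in range(max_iter):
            if len(X_u) == 0:
                break
            for m in models:
                m.fit(X_l, y_l)
            preds = np.stack([m.predict(X_u) for m in models])
            probs = np.mean([m.predict_proba(X_u) for m in models], axis=0)
            conf = probs.max(axis=1)
            pseudo = models[0].classes_[probs.argmax(axis=1)]
            # Confidence voting: accept a pseudo-label only when the ensemble
            # is unanimous AND the averaged confidence clears the threshold.
            accept = np.all(preds == preds[0], axis=0) & (conf >= conf_thresh)
            # Active learning: the least confident samples go to the oracle
            # instead of being discarded.
            query = np.argsort(conf)[:n_query]
            take = accept.copy()
            take[query] = True
            labels = pseudo.copy()
            labels[query] = oracle(X_u[query])
            X_l = np.vstack([X_l, X_u[take]])
            y_l = np.concatenate([y_l, labels[take]])
            X_u = X_u[~take]
        return models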
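The nearest-neighbor density initialization of the second method can be sketched as follows; dispersed_init is a hypothetical name, and the random tie-breaking and parameter values are assumptions rather than the thesis's specification.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def dispersed_init(X, n_init=10, k=5, seed=0):
        # Each chosen sample excludes itself and its k nearest neighbors
        # from later selection, dispersing the initial labeled set.
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        available = np.ones(len(X), dtype=bool)
        chosen = []
        while len(chosen) < n_init and available.any():
            i = rng.choice(np.flatnonzero(available))
            chosen.append(i)
            _, nbrs = nn.kneighbors(X[i:i + 1])
            available[nbrs.ravel()] = False
        return np.asarray(chosen)

The returned indices would then be labeled and used to initialize the self-training classifier; the semi-supervised KNN editing step is analogous in spirit but additionally counts unlabeled/pseudo-labeled neighbors when vetting a new label.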
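For the third method, the sketch below seeds one cluster center per class from the labeled class means as a simple stand-in for the (unspecified) semi-supervised clustering, uses 1/(1+distance) as an assumed membership degree, and keeps an NB-labeled sample only when NB agrees with its cluster's class, mimicking the editing/filtering step.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def cluster_guided_step(X_l, y_l, X_u, top_frac=0.2):
        X_l, y_l, X_u = map(np.asarray, (X_l, y_l, X_u))
        classes = np.unique(y_l)
        # Stand-in for semi-supervised clustering: centers seeded from
        # the labeled class means.
        centers = np.stack([X_l[y_l == c].mean(axis=0) for c in classes])
        dist = np.linalg.norm(X_u[:, None, :] - centers[None, :, :], axis=2)
        member = 1.0 / (1.0 + dist)              # assumed membership degree
        best = member.max(axis=1)
        cluster_lab = classes[member.argmax(axis=1)]
        # Select the unlabeled samples with the highest membership ...
        top = np.argsort(best)[::-1][:max(1, int(top_frac * len(X_u)))]
        # ... label them with NB, and keep only those where NB and the
        # cluster structure agree (the editing/filtering step).
        nb_lab = GaussianNB().fit(X_l, y_l).predict(X_u[top])
        agree = nb_lab == cluster_lab[top]
        return top[agree], nb_lab[agree]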
Keywords/Search Tags: Semi-Supervised Learning, Self-Training, Data Editing, K Nearest Neighbor, Clustering