
Co-training Method Research Based On Sample Selection Strategy

Posted on: 2021-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Gong
Full Text: PDF
GTID: 2518306194491284
Subject: Computer application technology
Abstract/Summary:
In the real world, many data sets consist of a large amount of unlabeled data and only a small amount of labeled data, which is why semi-supervised learning emerged. Semi-supervised learning uses a small amount of labeled data together with a large amount of unlabeled data for pattern recognition, reducing the cost of manual labeling while classifying more accurately. A further advantage of semi-supervised learning is that it mitigates both the poor generalization ability of supervised models trained on little labeled data and the imprecision of unsupervised models.

Co-training is a simple and effective semi-supervised learning framework. It trains two classifiers on two views and iteratively labels unlabeled samples through the complementary action of the two views. As a form of semi-supervised learning, co-training has attracted many researchers because of its multi-view, cooperative nature. However, several problems remain: which strategy should be used to select unlabeled data for the iterative process; how to find highly ambiguous unlabeled data and label them correctly, so as to avoid the error accumulation caused by mislabeling; and, since co-training trains two classifiers on two views, how to resolve inconsistent labels assigned by the two classifiers to the same data. This paper studies these problems; its main work is as follows.

(1) During co-training iterations, the selected unlabeled samples may carry little useful information, and the classifiers may label the same sample inconsistently, both of which produce mislabeled samples. To address this, the paper proposes a co-training method that combines semi-supervised clustering with the weighted K-nearest neighbor algorithm. In each iteration, the method first performs semi-supervised clustering on the training set and passes the unlabeled samples with high membership degree to the naive Bayes classifiers; unlabeled samples that the classifiers label inconsistently are then reclassified by weighted K-nearest neighbors (a sketch of this loop appears below). Comparison experiments on UCI data sets verify the effectiveness of the algorithm.

(2) In co-training, highly ambiguous data are easily mislabeled, which lowers classifier accuracy, and the unlabeled data chosen in each iteration often carry little useful information. To address this, a co-training method combining active learning with density peak clustering is proposed. Before each iteration, active learning selects the unlabeled data with the highest ambiguity and adds them to the labeled sample set; density peak clustering then computes the density and relative distance of each unlabeled point. During the iteration, unlabeled data with higher density and larger relative distance are selected and labeled by the naive Bayes classifier, and the process repeats until the termination condition is satisfied (a sketch appears below). Experiments on 9 UCI data sets show that, compared with SSLCA, the proposed algorithm improves accuracy by up to 6.6667 percentage points, with an average improvement of 1.4638 percentage points.
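The sketch below illustrates one plausible reading of method (1); the abstract does not fix the details, so several choices here are assumptions: k-means with a distance-based soft membership stands in for the semi-supervised clustering, two GaussianNB models play the view-specific naive Bayes classifiers, and scikit-learn's distance-weighted KNN arbitrates their disagreements. The view index arrays view1/view2, the batch size, and the membership threshold tau are illustrative parameters, not values from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def cotrain_cluster_wknn(X_lab, y_lab, X_unlab, view1, view2,
                         n_iter=10, batch=10, tau=0.8):
    """Per iteration: cluster the pooled data, keep unlabeled points with
    high soft membership, label them with two view-specific naive Bayes
    classifiers, and send disagreements to a weighted-KNN arbiter."""
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        # Semi-supervised clustering stand-in (assumption): k-means on the
        # pooled data, with soft membership derived from centre distances.
        km = KMeans(n_clusters=len(np.unique(y_lab)), n_init=10)
        km.fit(np.vstack([X_lab, X_unlab]))
        d = km.transform(X_unlab)  # distances to cluster centres
        memb = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        confident = np.argsort(memb.max(axis=1))[::-1][:batch]
        confident = confident[memb.max(axis=1)[confident] >= tau]
        if len(confident) == 0:
            break
        Xc = X_unlab[confident]
        # Two naive Bayes classifiers, one per feature view.
        nb1 = GaussianNB().fit(X_lab[:, view1], y_lab)
        nb2 = GaussianNB().fit(X_lab[:, view2], y_lab)
        p1 = nb1.predict(Xc[:, view1])
        p2 = nb2.predict(Xc[:, view2])
        # Distance-weighted KNN relabels the points the views disagree on.
        wknn = KNeighborsClassifier(n_neighbors=5, weights="distance")
        wknn.fit(X_lab, y_lab)
        y_new = np.where(p1 == p2, p1, wknn.predict(Xc))
        X_lab = np.vstack([X_lab, Xc])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = np.delete(X_unlab, confident, axis=0)
    return X_lab, y_lab
```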
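For method (2), the sketch below combines an entropy-based active-learning query with the standard Rodriguez-Laio density-peaks quantities. The oracle callable (standing in for the human annotator), the cutoff quantile dc_quantile, and the rho*delta ranking are assumptions of this sketch, and the two-view structure of the full co-training loop is collapsed to a single naive Bayes classifier for brevity.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.naive_bayes import GaussianNB

def density_peaks(X, dc_quantile=0.02):
    """Density peaks: local density rho (cutoff kernel) and relative
    distance delta to the nearest point of higher density."""
    D = pairwise_distances(X)
    dc = np.quantile(D[np.triu_indices_from(D, k=1)], dc_quantile)
    rho = (D < dc).sum(axis=1) - 1  # exclude the point itself
    delta = np.full(len(X), D.max())
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        if len(higher):
            delta[i] = D[i, higher].min()
    return rho, delta

def cotrain_al_dpc(X_lab, y_lab, X_unlab, oracle,
                   n_iter=10, n_query=5, batch=10):
    """Query the oracle about the most ambiguous points, then self-label
    the unlabeled points with the largest rho*delta via naive Bayes."""
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        nb = GaussianNB().fit(X_lab, y_lab)
        proba = nb.predict_proba(X_unlab)
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
        # Active learning: hand the highest-entropy points to the oracle.
        ask = np.argsort(entropy)[::-1][:n_query]
        X_lab = np.vstack([X_lab, X_unlab[ask]])
        y_lab = np.concatenate([y_lab, oracle(X_unlab[ask])])
        X_unlab = np.delete(X_unlab, ask, axis=0)
        if len(X_unlab) == 0:
            break
        # Prefer dense points that are far from any denser point.
        rho, delta = density_peaks(X_unlab)
        pick = np.argsort(rho * delta)[::-1][:batch]
        y_pick = GaussianNB().fit(X_lab, y_lab).predict(X_unlab[pick])
        X_lab = np.vstack([X_lab, X_unlab[pick]])
        y_lab = np.concatenate([y_lab, y_pick])
        X_unlab = np.delete(X_unlab, pick, axis=0)
    return X_lab, y_lab
```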
(3) Because selecting unlabeled samples purely by a high-confidence strategy is not always effective in co-training, a co-training method based on entropy and multiple criteria is proposed. Before each iteration, the features are divided into two views by entropy; the clustering criterion and the confidence criterion then select unlabeled samples for view 1 and view 2, respectively. In addition, to ensure that the selected unlabeled samples are more valuable, the multi-criteria selection makes full use of the labeled samples (a sketch appears below). Experiments on 9 UCI data sets show the effectiveness of the proposed algorithm.
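The abstract does not say how the entropy-based view split or the two criteria are computed, so the sketch below is only one guess: features are ranked by discretized entropy and dealt alternately into two views, the clustering criterion picks the unlabeled points closest to k-means centres seeded by the labeled class means (one way the labeled samples can inform the selection), and the confidence criterion picks the points a view-2 naive Bayes model is most sure of. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

def entropy_view_split(X, n_bins=10):
    """Rank features by discretized entropy and deal them alternately
    into two views, so both views carry informative features."""
    ent = []
    for j in range(X.shape[1]):
        counts = np.histogram(X[:, j], bins=n_bins)[0]
        p = counts / counts.sum()
        ent.append(-(p[p > 0] * np.log(p[p > 0])).sum())
    order = np.argsort(ent)[::-1]
    return order[0::2], order[1::2]

def select_multicriteria(X_lab, y_lab, X_unlab, batch=10):
    """View 1, clustering criterion: unlabeled points nearest a centre
    seeded by labeled class means. View 2, confidence criterion:
    points with the highest naive Bayes posterior."""
    v1, v2 = entropy_view_split(np.vstack([X_lab, X_unlab]))
    # Clustering criterion: labeled class means anchor the centres.
    centres = np.vstack([X_lab[y_lab == c][:, v1].mean(axis=0)
                         for c in np.unique(y_lab)])
    km = KMeans(n_clusters=len(centres), init=centres, n_init=1)
    km.fit(np.vstack([X_lab[:, v1], X_unlab[:, v1]]))
    d1 = km.transform(X_unlab[:, v1]).min(axis=1)
    pick1 = np.argsort(d1)[:batch]
    # Confidence criterion: maximum class posterior on view 2.
    nb2 = GaussianNB().fit(X_lab[:, v2], y_lab)
    conf2 = nb2.predict_proba(X_unlab[:, v2]).max(axis=1)
    pick2 = np.argsort(conf2)[::-1][:batch]
    return pick1, pick2
```

In a full co-training loop, the points returned for one view would typically be labeled and handed to the classifier trained on the other view, in the usual complementary fashion the abstract describes.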
Keywords/Search Tags: Co-training, sample selection, K-nearest neighbor, active learning, clustering