
Co-training Method Research Based On Sample Selection Strategy

Posted on: 2021-10-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Gong
Full Text: PDF
GTID: 2518306194491284
Subject: Computer application technology
Abstract/Summary:
In the real world, many data sets consist of a large amount of unlabeled data and only a small amount of labeled data, which is why semi-supervised learning emerged. Semi-supervised learning uses a small amount of labeled data together with a large amount of unlabeled data for pattern recognition, reducing the cost of manual labeling while classifying more accurately. A further advantage of semi-supervised learning is that it mitigates both the poor generalization ability of supervised models trained on little labeled data and the imprecision of unsupervised models.

Co-training is a simple and effective semi-supervised learning framework. It trains two classifiers on two views and iteratively labels unlabeled samples through the complementary action of the two views. As a form of semi-supervised learning, co-training has attracted many researchers because of its multi-view, cooperative nature. However, several problems remain: which strategy should be used to select unlabeled data for the iterative process; how to find highly ambiguous unlabeled data and label them correctly, so as to avoid the error accumulation caused by mislabeling; and, since co-training trains two classifiers on two views, how to resolve inconsistent labels assigned by the two classifiers to the same data. This paper studies these problems; its main work is as follows.

(1) During co-training iterations, the selected unlabeled samples may carry little useful information, and the classifiers may label the same sample inconsistently, both of which produce mislabeled samples. To address this, the paper proposes a co-training method that combines semi-supervised clustering with the weighted K-nearest neighbor algorithm. In each iteration, the method first performs semi-supervised clustering on the training set and passes the unlabeled samples with high membership degree to the naive Bayes classifiers; unlabeled samples that the classifiers label inconsistently are then reclassified by weighted K-nearest neighbors (a sketch of this loop appears below). Comparison experiments on UCI data sets verify the effectiveness of the algorithm.

(2) In co-training, highly ambiguous data are easily mislabeled, which lowers classifier accuracy, and the unlabeled data chosen in each iteration often carry little useful information. To address this, a co-training method combining active learning with density peak clustering is proposed. Before each iteration, active learning selects the unlabeled data with the highest ambiguity and adds them to the labeled sample set; density peak clustering then computes the density and relative distance of each unlabeled point. During the iteration, unlabeled data with higher density and larger relative distance are selected and labeled by the naive Bayes classifier, and the process repeats until the termination condition is satisfied (a sketch appears below). Experiments on 9 UCI data sets show that, compared with SSLCA, the proposed algorithm improves accuracy by up to 6.6667 percentage points, with an average improvement of 1.4638 percentage points.
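The sketch below illustrates one plausible reading of method (1); the abstract does not fix the details, so several choices here are assumptions: k-means with a distance-based soft membership stands in for the semi-supervised clustering, two GaussianNB models play the view-specific naive Bayes classifiers, and scikit-learn's distance-weighted KNN arbitrates their disagreements. The view index arrays view1/view2, the batch size, and the membership threshold tau are illustrative parameters, not values from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def cotrain_cluster_wknn(X_lab, y_lab, X_unlab, view1, view2,
                         n_iter=10, batch=10, tau=0.8):
    """Per iteration: cluster the pooled data, keep unlabeled points with
    high soft membership, label them with two view-specific naive Bayes
    classifiers, and send disagreements to a weighted-KNN arbiter."""
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        # Semi-supervised clustering stand-in (assumption): k-means on the
        # pooled data, with soft membership derived from centre distances.
        km = KMeans(n_clusters=len(np.unique(y_lab)), n_init=10)
        km.fit(np.vstack([X_lab, X_unlab]))
        d = km.transform(X_unlab)  # distances to cluster centres
        memb = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        confident = np.argsort(memb.max(axis=1))[::-1][:batch]
        confident = confident[memb.max(axis=1)[confident] >= tau]
        if len(confident) == 0:
            break
        Xc = X_unlab[confident]
        # Two naive Bayes classifiers, one per feature view.
        nb1 = GaussianNB().fit(X_lab[:, view1], y_lab)
        nb2 = GaussianNB().fit(X_lab[:, view2], y_lab)
        p1 = nb1.predict(Xc[:, view1])
        p2 = nb2.predict(Xc[:, view2])
        # Distance-weighted KNN relabels the points the views disagree on.
        wknn = KNeighborsClassifier(n_neighbors=5, weights="distance")
        wknn.fit(X_lab, y_lab)
        y_new = np.where(p1 == p2, p1, wknn.predict(Xc))
        X_lab = np.vstack([X_lab, Xc])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = np.delete(X_unlab, confident, axis=0)
    return X_lab, y_lab
```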
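For method (2), the sketch below combines an entropy-based active-learning query with the standard Rodriguez-Laio density-peaks quantities. The oracle callable (standing in for the human annotator), the cutoff quantile dc_quantile, and the rho*delta ranking are assumptions of this sketch, and the two-view structure of the full co-training loop is collapsed to a single naive Bayes classifier for brevity.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.naive_bayes import GaussianNB

def density_peaks(X, dc_quantile=0.02):
    """Density peaks: local density rho (cutoff kernel) and relative
    distance delta to the nearest point of higher density."""
    D = pairwise_distances(X)
    dc = np.quantile(D[np.triu_indices_from(D, k=1)], dc_quantile)
    rho = (D < dc).sum(axis=1) - 1  # exclude the point itself
    delta = np.full(len(X), D.max())
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        if len(higher):
            delta[i] = D[i, higher].min()
    return rho, delta

def cotrain_al_dpc(X_lab, y_lab, X_unlab, oracle,
                   n_iter=10, n_query=5, batch=10):
    """Query the oracle about the most ambiguous points, then self-label
    the unlabeled points with the largest rho*delta via naive Bayes."""
    for _ in range(n_iter):
        if len(X_unlab) == 0:
            break
        nb = GaussianNB().fit(X_lab, y_lab)
        proba = nb.predict_proba(X_unlab)
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
        # Active learning: hand the highest-entropy points to the oracle.
        ask = np.argsort(entropy)[::-1][:n_query]
        X_lab = np.vstack([X_lab, X_unlab[ask]])
        y_lab = np.concatenate([y_lab, oracle(X_unlab[ask])])
        X_unlab = np.delete(X_unlab, ask, axis=0)
        if len(X_unlab) == 0:
            break
        # Prefer dense points that are far from any denser point.
        rho, delta = density_peaks(X_unlab)
        pick = np.argsort(rho * delta)[::-1][:batch]
        y_pick = GaussianNB().fit(X_lab, y_lab).predict(X_unlab[pick])
        X_lab = np.vstack([X_lab, X_unlab[pick]])
        y_lab = np.concatenate([y_lab, y_pick])
        X_unlab = np.delete(X_unlab, pick, axis=0)
    return X_lab, y_lab
```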
(3) Because selecting unlabeled samples purely by a high-confidence strategy is not always effective in co-training, a co-training method based on entropy and multiple criteria is proposed. Before each iteration, the features are divided into two views by entropy; the clustering criterion and the confidence criterion then select unlabeled samples for view 1 and view 2, respectively. In addition, to ensure that the selected unlabeled samples are more valuable, the multi-criteria selection makes full use of the labeled samples (a sketch appears below). Experiments on 9 UCI data sets show the effectiveness of the proposed algorithm.
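The abstract does not say how the entropy-based view split or the two criteria are computed, so the sketch below is only one guess: features are ranked by discretized entropy and dealt alternately into two views, the clustering criterion picks the unlabeled points closest to k-means centres seeded by the labeled class means (one way the labeled samples can inform the selection), and the confidence criterion picks the points a view-2 naive Bayes model is most sure of. All names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

def entropy_view_split(X, n_bins=10):
    """Rank features by discretized entropy and deal them alternately
    into two views, so both views carry informative features."""
    ent = []
    for j in range(X.shape[1]):
        counts = np.histogram(X[:, j], bins=n_bins)[0]
        p = counts / counts.sum()
        ent.append(-(p[p > 0] * np.log(p[p > 0])).sum())
    order = np.argsort(ent)[::-1]
    return order[0::2], order[1::2]

def select_multicriteria(X_lab, y_lab, X_unlab, batch=10):
    """View 1, clustering criterion: unlabeled points nearest a centre
    seeded by labeled class means. View 2, confidence criterion:
    points with the highest naive Bayes posterior."""
    v1, v2 = entropy_view_split(np.vstack([X_lab, X_unlab]))
    # Clustering criterion: labeled class means anchor the centres.
    centres = np.vstack([X_lab[y_lab == c][:, v1].mean(axis=0)
                         for c in np.unique(y_lab)])
    km = KMeans(n_clusters=len(centres), init=centres, n_init=1)
    km.fit(np.vstack([X_lab[:, v1], X_unlab[:, v1]]))
    d1 = km.transform(X_unlab[:, v1]).min(axis=1)
    pick1 = np.argsort(d1)[:batch]
    # Confidence criterion: maximum class posterior on view 2.
    nb2 = GaussianNB().fit(X_lab[:, v2], y_lab)
    conf2 = nb2.predict_proba(X_unlab[:, v2]).max(axis=1)
    pick2 = np.argsort(conf2)[::-1][:batch]
    return pick1, pick2
```

In a full co-training loop, the points returned for one view would typically be labeled and handed to the classifier trained on the other view, in the usual complementary fashion the abstract describes.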
Keywords/Search Tags: Co-training, sample selection, K-nearest neighbor, active learning, clustering