
Research On Sample Denoising In Semi-Supervised Co-Training Algorithm

Posted on: 2022-08-08
Degree: Master
Type: Thesis
Country: China
Candidate: X Gong
Full Text: PDF
GTID: 2518306530962469
Subject: Computer application technology
Abstract/Summary:
Compared with traditional machine learning methods, the advantage of semi-supervised learning is that it can exploit scarce labeled samples and massive unlabeled samples simultaneously to train a model. Semi-supervised learning inherits the strengths of both supervised and unsupervised learning while avoiding their shortcomings, and thereby improves the generalization and accuracy of the model. Co-training is an important research direction in semi-supervised learning: its main idea is to train two classifiers on two sufficient and redundant views and to classify unlabeled samples through iterative cooperation between the classifiers. The co-training algorithm benefits from multi-view complementarity and performs well and robustly when the training data consists of few labeled samples and a large number of unlabeled samples, so it has been widely studied and applied in many areas.

However, noise remains the key obstacle to improving the performance of co-training, and it arises from several sources. Noisy samples in the initial training set cause large errors in the early stage of training; these errors accumulate as training progresses, forming a vicious circle. View segmentation that ignores noisy features introduces additional noise and consumes substantial time and memory on high-dimensional data. When the data lacks two sufficient and redundant views, the central problem is how to segment the views effectively, so that two independent and complete classifiers can be trained to cooperate well while avoiding the noise introduced by a weak classifier during classification. Handling samples that the two view classifiers label inconsistently is likewise key to reducing noise during iteration.

This paper studies the problem of sample denoising in semi-supervised co-training. The main research contributions are as follows.

(1) In the standard co-training algorithm, insufficient redundancy in view segmentation leads to error accumulation in the two classifiers and inconsistent classification of the same unlabeled samples. To address this, a co-training algorithm combining the information gain ratio and K-means clustering is proposed. The information gain ratio of each feature is computed on the labeled samples, and the features with high gain ratio are divided evenly between the two views; this avoids over-fitting and resolves the insufficient redundancy of view segmentation. K-means clustering is then used to find the cluster of each inconsistently labeled sample, and the sample is relabeled on the principle that samples in the same cluster are most similar to one another.

(2) A co-training algorithm based on weighted principal component analysis and improved density peak clustering is proposed. Building on traditional principal component analysis, the method introduces feature weight coefficients to represent the importance of each feature; low-weight features are treated as noisy features that generate interfering information and are removed. The key features are then divided evenly between the two views during view segmentation, so that the two classifiers cooperate more effectively. Finally, improved density peak clustering determines the category of inconsistently labeled samples, which effectively reduces the probability that mislabeled samples become noise.

(3) Aiming at a more flexible and systematic mechanism for handling noise in co-training, a co-training algorithm based on adaptive data-density editing is proposed. First, a novel noise filter is built on data density, which recognizes boundary noise and outlier samples well. A monitoring quantity is then maintained for each unlabeled sample to assess the credibility of its assigned category, which helps ensure samples are labeled correctly from the start and limits the amount of noise introduced. Finally, an adaptive editing strategy based on PAC theory and the monitoring quantity is integrated into the co-training framework: in each training round, the method automatically applies the appropriate noise-processing mechanism according to the amount and state of the noise, reducing the classification error rate while increasing the number of labeled samples.

Experimental results on 12 UCI data sets demonstrate the effectiveness of the proposed algorithms.
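The standard co-training loop the thesis builds on — two classifiers trained on disjoint feature views, each labeling its most confident unlabeled samples for the shared pool — can be sketched as follows. This is a minimal illustration, not the thesis's method: the nearest-centroid classifier, the margin-based confidence, and the per-round quota are all assumptions made for the sketch.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for a view classifier: predicts the nearest class centroid."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_with_confidence(self, X):
        # distance from each sample to each class centroid
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        sd = np.sort(d, axis=1)
        # confidence = margin between the two nearest centroids
        conf = sd[:, 1] - sd[:, 0] if d.shape[1] > 1 else np.ones(len(X))
        return self.classes_[idx], conf

def co_train(X, labeled_mask, y, view1, view2, rounds=5, per_round=2):
    """Grow the labeled set iteratively: each view's classifier labels its
    most confident unlabeled samples and adds them to the shared pool."""
    labeled = labeled_mask.copy()
    labels = y.copy()  # entries where labeled is False are ignored until set
    for _ in range(rounds):
        unlabeled = np.where(~labeled)[0]
        if len(unlabeled) == 0:
            break
        for view in (view1, view2):
            clf = CentroidClassifier().fit(X[labeled][:, view], labels[labeled])
            pred, conf = clf.predict_with_confidence(X[unlabeled][:, view])
            top = np.argsort(-conf)[:per_round]  # most confident samples
            labels[unlabeled[top]] = pred[top]
            labeled[unlabeled[top]] = True
            unlabeled = np.where(~labeled)[0]
            if len(unlabeled) == 0:
                break
    return labels, labeled
```

In the full algorithm each view's classifier would of course be stronger than a centroid rule, and — as the abstract stresses — the samples the two views label inconsistently need their own denoising step.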
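The view-segmentation idea in contribution (1) — rank features by information gain ratio on the labeled samples, then deal the top-ranked features alternately into two views so both views stay informative — might be sketched like this. The median-threshold binarization of continuous features is an illustrative assumption, not the thesis's discretization.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, y):
    """Information gain ratio of a feature binarized at its median."""
    split = feature > np.median(feature)
    h = entropy(y)
    cond = 0.0  # conditional entropy H(y | split)
    iv = 0.0    # intrinsic value (split information)
    for side in (split, ~split):
        if side.sum() == 0:
            continue
        w = side.mean()
        cond += w * entropy(y[side])
        iv -= w * np.log2(w)
    gain = h - cond
    return gain / iv if iv > 0 else 0.0

def split_views(X, y, keep=None):
    """Rank features by gain ratio; deal them alternately into two views.
    `keep` optionally drops the lowest-ranked (presumed noisy) features."""
    scores = np.array([gain_ratio(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores)
    if keep is not None:
        order = order[:keep]
    return order[0::2].tolist(), order[1::2].tolist()
```

Alternating assignment gives both views a comparable share of high-gain-ratio features, which is one plausible reading of the abstract's "evenly divided into two views".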
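Contribution (3) rests on a density-based noise filter for boundary noise and outliers. One plausible reading — flag a sample when its local density (inverse k-th-neighbour distance) is very low, or when its k nearest neighbours mostly disagree with its label — is sketched below; the k-NN density estimate, the quantile threshold, and the agreement threshold are all assumptions, not the thesis's actual filter.

```python
import numpy as np

def density_noise_filter(X, y, k=3, density_quantile=0.1, agreement=0.5):
    """Flag samples as noise when their local density is very low (outliers)
    or their k nearest neighbours mostly carry a different label (boundary noise)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # indices of k nearest neighbours
    kth = np.take_along_axis(d, nn[:, -1:], axis=1).ravel()
    density = 1.0 / (kth + 1e-12)               # inverse k-th neighbour distance
    low_density = density < np.quantile(density, density_quantile)
    agree = (y[nn] == y[:, None]).mean(axis=1)  # neighbour label agreement
    return low_density | (agree < agreement)
```

Note the density cut-off is relative (a bottom quantile), so even clean data would see its sparsest points flagged; an absolute threshold, or the adaptive monitoring quantity the abstract describes, would refine this.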
Keywords/Search Tags:co-training, noise processing, weighted PCA, adaptive editing, PAC theory