Research On Self-training Classification Based On The Cut Edge Weight Statistics

Posted on: 2021-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: D N Wei
GTID: 2518306050967149
Subject: Applied Mathematics
Abstract/Summary:
Self-training is one of the most commonly used algorithms in semi-supervised classification. It trains a classifier through autonomous iteration and is simple and effective. However, during the iterations unlabeled samples are easily misclassified, and the wrong labels are then used in subsequent training rounds, so errors accumulate and classification accuracy drops. This thesis studies self-training classification in depth from the perspective of handling mislabeled samples. The main work is as follows.

First, a self-training algorithm based on density peaks and cut edge weight statistics (ST-DP-CEWS) is proposed. In each training iteration, the spatial structure of the dataset is first discovered by finding the density peaks of the samples during density clustering, and this structure is used to select representative unlabeled samples for label prediction. The cut edge weight statistic (CEWS) is then used to determine whether these samples are correctly labeled. Finally, the correctly labeled samples are used to gradually update the labeled set. The predicted labels of mislabeled samples are deleted, and these samples are reclassified in a later iteration. ST-DP-CEWS thus makes full use of the spatial structure of the entire dataset and reduces the impact of mislabeled samples on self-training, which improves classification accuracy. To verify its effectiveness, ST-DP-CEWS is compared with other related classification algorithms on 14 real datasets. The experimental results show that the proposed algorithm achieves higher classification accuracy than the compared algorithms.

Second, the proposed ST-DP-CEWS algorithm is improved in two respects. On the one hand, because handling mislabeled samples plays a key role in self-training, the edge weight calculation in the CEWS method is improved so that mislabeled samples are identified more accurately and effectively. For each sample to be tested, the distances to its near neighbors are normalized by the maximum distance, the probability that the sample has the same label as each near neighbor is computed with a Gaussian kernel function, and the weight of every edge is then calculated. On the other hand, the distance measure of the algorithm is improved: the Mahalanobis distance, which depends on the distribution of the samples, is used to better measure the similarity between samples. With these two improvements, ST-DP-CEWS takes the structure and distribution of the whole dataset into account when calculating edge weights and the similarity between samples, so its classification accuracy is increased to a certain extent. Experiments on the 14 real datasets show that the improved ST-DP-CEWS algorithm has better classification performance.
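
The abstract describes the improved edge-weight step only in words, so the following Python sketch is one possible reading of it, not the thesis's actual formulas. The Gaussian kernel form, the bandwidth `sigma`, the use of a pseudo-inverse covariance, and all function names are assumptions made for illustration. The last function is a deliberately simplified cut-weight fraction and is not the CEWS hypothesis test itself.

```python
import numpy as np


def mahalanobis_to_all(X, x):
    """Mahalanobis distance from point x to every row of X.

    Assumption: the covariance is estimated from X itself; the
    pseudo-inverse is used so a singular covariance does not fail.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    diff = X - x
    # Quadratic form diff_i^T * inv_cov * diff_i for each row i.
    sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(sq, 0.0))


def gaussian_edge_weights(neighbor_dists, sigma=1.0):
    """Normalize neighbor distances by their maximum, then apply a
    Gaussian kernel so closer neighbors get larger weights.

    Assumption: the kernel is exp(-d^2 / (2 * sigma^2)) on the
    normalized distance d; the abstract does not give the exact form.
    """
    neighbor_dists = np.asarray(neighbor_dists, dtype=float)
    d_max = neighbor_dists.max()
    if d_max > 0:
        scaled = neighbor_dists / d_max
    else:
        scaled = np.zeros_like(neighbor_dists)
    return np.exp(-(scaled ** 2) / (2.0 * sigma ** 2))


def cut_weight_fraction(weights, neighbor_labels, label):
    """Share of total edge weight carried by cut edges, i.e. edges to
    neighbors whose label differs from the sample's predicted label.

    Simplified illustration only: a large fraction hints that the
    predicted label disagrees with the neighborhood.
    """
    weights = np.asarray(weights, dtype=float)
    neighbor_labels = np.asarray(neighbor_labels)
    cut = weights[neighbor_labels != label].sum()
    return cut / weights.sum()
```

In this sketch, a sample whose neighbors mostly share its predicted label gets a small cut-weight fraction, while a sample surrounded by differently labeled neighbors gets a large one. How that quantity is turned into an accept or reject decision is left open here, since the abstract does not specify the test.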
Keywords/Search Tags: Classification, Semi-supervised, Self-training, Density peaks, Cut edge weight statistics