
Study On Natural Neighbors-Based Semi-Supervised And Imbalanced Classification

Posted on: 2023-11-23 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J N Li | Full Text: PDF
GTID: 1528306821992479 | Subject: Computer Science and Technology
Abstract/Summary:
The basic idea of classification is to use a data mining model learned from labeled data to recognize unlabeled data. Semi-supervised classification and imbalanced classification are its two most typical tasks. In semi-supervised classification, the number of labeled samples is limited by the cost and time of labeling, while a large number of unlabeled samples are available; a semi-supervised classification model trains an effective classifier from a small number of labeled samples together with a large number of unlabeled samples. In imbalanced classification, the number of samples in one class (the minority class, or positive cases) is much smaller than that in the other classes (the majority class, or negative cases), and the main purpose of the imbalanced classification model is to improve classification accuracy on the minority class. The natural neighbor is a relatively new nearest-neighbor concept: it is parameter-free, better suited to manifold data, and able to exclude outliers. Based on the concept of the natural neighbor, this dissertation studies self-labeling (self-training) methods for semi-supervised classification and oversampling methods for imbalanced classification, addressing open challenges in existing approaches to both tasks. The main innovations and contributions of this dissertation are as follows:

(1) A semi-supervised self-training method based on density peaks and an extended local noise filter (STDPNF) is proposed. STDPNF mitigates mislabeling in semi-supervised self-training and overcomes the shortcomings of existing self-training methods built on local noise filters, namely that the employed filters depend on parameters and cannot exploit the large pool of unlabeled samples. First, STDPNF uses density peak clustering to reveal the spatial structure of the data and to select high-confidence unlabeled samples during iteration. Second, based on the natural neighbor, an extended parameter-free local noise filter (ENaNE) is proposed; ENaNE uses both labeled and unlabeled data to filter out mislabeled samples in each iteration. Finally, STDPNF iterates the semi-supervised self-training process and effectively trains a given classifier on the improved labeled data. Experiments show that ENaNE outperforms 5 representative local noise filters and that STDPNF outperforms 4 popular semi-supervised self-training methods.

(2) A semi-supervised self-labeling framework based on local cores (LC-SSC) is proposed. LC-SSC addresses the problem that self-labeling methods are limited by the number and distribution of the initial labeled data, and overcomes the weaknesses of existing remedies, i.e., poor performance when labeled data are few and inability to handle non-spherical data distributions. First, LC-SSC uses the local core technique based on natural neighbors to find unlabeled local cores in the semi-supervised data set. Then, LC-SSC predicts the found unlabeled local cores through active labeling and co-labeling, and adds the predicted local cores to the labeled set, improving both the number and the distribution of the initial labeled data. Finally, any semi-supervised self-labeling method can be run effectively on the improved data. Experiments show that LC-SSC outperforms 2 representative semi-supervised self-labeling frameworks, and that it improves the performance of 2 self-labeling methods, especially when labeled data are scarce.

(3) A synthetic minority oversampling technique based on natural neighbors (NaNSMOTE) is proposed. NaNSMOTE resolves two neighborhood problems in existing oversampling methods: dependence on the neighbor parameter k, and use of the same number of neighbors for every minority sample when generating synthetic samples. In NaNSMOTE, an interpolation based on natural neighbors is proposed to create synthetic minority class samples, which are then used to improve the number and distribution of the minority class. This interpolation needs no parameter k; moreover, borderline samples receive fewer neighbors, reducing the error rate of synthetic samples, while central samples receive more neighbors, improving the generalization of synthetic samples. Experiments show that NaNSMOTE outperforms SMOTE, and that 6 improved SMOTE-based methods can adopt the idea of NaNSMOTE to improve their performance.

(4) An oversampling method based on natural neighbors and differential evolution (SMOTE-NaN-DE) is proposed. SMOTE-NaN-DE counters noise generation and overcomes the shortcomings of existing filtering-based oversampling methods, whose noise detection techniques depend on parameters and which remove suspicious noise directly instead of improving it. First, SMOTE-NaN-DE applies any SMOTE-based oversampling method to generate synthetic minority class samples. Second, it uses natural neighbors as the noise detection technique to find suspicious noise among both the synthetic samples and the original data. Finally, it optimizes the detected suspicious noise with differential evolution. Experiments show that SMOTE-NaN-DE improves the performance of 4 popular oversampling methods and solves their noise generation problems; moreover, SMOTE-NaN-DE surpasses 4 representative direction-change methods and 3 representative filtering-based methods in dealing with noise.
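The natural-neighbor concept underlying all four contributions can be sketched as follows. This is a minimal illustration of the general idea, not the dissertation's code: the neighborhood size r grows until the number of points that nobody has named as a neighbor stops shrinking, and mutual r-nearest-neighbor pairs are taken as natural neighbors (the function name and the exact stopping rule are assumptions for this sketch).

```python
import numpy as np

def natural_neighbors(X):
    """Parameter-free natural-neighbor search (illustrative sketch).

    Grows the neighborhood size r until the count of "orphan" points
    (points not yet chosen as anyone's neighbor) stops decreasing;
    mutual r-NN pairs are returned as natural neighbors.
    """
    n = len(X)
    # brute-force pairwise Euclidean distances; push self-distance to infinity
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)       # order[i] lists i's neighbors, nearest first

    named = np.zeros(n, dtype=bool)     # named[j]: some point has chosen j as a neighbor
    prev_orphans = n + 1
    r = 0
    while True:
        r += 1
        named[order[:, r - 1]] = True   # every point names its r-th nearest neighbor
        orphans = int(n - named.sum())  # points still chosen by nobody (outlier candidates)
        if orphans == 0 or orphans == prev_orphans or r >= n - 1:
            break                       # neighborhood structure has stabilized
        prev_orphans = orphans

    knn = [set(order[i, :r]) for i in range(n)]
    # natural neighbors: mutual membership in each other's r-NN lists
    nan_sets = {i: {j for j in knn[i] if i in knn[j]} for i in range(n)}
    return r, nan_sets
```

Because r is found adaptively, dense (central) points end up with many mutual neighbors and sparse (borderline or outlier) points with few or none, which is exactly the property ENaNE, NaNSMOTE, and SMOTE-NaN-DE exploit.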
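The interpolation idea behind NaNSMOTE, i.e., SMOTE-style synthesis where each minority sample uses its own neighbor count rather than a global k, can be sketched like this. The function name, signature, and the way neighbor counts are supplied are hypothetical; in NaNSMOTE itself the per-sample counts would come from the natural-neighbor search rather than being passed in.

```python
import numpy as np

def variable_k_smote(X_min, neighbor_counts, n_new, rng=None):
    """SMOTE-style interpolation with a per-sample neighbor count
    (illustrative sketch of the NaNSMOTE idea, not the original code).

    Borderline minority samples get small counts (safer, less noisy
    synthesis); central samples get large counts (better generalization).
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # nearest neighbors within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                   # pick a minority seed at random
        k = max(1, neighbor_counts[i])        # this sample's own neighborhood size
        j = order[i, rng.integers(k)]         # one of its k nearest minority neighbors
        gap = rng.random()
        # classic SMOTE interpolation along the segment between seed and neighbor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

With all counts equal this reduces to ordinary SMOTE interpolation; the variable counts are what remove the global parameter k.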
Keywords/Search Tags:Semi-supervised classification, Imbalanced classification, Self-labeled methods, Oversampling methods, Natural neighbors