Feature Selection And Semi-supervised Classification For Imbalanced Data

Posted on: 2018-08-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L M Du
Full Text: PDF
GTID: 1318330542955076
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data is common in real-world applications. Because the rare classes are usually the ones of interest, feature selection should favor features that help identify the minor classes. Labeled samples are also hard to obtain in many practical applications, so it is valuable to make effective use of a few labeled samples together with a large number of unlabeled ones. This dissertation studies feature selection methods and semi-supervised classification algorithms for imbalanced data. It covers four topics: feature selection for imbalanced data based on ReliefF and clustering; feature selection for imbalanced data based on a genetic algorithm (GA); semi-supervised classification for imbalanced data based on evidence theory and Biased-SVM; and semi-supervised feature selection based on GA and Biased-SVM. The main achievements are as follows.

In the first part, motivated by problems in radio signal recognition, a new feature selection approach based on ReliefF and clustering is presented. Building on this approach and the Bagging algorithm, a feature selection method for imbalanced data sets is then proposed. First, Bagging is used to randomly draw multiple sample subsets from the major class, each the same size as the minor class, and each subset is combined with the minor class to form a new training set. Second, a feature subset is obtained from each training set using the approach based on ReliefF and clustering. Finally, the selected feature subset is produced by an ensemble voting mechanism. Experiments show that the proposed algorithm is effective for radio signal recognition in ground-air communication: it reduces the data dimension and also improves the recognition rate of the minor class.

In the second part, an improved GA-based feature selection method for two-class imbalanced data is presented first. The method improves the fitness function and uses SVM as the classifier because of its good classification performance. Evaluated on several benchmark data sets, it outperforms the original GA-based feature selection method: it reduces the feature dimension effectively and improves the precision of the minor class. Applied to a real-world radio signal recognition task in ground-air communication, it again shows comparatively better performance. The method is then extended to multi-class problems. The resulting improved GA-based feature selection method for multi-class imbalanced data replaces global classification accuracy with the evaluation criterion EG-mean in the fitness function. On several benchmark data sets, compared with the traditional GA-based feature selection method, it has certain advantages in feature subset size and improves the precision of the minor classes.

In the third part, a new semi-supervised learning algorithm based on Biased-SVM is proposed for imbalanced data sets that contain a number of unlabeled samples. The algorithm proceeds as follows: first, a Biased-SVM model is trained on the initial labeled sample set; second, the trained model assigns labels to the unlabeled samples; third, the newly labeled samples are added to the initial labeled sample set and the Biased-SVM model is retrained; finally, the classifier's performance is tested. Evidence theory is then introduced to make the assigned labels more stable, giving a semi-supervised classification method for imbalanced data sets based on evidence theory and Biased-SVM. First, the random subspace method is used to generate different views. Second, a Biased-SVM model is trained on the initial labeled samples in each view and applied to the unlabeled samples to obtain probability outputs. Finally, evidence theory is used to stabilize the labels assigned to the unlabeled samples. Experimental results on several public data sets show that, compared with other methods, the proposed approach uses unlabeled samples more effectively and stably, improving G-mean and the minority-class F-value across different proportions of labeled samples.

In the last part, to address both the scarcity of labeled samples and the high feature dimension of imbalanced data, a new semi-supervised feature selection algorithm based on GA and Biased-SVM is proposed. A Biased-SVM model is trained on the initial labeled sample set, the trained model assigns labels to the unlabeled samples, and the newly labeled samples are added to the initial labeled sample set. The optimal feature subset is then selected by the GA-based feature selection method for imbalanced data. Experimental results on several benchmark data sets show that, compared with the semi-supervised feature selection method based on GA and SVM, the proposed method reduces the feature dimension and generally improves the precision of the minor class across different proportions of labeled samples.
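The abstract does not say how evidence theory combines the per-view probability outputs of the Biased-SVM models. As an illustration only, the sketch below uses Dempster's rule of combination, the standard fusion rule of evidence theory, on a two-class frame. The way each probability is turned into a mass function (a fixed `reliability` discount that leaves some mass on "unknown"), and all names such as `mass_from_probability` and `fuse_views`, are assumptions made for this example and are not taken from the dissertation.

```python
from itertools import product

# Two-class frame of discernment; THETA is the whole frame ("unknown").
MINOR = frozenset({"minor"})
MAJOR = frozenset({"major"})
THETA = MINOR | MAJOR


def mass_from_probability(p_minor, reliability=0.9):
    """Turn one view's probability of the minor class into a mass function.

    Assumed scheme (not from the source): discount the probability by
    `reliability` and assign the remaining mass to THETA.
    """
    return {
        MINOR: reliability * p_minor,
        MAJOR: reliability * (1.0 - p_minor),
        THETA: 1.0 - reliability,
    }


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule of combination."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            # Mass that falls on the empty set measures disagreement.
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Normalize by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def fuse_views(p_minor_per_view, reliability=0.9):
    """Fuse the probability outputs of several views into one mass function."""
    masses = [mass_from_probability(p, reliability) for p in p_minor_per_view]
    fused = masses[0]
    for m in masses[1:]:
        fused = dempster_combine(fused, m)
    return fused


if __name__ == "__main__":
    # Hypothetical probabilities of the minor class from three views.
    fused = fuse_views([0.8, 0.7, 0.9])
    label = "minor" if fused[MINOR] > fused[MAJOR] else "major"
    print(label, fused)
```

One possible use, again an assumption rather than the author's procedure, is to add an unlabeled sample to the labeled set only when the fused mass of its predicted class exceeds a chosen threshold, so that samples on which the views disagree are left unlabeled.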
Keywords/Search Tags:Imbalanced data, Feature selection, Genetic algorithm, Bagging method, Semi-supervised learning, Evidence theory, Biased-SVM