Research On Feature Selection Algorithm For High-dimensional Imbalanced Class Data

Posted on: 2018-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Wang
Full Text: PDF
GTID: 2428330566998750
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, data that are both high-dimensional and class-imbalanced have become increasingly common in newer fields such as bioinformatics and satellite imagery. Their complex characteristics pose a serious challenge to data mining research. The class imbalance problem arises when the number of samples differs greatly across categories: the trained classifier becomes biased toward the majority class and ignores the minority class samples, which often carry important information. The high-dimensionality problem is that a large feature space increases classifier complexity and the risk of overfitting, which leads to poor classification results. When preprocessing high-dimensional data, it is therefore important to select a low-dimensional feature subset that is highly relevant to the classification target and has minimal redundancy, so as to improve learning efficiency and classification accuracy. However, when the data are also class-imbalanced, traditional feature selection methods tend to choose feature subsets that favor the majority class, which leads to poor classification performance on minority class samples.

We first introduce SVM-RFE, a traditional wrapper feature selection algorithm based on the support vector machine, and analyze its problems on class-imbalanced data. We then propose an improved algorithm, SSVM-RFE, based on a structural support vector machine that optimizes the F-measure instead of accuracy in order to take class imbalance into account. Feature ranking based on SVM weights reflects only the correlation between features and class labels; it cannot resolve redundancy between features. Therefore, after deleting a large number of irrelevant features with SSVM-RFE, we construct a series of balanced subsets using a class decomposition framework, and we use the Hilbert-Schmidt independence criterion (HSIC) to measure the unbiased correlation between features on these balanced subsets. After that, an improved approximate Markov blanket feature selection method (CBMBFS) for the feature combination problem is proposed to remove the redundant features.

The two-stage feature selection method proposed in this thesis, SSVM-RFE-CBMBFS, accounts for the imbalanced data distribution and can select a set of features that have high discriminative ability and minimal redundancy among themselves. Finally, a series of experiments is carried out: several performance criteria for imbalanced data classification are used to evaluate the classification results of the algorithm, and the results are compared with those of recent algorithms to demonstrate the effectiveness of the proposed method.
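As an illustration of the dependence measure named in the abstract, the following is a minimal sketch of an empirical HSIC estimate between two one-dimensional features. It is not the thesis's implementation: the Gaussian kernel, the bandwidth `sigma`, the `1/n^2` normalization (the common biased estimator), and all function names are assumptions made here for illustration.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 * sigma^2)) for a 1-D feature.

    The Gaussian kernel and the default bandwidth are illustrative assumptions.
    """
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
        for a in values
    ]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two features: trace(K H L H) / n^2.

    Values near zero suggest little dependence between x and y; larger
    values suggest stronger dependence (and hence possible redundancy).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    k_centered = center(gaussian_gram(x, sigma))
    l_gram = gaussian_gram(y, sigma)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace.
    trace = sum(k_centered[i][j] * l_gram[j][i] for i in range(n) for j in range(n))
    return trace / (n * n)


if __name__ == "__main__":
    feature_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    feature_b = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]   # nearly a copy of feature_a
    feature_c = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # constant, carries no information
    print("HSIC(a, b):", hsic(feature_a, feature_b))
    print("HSIC(a, c):", hsic(feature_a, feature_c))
```

In a redundancy-removal step, a score like this could be computed for feature pairs on each balanced subset and then aggregated; how the thesis aggregates the scores is not stated in the abstract.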
Keywords/Search Tags: feature selection, imbalanced data, structural SVM, F-measure optimization, Markov blanket