In machine learning, the class imbalance issue limits classification performance. Class imbalance refers to the situation in which the sample size of the minority class is far smaller than that of the majority class, so that traditional learning models are biased toward the majority class, leading to poor classification performance on the minority class. In recent years, class imbalance has attracted much attention from researchers dealing with real data sets from fields such as medical diagnosis, intrusion detection, and credit rating. At the same time, with the rise of network and communication technology, modern society has entered an era of information explosion. This brings a further problem: raw data contain a great deal of invalid and redundant information. Because data quality directly affects the classification performance of a learning model, how to extract effective information from raw data has become a key problem, and feature selection has therefore become a crucial step in data mining.

In practice, class imbalance and high feature dimensionality often occur together. Past studies have confirmed that, by selecting features more relevant to the minority class, feature selection algorithms can effectively improve the generalization ability of the subsequent learning model on imbalanced data and reduce its time complexity. However, redundant features remain one of the limitations of feature selection algorithms. The main work of this thesis is to study feature selection algorithms for imbalanced data that reduce feature redundancy. Aiming at the class imbalance issue, the curse of dimensionality, and the limitations of LDA-based feature selection algorithms, the main research contents and innovations of this thesis are as follows:

To address the class imbalance problem and reduce feature redundancy, a feature selection algorithm, GRM-DFS, based on global redundancy minimization is proposed. However, the class imbalance
issue is not considered by most existing feature selection algorithms. A regularization of LDA, IR-LDA, which emphasizes the minority class, is proposed to improve classification performance on the minority class. IR-LDA is then combined with the global redundancy minimization algorithm, which not only accounts for the class imbalance problem but also reduces the redundancy of the selected feature subset. Comparative experiments show that the proposed IR-LDA regularization significantly improves classification performance and reduces feature-subset redundancy, and that the proposed GRM-DFS algorithm clearly outperforms the competing algorithms.

To deal with the problems that LDA-based feature selection algorithms face on high-dimensional, imbalanced data, an improved LDA-based feature selection algorithm is proposed. The off-diagonal elements of the within-class scatter matrix of LDA are computed from covariances, which are meant to reflect the relationships between features. Owing to the limitations of covariance, the squared Pearson correlation coefficient replaces covariance in calculating the correlation between features. Then, combining the improved LDA with the L2 sparse norm, a discriminant feature selection method that accounts for the class imbalance problem is proposed, and its effectiveness is verified.

Previous studies have shown that reducing the degree of overlap in the data can effectively improve the performance of classification algorithms on high-dimensional, imbalanced data. The overlap-degree metric is improved by increasing the weight of the minority class when computing the overlap degree and by combining it with a global redundancy minimization algorithm with adaptive parameters. To minimize the redundancy of feature subsets while exploiting the overlap degree, a feature selection method, MODFS, based on the improved overlap degree and global redundancy minimization is proposed. Experiments on imbalanced data
show that the proposed algorithm effectively improves the classification performance of the subsequent learning model.
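The abstract names GRM-DFS but does not give its formulation. Global redundancy minimization is commonly posed as re-ranking an initial feature relevance score vector against a pairwise feature-redundancy matrix; the minimal sketch below illustrates that general idea only. The function name `grm_rerank`, the use of squared Pearson correlation as the redundancy measure, the projected-gradient solver, and the trade-off parameter `lam` are all assumptions, not the thesis's algorithm.

```python
import numpy as np

def grm_rerank(X, scores, lam=0.5, n_iter=500):
    """Re-rank features by trading initial relevance scores against
    pairwise redundancy: minimize x^T A x - lam * scores^T x over x >= 0,
    where A holds squared Pearson correlations between features."""
    A = np.corrcoef(X, rowvar=False) ** 2     # redundancy matrix (diagonal = 1)
    d = X.shape[1]
    x = np.full(d, 1.0 / d)                   # nonnegative feature weights
    step = 0.5 / np.linalg.norm(A, 2)         # safe step for the 2A-Lipschitz gradient
    for _ in range(n_iter):
        grad = 2.0 * A @ x - lam * scores     # gradient of the quadratic objective
        x = np.maximum(x - step * grad, 0.0)  # project back onto x >= 0
    return np.argsort(-x)                     # feature indices, most useful first
```

With equal initial scores, a feature that duplicates another is penalized through the redundancy matrix, so a unique feature rises to the top of the ranking.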
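The second contribution replaces covariance with the squared Pearson correlation coefficient in the off-diagonal entries of the within-class scatter matrix. A minimal sketch of one way such a matrix could be assembled, assuming variances are kept on the diagonal and each class is weighted by its size (details the abstract does not specify):

```python
import numpy as np

def within_class_scatter_pearson(X, y):
    """Within-class scatter whose off-diagonal entries use the squared
    Pearson correlation between features instead of raw covariance,
    putting feature dependence on a scale-free [0, 1] scale.
    This construction is an assumption, not the thesis's exact definition."""
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False)           # ordinary class covariance
        Sc = np.corrcoef(Xc, rowvar=False) ** 2  # squared Pearson correlations
        np.fill_diagonal(Sc, np.diag(cov))       # keep variances on the diagonal
        Sw += len(Xc) * Sc                       # weight each class by its size
    return Sw
```

Because squared correlations lie in [0, 1] regardless of feature scale, the off-diagonal entries are comparable across feature pairs, which is the stated motivation for dropping raw covariance.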
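The third contribution raises the minority-class weight when computing the overlap degree. The abstract does not define the overlap measure, so the sketch below uses a per-feature weighted Fisher ratio as a stand-in; the measure itself, the weighting scheme, and the function name `weighted_overlap` are all assumptions:

```python
import numpy as np

def weighted_overlap(X, y, minority_weight=2.0):
    """Per-feature class-overlap score built from a weighted Fisher
    ratio, with the minority class up-weighted so that overlap involving
    it counts more. Returns values in (0, 1]; higher means more overlap.
    The exact weighting scheme is an illustrative assumption."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    overlap = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j]
        between, within = 0.0, 0.0
        for c, n_c in zip(classes, counts):
            w = minority_weight if c == minority else 1.0
            fc = f[y == c]
            between += w * n_c * (fc.mean() - f.mean()) ** 2
            within += w * n_c * fc.var()
        fisher = between / (within + 1e-12)   # weighted Fisher ratio
        overlap[j] = 1.0 / (1.0 + fisher)     # high separation -> low overlap
    return overlap
```

A feature whose class-conditional distributions are well separated gets a low overlap score, so ranking features by ascending overlap (before redundancy minimization) favors discriminative features, with extra emphasis on separating the minority class.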