Improved Methods Of Oversampling And Feature Selection Based On Imbalanced Data

Posted on: 2022-07-02
Degree: Master
Type: Thesis
Country: China
Candidate: X H Gao
Full Text: PDF
GTID: 2518306311466414
Subject: Applied Mathematics
Abstract/Summary:
With the advent of the big data era, the scale of data has expanded rapidly and data types have become increasingly complex. Most of the data obtained is imbalanced, with very different proportions across categories. The information carried by the minority samples is overwhelmed by the majority samples, which leads to a large number of misclassifications and reduces the predictive ability of classification algorithms, so it is necessary to study effective methods for improving the recognition rate of the minority class. At present, preprocessing methods for imbalanced data mainly include resampling and feature selection. Among resampling techniques, the Borderline-SMOTE oversampling method strengthens the boundary by interpolating among the minority samples on the boundary, but it is prone to generating noisy samples and blurring the boundary between the positive and negative classes. Among feature selection techniques, the feature subsets chosen by filter-based methods still contain some redundant features, while wrapper-based methods are well suited to identifying key features but are slower than filter approaches. In view of these limitations, the main research contents of this thesis are as follows:

Because traditional oversampling methods tend to generate noisy samples and blur the boundary between the positive and negative classes, this thesis proposes a clustering-based oversampling method. The method first clusters the minority boundary samples, dividing the boundary sample set into clusters and restricting oversampling to the boundary samples within each cluster. Different sampling rates are then set according to the density of each cluster, so that the number of newly generated samples is allocated at a fine granularity, and each cluster is oversampled at its own rate. Experiments on six data sets with different degrees of imbalance show that the proposed cluster-based oversampling method performs better than the oversampling methods from previous studies that it is compared with.

Because filter feature selection ignores the interaction with the classification algorithm, and redundant features remain after screening, this thesis proposes a hybrid feature selection method that combines the advantages of the filter and wrapper approaches. First, a filter method ranks the features by importance. Then the wrapper idea is introduced into a sequential forward search over the features to obtain the optimal feature subset. Finally, an integration (ensemble) strategy determines the final prediction result. Comparison with existing feature selection methods on six data sets with different numbers of features shows that the proposed hybrid feature selection method achieves higher model performance.

To address the problem of imbalanced data in breast cancer diagnosis, this thesis applies the proposed clustering-based oversampling method to balance the breast cancer data and then uses the hybrid feature selection method for further feature optimization. The results show that the prediction accuracy of breast cancer risk is improved, which confirms the effectiveness of the proposed methods.
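
The clustering-based oversampling idea summarized above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the clustering algorithm, the density measure, or the interpolation rule, so everything below is an assumption. The sketch uses k-means with farthest-point initialization, gives sparser clusters a larger share of the new samples, and interpolates between two randomly chosen members of the same cluster. It also assumes the minority boundary samples have already been identified (for example with the neighbor rule used by Borderline-SMOTE). The function names `kmeans`, `allocate`, and `cluster_oversample` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    # Farthest-point initialisation: deterministic, spreads the centers out.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def allocate(clusters, n_new):
    """Split n_new across clusters; sparser clusters get more (an assumption)."""
    weights = []
    for members in clusters:
        center = tuple(sum(x) / len(members) for x in zip(*members))
        spread = sum(math.dist(p, center) for p in members) / len(members)
        density = len(members) / (spread + 1e-12)
        weights.append(1.0 / density)
    total = sum(weights)
    counts = [int(n_new * w / total) for w in weights]
    # Give the rounding remainder to the clusters with the largest weights.
    remainder = max(0, n_new - sum(counts))
    for i in sorted(range(len(weights)), key=lambda i: -weights[i])[:remainder]:
        counts[i] += 1
    return counts


def cluster_oversample(boundary, n_new, n_clusters=2, seed=0):
    """Generate n_new synthetic points, each inside a single boundary cluster."""
    rng = random.Random(seed)
    labels = kmeans(boundary, n_clusters)
    clusters = [[p for p, lab in zip(boundary, labels) if lab == c]
                for c in range(n_clusters)]
    # Interpolation needs two members, so smaller clusters are skipped.
    clusters = [m for m in clusters if len(m) >= 2]
    if not clusters:
        raise ValueError("no cluster has at least two boundary samples")
    synthetic = []
    for members, count in zip(clusters, allocate(clusters, n_new)):
        for _ in range(count):
            a, b = rng.sample(members, 2)
            t = rng.random()
            # New point on the segment between two same-cluster samples.
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Because both endpoints of every interpolation come from the same cluster, no synthetic sample is placed in the gap between two separate groups of boundary samples, which is the effect the abstract attributes to restricting the oversampling area.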
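
The filter-then-wrapper search described for the hybrid feature selection method can be sketched in a similar hedged way. The abstract does not specify the filter criterion, the classifier used in the wrapper stage, or the acceptance rule, so the choices here are assumptions made only for illustration: the filter score is the gap between class means scaled by the feature's standard deviation, the wrapper score is the leave-one-out accuracy of a 1-nearest-neighbor classifier, and a feature is kept only if it strictly improves that accuracy. The final integration step mentioned in the abstract is not covered. The names `filter_score`, `loo_accuracy`, and `hybrid_select` are hypothetical.

```python
import math
from statistics import mean, pstdev


def filter_score(column, y):
    """Filter criterion (assumed): class-mean gap scaled by the feature spread."""
    pos = [v for v, label in zip(column, y) if label == 1]
    neg = [v for v, label in zip(column, y) if label == 0]
    return abs(mean(pos) - mean(neg)) / (pstdev(column) + 1e-12)


def loo_accuracy(X, y, features):
    """Wrapper criterion (assumed): leave-one-out accuracy of a 1-NN classifier."""
    hits = 0
    for i, row in enumerate(X):
        query = tuple(row[f] for f in features)
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(query, tuple(X[j][f] for f in features)),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)


def hybrid_select(X, y):
    """Rank features with the filter, then add them greedily with the wrapper."""
    n_features = len(X[0])
    # Filter stage: order features from highest to lowest score.
    ranked = sorted(
        range(n_features),
        key=lambda f: filter_score([row[f] for row in X], y),
        reverse=True,
    )
    # Wrapper stage: walk the ranking and keep a feature only if it helps.
    selected, best = [], 0.0
    for f in ranked:
        score = loo_accuracy(X, y, selected + [f])
        if score > best:
            selected, best = selected + [f], score
    return selected, best
```

Under the strict-improvement rule, a feature that duplicates an already selected one cannot raise the wrapper score and is therefore skipped, which is one simple way to drop the redundant features that a pure filter ranking leaves in.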
Keywords/Search Tags: Imbalanced data, Oversampling method, Feature selection, Breast cancer diagnosis