
Research On Imbalanced Data Undersampling Classification Based On Constructive Covering

Posted on: 2021-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: R Q Liu
GTID: 2428330629480118
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, machine learning classification has been widely used in many fields of daily life. By analyzing the available data and constructing a model, classification algorithms can predict labels for unlabeled data. Many traditional classification algorithms exist, such as support vector machines, decision trees, and neural networks. Most of these algorithms are designed for balanced data. In practice, however, most datasets are imbalanced. When these algorithms build a model on imbalanced data, they tend to be overwhelmed by the majority class and ignore the minority class. Therefore, while ensuring the overall classification accuracy on the dataset, it is important to pay more attention to the minority class samples.

Research on imbalanced data classification mainly proceeds at two levels: the dataset and the algorithm. At the dataset level, oversampling improves the classification performance of the minority class by adding samples, while undersampling makes the classification algorithm pay more attention to the minority class by deleting majority class samples. A large number of studies have shown that these methods can improve the classification performance of the minority class. However, the distribution of samples has not been fully considered. In view of this, this thesis studies how to improve the classification performance of imbalanced data by using the spatial distribution of samples. The main research work is summarized as follows:

(1) A data cleaning method (SMOTE+CCA) based on the Constructive Covering Algorithm (CCA) is proposed. First, the synthetic minority oversampling technique (SMOTE) is applied to generate new minority samples. Then, CCA is used to detect the hard-to-learn samples. Finally, a pair-wise deletion strategy is proposed to remove the hard-to-learn samples. Detecting and deleting the hard-to-learn samples reduces the complexity of the dataset and improves the classification performance. The effectiveness of the method is verified by experiments.

(2) An undersampling method (SDUS) based on the Constructive Covering Algorithm (CCA) is proposed. First, CCA is applied to explore the imbalanced pattern of the original data space, yielding a group of sphere neighborhoods. Two sample selection strategies are then proposed from different viewpoints (SDUS1 is diversity-based sample selection and SDUS2 is cosine-similarity-based sample selection). SDUS1 selects samples by weighted random sampling according to the diversity of each sphere neighborhood. SDUS2 divides each sphere neighborhood into four parts based on cosine similarity and selects samples according to the number of samples in each part. Finally, the model is built by ensemble learning, and the effectiveness of the method is verified by experiments.
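The two building blocks named in the abstract, SMOTE interpolation and sphere-neighborhood construction, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names `smote_sample` and `build_covers`, the use of Euclidean distance, and the radius rule (midpoint between the nearest different-class sample and the farthest same-class sample inside that distance) are assumptions chosen for illustration. The abstract does not specify how hard-to-learn samples are identified, how pair-wise deletion works, or how SDUS1 and SDUS2 score a neighborhood, so those steps are not sketched.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)                      # assumes at least 2 minority samples
    k = min(k, n - 1)
    # Pairwise Euclidean distances among minority samples.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)         # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))        # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

def build_covers(X, y, rng=None):
    """Simplified covering construction (assumed Euclidean variant):
    repeatedly pick an uncovered sample as a center and grow a sphere
    that contains only samples of the center's class."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    uncovered = np.ones(len(X), dtype=bool)
    covers = []
    while uncovered.any():
        c = rng.choice(np.flatnonzero(uncovered))
        dist = np.linalg.norm(X - X[c], axis=1)
        same = y == y[c]
        # d1: distance to the nearest sample of a different class.
        d1 = dist[~same].min() if (~same).any() else np.inf
        # d2: distance to the farthest same-class sample closer than d1.
        in_reach = same & (dist < d1)
        d2 = dist[in_reach].max() if in_reach.any() else 0.0
        radius = d2 if np.isinf(d1) else (d1 + d2) / 2.0
        members = np.flatnonzero(same & (dist <= radius))
        uncovered[members] = False      # the center is always a member
        covers.append({"center": int(c), "label": y[c],
                       "radius": float(radius), "members": members})
    return covers
```

Each returned cover records its center index, class label, radius, and member indices. A selection strategy in the spirit of SDUS would then draw majority samples from the majority-class covers, but the weighting used in the thesis is not given in the abstract.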
Keywords/Search Tags: Imbalanced Data Classification, Oversampling, Undersampling, Constructive Covering Algorithm