Research On Imbalanced Data Undersampling Classification Based On Constructive Covering

Posted on: 2021-04-09

Degree: Master

Type: Thesis

Country: China

Candidate: R Q Liu

GTID: 2428330629480118

Subject: Computer Science and Technology

Abstract/Summary:

With the rapid development of Internet technology, machine learning classification has been widely used in many fields of daily life. By analyzing the available data and constructing a model, classification algorithms can predict labels for unlabeled data. There are many traditional classification algorithms, such as support vector machines, decision trees, and neural networks. Most of these algorithms are designed for balanced data. In practice, however, most datasets are imbalanced. When these algorithms build a model on imbalanced data, they tend to be overwhelmed by the majority class and ignore the minority class. Therefore, while maintaining overall classification accuracy on the dataset, it is important to pay more attention to the minority class samples.

Research on imbalanced data classification mainly works at two levels: the dataset and the algorithm. At the dataset level, oversampling improves the classification performance of the minority class by adding samples, while undersampling makes the classification algorithm pay more attention to the minority class by deleting majority class samples. A large number of studies have shown that these methods can improve the classification performance of the minority class. However, they do not fully consider the distribution of the samples. In view of this, this thesis studies how to improve the classification performance of imbalanced data by using the spatial distribution of samples. The main contributions are summarized as follows:

(1) A data cleaning method (SMOTE+CCA) based on the Constructive Covering Algorithm (CCA) is proposed. First, the synthetic minority oversampling technique (SMOTE) is applied to generate new minority samples. Then, CCA is used to detect the hard-to-learn samples. Finally, a pair-wise deletion strategy is proposed to remove the hard-to-learn samples. Detecting and deleting the hard-to-learn samples reduces the complexity of the dataset and improves classification performance. The effectiveness of the method is verified by experiments.

(2) An undersampling method (SDUS) based on the Constructive Covering Algorithm (CCA) is proposed. First, CCA is applied to explore the imbalanced pattern of the original data space, which yields a group of sphere neighborhoods. Two sample selection strategies are then proposed from different viewpoints (SDUS₁ is diversity-based sample selection and SDUS₂ is cosine-similarity-based sample selection). SDUS₁ selects samples by weighted random sampling according to the diversity of each sphere neighborhood. SDUS₂ divides each sphere neighborhood into four parts based on cosine similarity and selects samples according to the number of samples in each part. Finally, the model is built with ensemble learning, and the effectiveness of the method is verified by experiments.
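Both methods rely on CCA to partition the data space into sphere neighborhoods. The abstract does not give the construction details, so the following is only a minimal sketch of one common formulation, under several assumptions: Euclidean distance is used in place of the inner product on a projected sphere, the radius is the midpoint between the farthest enclosed same-class sample and the nearest different-class sample, and each sample is assigned to the first neighborhood that covers it. The function name `constructive_covering` and its return format are hypothetical, not taken from the thesis.

```python
import numpy as np


def constructive_covering(X, y, seed=0):
    """Build sphere neighborhoods (covers) over labeled samples.

    Assumed simplification: Euclidean distance, midpoint radius.
    Returns a list of (center_index, radius, label, member_indices).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    uncovered = np.ones(len(X), dtype=bool)
    covers = []
    while uncovered.any():
        # Pick a random sample that no cover contains yet as the center.
        c = rng.choice(np.flatnonzero(uncovered))
        dist = np.linalg.norm(X - X[c], axis=1)
        same = y == y[c]
        # d1: distance to the nearest sample of a different class.
        d1 = dist[~same].min() if (~same).any() else np.inf
        # d2: distance to the farthest same-class sample closer than d1.
        inside = same & (dist < d1)
        d2 = dist[inside].max() if inside.any() else 0.0
        radius = d2 if np.isinf(d1) else (d1 + d2) / 2.0
        # Assign only still-uncovered same-class samples, so the covers
        # partition the data (an assumption made for this sketch).
        members = np.flatnonzero(same & uncovered & (dist <= radius))
        covers.append((int(c), float(radius), y[c].item(), members))
        uncovered[members] = False
    return covers


if __name__ == "__main__":
    # Toy data: a majority class (0) and a small minority class (1).
    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [10, 10]])
    y = np.array([0, 0, 0, 1, 1, 0])
    for center, radius, label, members in constructive_covering(X, y):
        print(center, round(radius, 2), label, members.tolist())
```

Each neighborhood contains samples of a single class, so its size and position describe the local class distribution. A selection step such as SDUS₁ or SDUS₂ would then choose majority samples per neighborhood; the abstract does not define the diversity measure or the four-part split, so those steps are left out here.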

Keywords/Search Tags:

Imbalanced Data Classification, Oversampling, Undersampling, Constructive Covering Algorithm

Related items

1	Research On Imbalanced Data Oversampling Classification Based On Constructive Covering Algorithm
2	Research On Classification Algorithm For Imbalanced Data
3	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
4	Research Of Imbalanced Datasets Preprocessing Combined With Clustering
5	Research On Neighborhood-aware Imbalanced Data Sampling Classification
6	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
7	Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling
8	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm
9	Research On Under-sampling Classification Method Of Unbalanced Data
10	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets