Research Of Imbalanced Datasets Preprocessing Combined With Clustering

Posted on:2017-02-18

Degree:Master

Type:Thesis

Country:China

Candidate:J Jiang

GTID:2348330533950177

Subject:Computer technology

Abstract/Summary:

Imbalanced data, in which the classes are unevenly distributed, is widespread in everyday life. When a traditional classification algorithm is applied to imbalanced data, it often fails to separate the minority class from the rest of the dataset, so how to identify minority class samples requires further study. Traditional evaluation measures also emphasize performance on the data as a whole. Because the minority class accounts for only a small part of the data, overall performance can look good even when the minority class is ignored entirely, so these measures cannot reflect how well the minority class is classified. Meanwhile, people tend to care most about classification performance on the minority class. The classification of imbalanced data therefore deserves further study.

At present, there are two ways to address the problem of classifying imbalanced data: data preprocessing and improved classification algorithms. This thesis introduces two data preprocessing strategies for imbalanced data.

1. The study of undersampling the majority class introduces fuzzy clustering membership. Majority class samples are reduced by weight according to each sample's membership with respect to the cluster centers. In this way, the imbalance of the dataset is reduced while as much of the original information as possible is preserved. Experimental results show that classification performance on imbalanced data improves after applying the algorithm.

2. The thesis proposes a classification algorithm for imbalanced data based on a clustering ensemble and an oversampling strategy. A clustering consistency coefficient is introduced to find the border samples of the minority class. The border samples are then oversampled with an improved SMOTE algorithm, which makes the distribution of the newly generated samples more random.
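To make the idea of oversampling border samples concrete, here is a minimal sketch of standard SMOTE-style interpolation applied only to a chosen subset of minority samples. It is not the thesis's improved algorithm, whose details the abstract does not give. The function name `smote_on_border`, its parameters, and the toy data are all hypothetical, and the border indices are simply passed in as an assumption rather than derived from a clustering consistency coefficient.

```python
import math
import random


def smote_on_border(minority, border_idx, k=3, n_new=5, seed=0):
    """Generate synthetic minority points by SMOTE-style interpolation.

    minority   : list of feature vectors (lists of floats), minority class only
    border_idx : indices of samples treated as border samples (assumed given)
    k          : number of nearest minority neighbours considered per seed point
    n_new      : number of synthetic samples to generate
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a border sample to act as the seed point.
        i = rng.choice(border_idx)
        x = minority[i]
        # Find the k nearest minority neighbours of x, excluding x itself.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment x -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class; indices 3 and 4 are a hypothetical set of border samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0], [3.1, 2.8]]
border_idx = [3, 4]
new_points = smote_on_border(minority, border_idx, k=2, n_new=4)
```

Each synthetic point lies on the line segment between a border sample and one of its nearest minority neighbours, so the new samples stay within the region already covered by the minority class. A variant aiming at a more random distribution of new samples, as the abstract describes, would change how the neighbour or the interpolation position is chosen.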

Keywords/Search Tags:

imbalanced data, classification, cluster, oversampling, undersampling

Related items

1	Research On Classification Algorithm For Imbalanced Data
2	Research On Imbalanced Data Undersampling Classification Based On Constructive Covering
3	Research On Under-sampling Algorithm For Imbalanced Data Based On Clustering And Its Application
4	Research On Neighborhood-aware Imbalanced Data Sampling Classification
5	Neural Network Approaches For Imbalanced Data Classification
6	Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets
7	Research On Under-sampling Classification Method Of Unbalanced Data
8	Research Of Imbalanced Data Ensemble Classification Algorithm Based On Oversampling
9	Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm