
Research Of Imbalanced Datasets Preprocessing Combined With Clustering

Posted on: 2017-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Jiang
Full Text: PDF
GTID: 2348330533950177
Subject: Computer technology
Abstract/Summary:
Imbalanced data, in which the classes are unevenly distributed, is widespread and closely tied to everyday applications. When traditional classification algorithms are applied to such data, they often fail to separate the minority class from the rest of the dataset, so how to identify minority class samples requires further study. Traditional performance measures also emphasize the dataset as a whole. Because the minority class accounts for only a small part of the data, a classifier can score well overall even if it ignores the minority class, so the traditional measures cannot reflect how well minority class samples are classified. Meanwhile, people tend to care most about the classification performance on the minority class. The classification of imbalanced data therefore deserves further study.

At present there are two ways to address the problem of classifying imbalanced data: data preprocessing and improved classification algorithms. This thesis introduces two data preprocessing strategies for imbalanced data.

1. Undersampling of the majority class using fuzzy clustering memberships. According to the membership of each sample to the cluster centers, a weighted reduction is applied to the majority class samples. In this way the imbalance of the dataset is reduced while the original information is preserved as far as possible. The experimental results show that classification performance on imbalanced data improves after applying the algorithm.

2. A classification algorithm for imbalanced data based on clustering ensembles and an oversampling strategy. By introducing a clustering consistency coefficient, the border samples of the minority class can be found. These border samples are then oversampled with an improved SMOTE algorithm, which makes the distribution of the newly generated samples more random.
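
The two preprocessing ideas can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the weighting rule, the clustering consistency coefficient, or the SMOTE modification. The sketch assumes the standard fuzzy c-means membership formula with fuzzifier m, keeps the majority samples with the highest peak membership (an assumed selection rule), and uses plain SMOTE-style interpolation on a border set that the caller supplies. All function names and parameters are hypothetical.

```python
import math
import random


def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point to each cluster center."""
    dists = [math.dist(point, c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # A point lying exactly on a center belongs fully to that center.
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


def undersample_majority(majority, centers, keep, m=2.0):
    """Keep the `keep` majority samples with the highest peak membership.

    Assumed rule for illustration only: samples that clearly belong to one
    cluster are treated as the most representative and are retained.
    """
    ranked = sorted(
        majority,
        key=lambda p: max(fuzzy_memberships(p, centers, m)),
        reverse=True,
    )
    return ranked[:keep]


def smote_like(border, minority, n_new, k=3, rng=None):
    """Generate n_new synthetic points by plain SMOTE-style interpolation.

    `border` is the set of minority border samples, assumed to be found
    beforehand. `minority` must contain at least two distinct points.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(border)
        # k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

In practice the cluster centers would come from a fuzzy clustering run on the majority class, and the border set from whatever criterion is in use, such as the consistency coefficient described above.
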
Keywords/Search Tags: imbalanced data, classification, cluster, oversampling, undersampling