Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets

Posted on:2017-01-05

Degree:Master

Type:Thesis

Country:China

Candidate:X Yan

Full Text:PDF

GTID:2308330485491534

Subject:Computer Science and Technology

Abstract/Summary:

Imbalanced data learning is currently an active research field in machine learning. Its basic requirement is to maximize classification accuracy on the minority class without sacrificing classification accuracy on the majority class. This thesis studies the particular characteristics of imbalanced data. Based on the distribution of the data and on the role that each attribute of a real data set plays in classification, it improves the resampling methods commonly used for imbalanced data and proposes new resampling methods that address the shortcomings of the traditional ones. To secure recognition performance on the minority class, the ensemble classifier is improved correspondingly, which yields a complete classification learning system for imbalanced data sets.

First, the thesis proposes an undersampling method for imbalanced data sets based on the density distribution of the data. The algorithm introduces the notion of data density and divides the majority-class data into high-density clusters and low-density clusters. A different resampling strategy is applied to each kind of cluster in order to improve the balance of the data.
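
The density-based undersampling idea can be illustrated with a minimal sketch. The abstract does not state how density is measured, where the high/low boundary lies, or which strategy is applied to each group, so everything specific here is an assumption made for illustration: density is taken as the inverse of the mean distance to the k nearest majority-class neighbours, the groups are split at the median density, and a fixed share of the kept samples is drawn from each group. The names `knn_density`, `density_undersample`, and `high_share` are hypothetical.

```python
import math
import random


def knn_density(points, k=3):
    """Return one density value per point (assumed definition):
    the inverse of the mean distance to its k nearest neighbours."""
    if len(points) < 2:
        raise ValueError("need at least two points to estimate density")
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        mean_dist = sum(nearest) / len(nearest)
        densities.append(1.0 / (mean_dist + 1e-12))  # avoid division by zero
    return densities


def density_undersample(majority, n_keep, k=3, high_share=0.5, seed=0):
    """Keep at most n_keep majority samples, sampling the dense and the
    sparse group separately (the per-group strategy is an assumption)."""
    rng = random.Random(seed)
    dens = knn_density(majority, k)
    cutoff = sorted(dens)[len(dens) // 2]  # median split (assumed)
    high = [p for p, d in zip(majority, dens) if d >= cutoff]
    low = [p for p, d in zip(majority, dens) if d < cutoff]
    n_high = min(len(high), round(n_keep * high_share))
    n_low = min(len(low), n_keep - n_high)
    return rng.sample(high, n_high) + rng.sample(low, n_low)
```

A call such as `density_undersample(majority_points, n_keep=len(minority_points))` would shrink the majority class toward the minority class size. Lowering `high_share` thins dense regions more aggressively, which is one plausible reading of treating the two kinds of cluster differently.
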
Experiments on six UCI data sets, with C4.5 and SVM as classifiers, compare the method with random undersampling, KNN-NearMiss, and similar methods. The results show that the method classifies imbalanced data better and effectively improves the classifier's recognition of the minority class.

Second, by studying the different roles that attributes play in identifying minority-class samples, the thesis divides attributes into dominant attributes and hidden attributes. Dominant attributes are biased toward the minority class and provide reliable and sufficient information about its samples. Hidden attributes are biased toward the majority class and interfere with recognition of the minority class. A different replication strategy is therefore used for each kind of attribute, which improves the quality of the synthesized minority-class samples. Experiments on six UCI data sets, with SMOTE, random sampling, and similar methods as baselines, show that resampling the imbalanced data sets in this way improves classification learning and raises the recognition rate of the minority class.

Finally, to further improve the recognition rate of the minority class, the thesis studies the DataBoost method for imbalanced data sets in depth. To address its excessive emphasis on hard-to-classify samples, a new ensemble classification method is proposed. At each iteration, the hard-to-classify samples are taken as seed samples, the seed samples are used to generate synthetic data, and the synthetic data are added to the training samples. The resulting training set is then used to train the next classifier.
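
One iteration of the seed-based step in the ensemble method can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: "hard" samples are taken to be those the current classifier misclassifies, and synthetic data are produced by adding small Gaussian noise to each seed. The abstract specifies neither choice. The names `hard_seeds`, `synthesize`, `augment_round`, `per_seed`, and `jitter` are hypothetical, and `predict` stands for whatever classifier the current round produced.

```python
import random


def hard_seeds(samples, labels, predict):
    """Seeds are the samples the current classifier gets wrong (assumed)."""
    return [(x, y) for x, y in zip(samples, labels) if predict(x) != y]


def synthesize(seeds, per_seed=2, jitter=0.05, seed=0):
    """Create per_seed noisy copies of every seed sample.
    Gaussian jitter is an assumption, not the thesis's generator."""
    rng = random.Random(seed)
    synthetic = []
    for x, y in seeds:
        for _ in range(per_seed):
            noisy = tuple(v + rng.gauss(0.0, jitter) for v in x)
            synthetic.append((noisy, y))
    return synthetic


def augment_round(samples, labels, predict, per_seed=2):
    """Return the training set for the next classifier: the original
    samples plus synthetic data grown from this round's hard samples."""
    synth = synthesize(hard_seeds(samples, labels, predict), per_seed)
    new_samples = list(samples) + [x for x, _ in synth]
    new_labels = list(labels) + [y for _, y in synth]
    return new_samples, new_labels
```

In a full ensemble, each round would train a classifier on the augmented set, pass its prediction function back into `augment_round`, and combine the resulting classifiers by weighted voting.
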

Keywords/Search Tags:

Machine Learning, Imbalanced Data, Resampling, Ensemble Learning

Related items

1	Research On Imbalanced Data Classification Algorithms Based On Ensemble Learning
2	A Study Of Ensemble Learning Method For Imbalanced Data Classification And Its Applications
3	Classification In Imbalanced Data Based On Over-Sampling And Ensemble Learning
4	Research On Ensemble Learning Approaches To Imbalanced Data Sets
5	Research On Churn Prediction Of Credit Card Customers Based On Resampling And Ensemble Learning
6	Research On Data Resampling Technology For Imbalanced Data Classification
7	Research On Imbalanced Data Classification Methods Based On Resampling And Ensemble Learning
8	Hybrid Ensemble Learning For Imbalanced Data
9	Research On Ensemble Learning Algorithm For Imbalanced Data
10	Research On The Imbalanced Data Learning