Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets

Posted on: 2017-01-05
Degree: Master
Type: Thesis
Country: China
Candidate: X Yan
Full Text: PDF
GTID: 2308330485491534
Subject: Computer Science and Technology
Abstract/Summary:
Imbalanced data learning is currently an active research area in machine learning. The basic requirement of classification on imbalanced data sets is to maximize the classification accuracy of the minority class without sacrificing the classification accuracy of the majority class. This thesis studies the particular characteristics of imbalanced data. Based on the data distribution and on the role that each attribute of an actual data set plays in classification, it improves the resampling methods commonly used for imbalanced data and proposes new resampling methods that address shortcomings of the traditional ones. To ensure recognition performance on the minority class, the ensemble classifier is improved correspondingly, yielding a complete classification learning system for imbalanced data sets.

First, the thesis proposes an undersampling method for imbalanced data sets based on the data density distribution. The algorithm introduces the concept of data density and divides the majority-class data into high-density clusters and low-density clusters. A different resampling strategy is applied to each kind of cluster in order to improve the balance of the data.
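The density-based undersampling idea can be illustrated with a minimal sketch. The abstract does not say how density is measured or which strategy each cluster receives, so everything below is an assumption made for illustration: density is taken as the inverse of the mean distance to the k nearest majority-class neighbours, the split is made at the median density, and the hypothetical parameter `high_share` sets how many of the kept samples come from the high-density group. The function names are hypothetical as well.

```python
import math
import random


def knn_density(points, k=3):
    """Density of each point: inverse of the mean distance to its k nearest
    neighbours (an assumed density measure, not the thesis definition)."""
    if len(points) < 2:
        return [1.0] * len(points)
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_d = sum(dists[:k]) / min(k, len(dists))
        densities.append(1.0 / (mean_d + 1e-12))
    return densities


def density_undersample(majority, n_keep, k=3, high_share=0.5, seed=0):
    """Split majority samples into high- and low-density groups at the median
    density, then draw a different number of samples from each group."""
    rng = random.Random(seed)
    dens = knn_density(majority, k)
    # Indices sorted from densest to sparsest.
    order = sorted(range(len(majority)), key=lambda i: dens[i], reverse=True)
    half = len(order) // 2
    high, low = order[:half], order[half:]
    # Assumed strategy: a fixed share of the kept samples comes from each group.
    n_high = min(len(high), round(n_keep * high_share))
    n_low = min(len(low), n_keep - n_high)
    kept = rng.sample(high, n_high) + rng.sample(low, n_low)
    return [majority[i] for i in sorted(kept)]
```

Calling `density_undersample(majority, n_keep=len(minority))` would reduce the majority class to the size of the minority class. Lowering `high_share` keeps more of the sparse samples, which is one possible way to apply a different strategy to each density group.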
The density-based undersampling method is verified on 6 UCI data sets with C4.5 and SVM as classifiers and compared with methods such as random undersampling and KNN-NearMiss. The experimental results show that it classifies imbalanced data better and effectively improves the classifier's recognition of the minority class.

Second, by studying the different roles that attributes play in identifying minority-class samples, the thesis divides attributes into dominant attributes and hidden attributes. Dominant attributes are biased toward the minority class and provide reliable and sufficient information about its samples. Hidden attributes are biased toward the majority class and interfere with recognition of the minority class. Different replication strategies are therefore used for the different kinds of attributes, which improves the quality of the synthesized minority-class samples. On 6 UCI data sets, in comparison with methods such as SMOTE and random sampling, the experimental results show that classification learning on the resampled data improves and that the recognition rate of the minority class increases.

Finally, to further improve the recognition rate of the minority class, the thesis studies the Databoost method for imbalanced data sets in depth and, to address its excessive emphasis on hard-to-classify samples, proposes a new ensemble classification method. At each iteration the method takes the hard-to-classify samples as seed samples, uses the seed samples to generate synthetic data, and adds the synthetic data to the training samples, producing a new training set on which the next classifier is trained.
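One round of the seed-sample step (hard samples, then synthetic data, then an enlarged training set) might look like the sketch below. It is not the thesis algorithm: the abstract does not state how hard samples are chosen or how synthetic data is generated, so the sketch assumes that hard samples are those the current classifier misclassifies and that synthetic points are made by SMOTE-style interpolation toward a random sample of the same class. All function names and the `per_seed` parameter are hypothetical.

```python
import random


def pick_seeds(X, y, predict):
    """Indices of samples the current classifier misclassifies
    (assumed definition of a hard-to-classify sample)."""
    return [i for i, (x, label) in enumerate(zip(X, y)) if predict(x) != label]


def synthesize(X, y, seed_idx, per_seed=2, rng=None):
    """Create synthetic samples by interpolating each seed toward a random
    training sample of the same class (SMOTE-style, an assumption)."""
    rng = rng or random.Random(0)
    new_X, new_y = [], []
    for i in seed_idx:
        same = [j for j in range(len(X)) if y[j] == y[i] and j != i]
        if not same:
            continue  # no same-class partner to interpolate with
        for _ in range(per_seed):
            j = rng.choice(same)
            t = rng.random()
            new_X.append(tuple(a + t * (b - a) for a, b in zip(X[i], X[j])))
            new_y.append(y[i])
    return new_X, new_y


def augment_round(X, y, predict, per_seed=2, rng=None):
    """One iteration: find seeds, generate synthetic data, and return the
    enlarged training set for the next classifier."""
    seeds = pick_seeds(X, y, predict)
    new_X, new_y = synthesize(X, y, seeds, per_seed, rng)
    return X + new_X, y + new_y
```

In an ensemble loop, each classifier would be trained on the set returned by `augment_round` using the previous classifier's `predict`, and the trained classifiers would then be combined by voting.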
Keywords/Search Tags:Machine Learning, Imbalanced Data, Resampling, Ensemble Learning