
Research On Ensemble Learning Approaches To Imbalanced Data Sets

Posted on: 2011-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Wang
GTID: 2178360308965586
Subject: Computer software and theory
Abstract/Summary:
Classification is one of the most important tasks in machine learning. Classification methods usually assume by default that each class contains a similar number of instances, and traditional methods aim to maximize the overall classification accuracy on the whole dataset. In many practical problems, however, the classes are imbalanced, and classifying the minority class correctly becomes important. On such datasets, traditional learning algorithms tend to achieve high predictive accuracy on the majority class but poor predictive accuracy on the minority class, and the cost of such misclassification can be tremendous. Because this kind of problem is very common, classifying imbalanced datasets has become a focal point of machine learning and pattern recognition research, and a major challenge for traditional classifiers.

At present, research on the class imbalance problem focuses mainly on two aspects: dataset processing and improvement of classification methods. Dataset processing reconstructs a dataset by resampling, reducing its degree of imbalance by changing the distribution of the original data; over-sampling and under-sampling are the most widely used methods. On the algorithm side, new algorithms have been proposed to improve existing classification approaches, such as cost-sensitive learning and Boosting methods. Some researchers also recommend combining the two kinds of method. Research along these directions has achieved a great deal in imbalanced data classification, but many issues still affect its reliability and stability, such as overfitting and the loss of important information.
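As a rough illustration of the two resampling ideas mentioned above, the following Python sketch performs random under-sampling and random over-sampling on a binary dataset. It is not taken from the thesis: the function name `resample_binary`, its parameters, and the choice of purely random selection are assumptions made for illustration, and it presumes the minority class is non-empty and no larger than the majority class.

```python
import random


def resample_binary(samples, labels, minority_label, mode="under", seed=0):
    """Balance a binary dataset by random under- or over-sampling.

    Hypothetical helper for illustration only; assumes the minority class
    is non-empty and not larger than the majority class.
    """
    rng = random.Random(seed)
    pairs = list(zip(samples, labels))
    minority = [p for p in pairs if p[1] == minority_label]
    majority = [p for p in pairs if p[1] != minority_label]

    if mode == "under":
        # Keep only as many majority instances as there are minority ones.
        majority = rng.sample(majority, len(minority))
    elif mode == "over":
        # Duplicate random minority instances until both classes match.
        extra = len(majority) - len(minority)
        minority = minority + [rng.choice(minority) for _ in range(extra)]
    else:
        raise ValueError("mode must be 'under' or 'over'")

    combined = minority + majority
    rng.shuffle(combined)
    new_samples = [x for x, _ in combined]
    new_labels = [y for _, y in combined]
    return new_samples, new_labels


if __name__ == "__main__":
    xs = list(range(10))
    ys = [1, 1] + [0] * 8  # 2 minority vs 8 majority instances
    _, under_labels = resample_binary(xs, ys, minority_label=1, mode="under")
    _, over_labels = resample_binary(xs, ys, minority_label=1, mode="over")
    print(under_labels.count(1), under_labels.count(0))
    print(over_labels.count(1), over_labels.count(0))
```

Under-sampling discards majority instances and so may lose important information, while over-sampling by duplication repeats minority instances and so may encourage overfitting, which matches the open issues noted above.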
So, while preserving classification accuracy on the majority class, how to improve recognition accuracy on the minority class is an important research topic. To enhance classification accuracy on the minority class, this thesis proposes several novel approaches that process the datasets and modify the classification algorithms. The main contributions of this dissertation are summarized as follows:

1. Borrowing the idea of a cascade structure, the thesis proposes a new method named CasBagging, which applies Bagging classification on the basis of a cascade structure to handle class imbalance problems. At each cascade node the method eliminates part of the majority-class data, so that the dataset approaches class balance by the end of the cascade. The training data finally obtained are used to construct a classifier by Bagging. The individual classifiers obtained at the cascade nodes are combined into an ensemble and used to classify the data. Experimental results on 10 UCI datasets show that this method outperforms Bagging and AdaBoost.

2. When Neural Networks (NN) are used to handle class imbalance problems, the minority class contributes less to the error function than the majority class, so the learned network is biased toward recognizing majority-class data, which is of less interest. Using a newly defined error function in BP, this thesis proposes a novel algorithm, WNN. Experiments on 20 UCI datasets show that the approach effectively improves the recognition rate for minority-class data.

3. A new method named NNSMOTE is also proposed, which differs from the SMOTE algorithm. It employs the idea of nonlinear interpolation and constructs a neural network to generate synthesized instances for the minority class.
For each minority class instance, we first find its k nearest neighbors, then feed the neighbors to a neural network, which is trained to yield a new instance that fits the neighbors well; finally, the new instance is added to the dataset as a synthesized instance.
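For contrast with the nonlinear approach described above, the sketch below shows the linear interpolation step of standard SMOTE, which NNSMOTE is said to differ from. It is not the thesis's NNSMOTE: no neural network is involved, and the names `k_nearest` and `smote_like` and their parameters are hypothetical. It assumes at least two minority instances, each given as a tuple of numeric features.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote_like(minority, k=3, n_new=5, seed=0):
    """Generate synthetic minority instances by linear interpolation.

    Standard SMOTE-style sketch, shown only as a baseline; NNSMOTE would
    replace the interpolation line with a neural network fitted to the
    neighbors. Assumes `minority` holds at least two feature tuples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        candidates = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, candidates, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


if __name__ == "__main__":
    minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_like(minority_points, k=2, n_new=3):
        print(point)
```

Every synthetic point here lies on a straight segment between a minority instance and one of its neighbors, which is the linear behavior that a nonlinear, network-based generator is meant to go beyond.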
Keywords/Search Tags: Imbalance dataset, Ensemble Learning, Resampling, Neural Networks, BP algorithm