
Research And Application Of Ensemble Learning Based On Combined Resampling Methods

Posted on: 2012-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: G Q Liu
GTID: 2218330338465211
Subject: Software engineering
Abstract/Summary:
The classification of imbalanced datasets and ensemble learning are central research topics in machine learning. Most conventional classification techniques assume that all classes in the training data follow the same distribution and that every misclassification carries the same cost. Under these assumptions, a classifier is trained to maximize overall accuracy. Consequently, when applied to imbalanced data, traditional classifiers suffer a sharp drop in performance on the minority class and perform poorly in practical engineering applications.

Imbalanced data is typically characterized by an absolutely or relatively scarce minority class, noise that strongly interferes with learning, and many small, fragmented clusters of samples, so a single classifier can hardly achieve accurate results. Many approaches have been proposed to improve classification performance on imbalanced datasets, including data resampling, training-set partitioning, feature selection, cost-sensitive learning, classifier ensembling, and one-class learning. This thesis argues that classification performance cannot be improved by data-level processing or algorithmic improvement alone. Common resampling methods such as SMOTE suffer from sparsely distributed minority samples, blindness during data expansion, and loss of information from the majority class; ensemble methods such as AdaBoost are prone to overfitting or degradation of classifier performance. Improving the accuracy on the minority class without harming overall classification performance is therefore a worthwhile research problem.

This thesis addresses three questions: how to improve the distribution of imbalanced datasets, how to adapt the learning algorithm appropriately, and how to evaluate classifier performance properly.
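To make the data-expansion behavior of SMOTE concrete, the following is a minimal numpy-only sketch of its core idea, synthesizing minority samples by interpolating between a sample and one of its k nearest minority neighbors. It illustrates the standard SMOTE procedure discussed above, not the thesis's own implementation; the function name and parameters are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between each chosen sample and one of its k nearest minority neighbors
    (the core SMOTE idea; a simplified sketch, not the thesis code)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbor
    k = min(k, n - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # pick a random minority sample
        j = neighbors[i, rng.integers(k)]  # and one of its k neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority points, this "blind" interpolation is exactly what goes wrong when the minority class is sparsely distributed, which motivates the threshold-based adaptive neighborhood selection described below.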
A novel classifier-ensembling method named TSNIMA is proposed, combining a hybrid resampling technique with an improved AdaBoost algorithm. The combined resampling method adaptively selects the neighborhood size based on a threshold, so that when SMOTE is applied the minority class is expanded according to its data-distribution characteristics. This reduces the harmful effect of synthesizing data from sparsely distributed minority samples and lowers the imbalance degree of the training data. Because the weight-adjustment rule in the learning phase of AdaBoost is ill-suited to imbalanced data (the weights of samples of every class are adjusted according to the total error of the whole classifier), TSNIMA applies different weight-adjustment strategies to different classes. This effectively prevents boundary samples and noise from degrading classifier performance during learning, and thereby improves the recognition rate of minority-class samples. The methods were implemented and integrated into the Weka platform and evaluated on UCI benchmark datasets. The results show that TSNIMA outperforms competing algorithms such as SMOTEBoost and one-class learning methods.

A further contribution is the novel application of this combined resampling and ensemble-learning approach to the classification of tobacco-flavor data. The experiments show that the TSNIMA ensemble classifier performs better on tobacco-flavor data with a high degree of imbalance: compared with the other methods, TSNIMA achieves the lowest classification error on minority samples, while the classification accuracy on majority samples also increases slightly. Finally, when a decision tree is used as the TSNIMA base classifier, the model can also extract additional valuable rules for users.
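The per-class weight-adjustment idea can be sketched as a single AdaBoost-style weight update in which misclassified minority samples receive an extra up-weighting factor. This is only an illustration of the general class-aware strategy the abstract describes; the function name and the `boost` parameter are assumptions, not taken from the thesis.

```python
import numpy as np

def boost_weights(w, y, y_pred, alpha, minority_label, boost=1.5):
    """One AdaBoost-style weight update with a class-aware twist:
    misclassified minority samples are up-weighted by an extra factor.
    Illustrative sketch only; `boost` is a hypothetical parameter."""
    miss = y != y_pred                          # mask of misclassified samples
    factor = np.exp(alpha * miss.astype(float)) # standard AdaBoost up-weighting
    factor[miss & (y == minority_label)] *= boost  # extra boost for minority errors
    w = w * factor
    return w / w.sum()                          # renormalize to a distribution
```

After this update, a misclassified minority sample carries more weight than a misclassified majority sample, so the next base learner focuses more strongly on the minority class, which is the intended effect of class-dependent weight strategies.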
Practical application further confirms that the method is robust and has broad application value.
Keywords/Search Tags: imbalanced data, resampling, ensemble learning, classification, SMOTE, AdaBoost