
Research On The Explanation For The Effectiveness Of Classical Ensemble Learning Algorithms And The Improvement Of Their Performance

Posted on: 2017-03-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Sun
Full Text: PDF
GTID: 1318330536468284
Subject: Computer Science and Technology
Abstract/Summary:
How to effectively classify new examples with unknown class labels is an important research topic in data mining. Ensemble learning, a powerful technique for solving this problem, has been widely studied and successfully applied in many real applications since its invention, and it has become an important research branch of data mining. Several classical ensemble learning algorithms have been proposed in the literature, such as Bagging, AdaBoost and DECORATE, and researchers have produced fruitful work in this area. However, there is still no general theoretical tool that fully explains the effectiveness of these algorithms, and under specific training scenarios some of them cannot achieve satisfactory performance. This thesis addresses these issues; its main contributions are summarized as follows.

(1) Since different ensemble learning algorithms are proposed from different perspectives, they naturally have different working mechanisms. A theoretical analysis of the effectiveness of the existing classical algorithms therefore deepens our understanding of them and, more importantly, helps identify a more general theoretical tool capable of explaining the effectiveness of ensemble learning algorithms, which is of theoretical significance for the design of new, effective algorithms. Inspired by the margin theory, this thesis adopts it to empirically analyze and compare the effectiveness of the three most representative classical ensemble learning algorithms: Bagging, AdaBoost and DECORATE. Experimental results demonstrate that, for each investigated algorithm, the better the margin distribution it generates on the training set, the higher the accuracy it obtains on the test set. That is, the margin theory successfully explains the effectiveness of all three algorithms, supporting the conclusion that it is a general theoretical tool for explaining the effectiveness of ensemble learning algorithms. Based on this finding, the thesis suggests using the margin distribution as an optimization goal when designing new ensemble learning algorithms.

(2) To obtain satisfactory generalization performance, most ensemble learning algorithms generate a large number of base classifiers to compose an ensemble. However, the resulting ensemble may contain low-accuracy or mutually similar classifiers, which increases storage and computation costs and decreases the classification efficiency and generalization performance of the ensemble. To solve this problem, this thesis proposes an average-margin ordering-based base classifier selection method that selects a near-optimal subset of classifiers from the initial ensemble, using the average margin as the measure for evaluating individual classifiers. The thesis also conducts a comprehensive comparison between the average margin and two commonly used evaluation measures, accuracy and diversity. Experimental results indicate that the proposed selection method effectively improves the classification efficiency and generalization performance of the initial ensemble, and that the average margin is a better evaluation measure than accuracy and diversity. This work has important theoretical and practical significance for improving the performance of classification tasks in data mining.
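To make the margin notion in contribution (1) concrete: for a voting ensemble, the margin of an example is commonly defined as the fraction of base classifier votes for the true class minus the largest fraction received by any other class, so it lies in [-1, 1] and is positive exactly when the ensemble votes correctly. Below is a minimal Python sketch of this computation; the function and variable names are illustrative, not taken from the thesis.

```python
import numpy as np

def vote_margins(votes, y):
    """Voting margin of each example, in [-1, 1].
    votes: (n_examples, n_classifiers) array of predicted labels.
    y:     (n_examples,) array of true labels."""
    n_examples, n_classifiers = votes.shape
    margins = np.empty(n_examples)
    for i in range(n_examples):
        # fraction of base classifiers voting for each class
        labels, counts = np.unique(votes[i], return_counts=True)
        fracs = dict(zip(labels, counts / n_classifiers))
        correct = fracs.get(y[i], 0.0)
        wrong = max((f for c, f in fracs.items() if c != y[i]), default=0.0)
        margins[i] = correct - wrong
    return margins
```

Summarizing these per-example margins over the training set yields the margin distribution that the thesis correlates with test accuracy.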
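The abstract names the average margin as the evaluation measure in contribution (2) but does not give the exact selection procedure. The following sketch shows one plausible ordering-based scheme under that assumption: greedily order the base classifiers so that each prefix maximizes the sub-ensemble's average margin, then keep a fixed-size prefix as the pruned ensemble. All names are hypothetical, not the thesis's code.

```python
import numpy as np

def avg_margin(votes, y, idx):
    """Average voting margin of the sub-ensemble formed by the base
    classifiers whose column indices are in idx."""
    sub = votes[:, idx]
    total = 0.0
    for i in range(len(y)):
        labels, counts = np.unique(sub[i], return_counts=True)
        fracs = dict(zip(labels, counts / len(idx)))
        correct = fracs.get(y[i], 0.0)
        wrong = max((f for c, f in fracs.items() if c != y[i]), default=0.0)
        total += correct - wrong
    return total / len(y)

def prune_by_avg_margin(votes, y, subset_size):
    """Greedy ordering: repeatedly add the classifier that maximizes the
    sub-ensemble's average margin, then keep the first subset_size."""
    remaining = list(range(votes.shape[1]))
    chosen = []
    while remaining and len(chosen) < subset_size:
        best = max(remaining,
                   key=lambda t: avg_margin(votes, y, chosen + [t]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```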
(3) In some multi-class classification problems, the training sets contain many 'noisy' examples whose class labels are incorrect. AdaBoost is very sensitive to such mislabeled examples and prone to over-fitting them, which makes it non-robust to label noise. To solve this problem, this thesis proposes Rob_MulAda, a robust multi-class AdaBoost algorithm for mislabeled training data. In Rob_MulAda, a noise-detection-based multi-class loss function is formally designed and its minimization problem is solved by proving a proposition; in addition, a new weight-updating scheme is presented to alleviate the harmful effect of mislabeled examples. Rob_MulAda is empirically compared with several related algorithms under varying mislabeling rates, and the experimental results illustrate that it markedly improves the robustness of AdaBoost against label noise in the multi-class setting.

(4) In many practical applications, the collected training sets have imbalanced class distributions. Because most base classifier learning algorithms assume a roughly balanced class distribution, the classifiers they produce on a class-imbalanced training set tend to generalize poorly, especially on minority-class examples. Exploiting the advantage of ensemble learning in improving the generalization performance of individual classifiers, this thesis proposes EUS-Bag, an evolutionary under-sampling based Bagging ensemble algorithm for the class-imbalanced scenario. In EUS-Bag, to make evolutionary under-sampling (EUS) better suited to the Bagging framework so that a set of accurate and diverse individual classifiers can be generated, a new fitness function considering three factors is designed to combine the advantages of EUS and Bagging. Comparison experiments on class-imbalanced data sets demonstrate the satisfactory performance of the proposed algorithm.
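The abstract says Rob_MulAda pairs a noise-detection-based multi-class loss with a new weight-updating scheme, but spells out neither. Purely as an illustration of the general idea, and emphatically not the thesis's actual rule, the sketch below modifies the standard SAMME multi-class AdaBoost update so that examples flagged as suspected label noise (here, naively, those misclassified in most earlier rounds) have their weights damped instead of boosted.

```python
import numpy as np

def noise_aware_update(w, miss, miss_history, n_rounds, alpha, damp=0.5):
    """Hypothetical noise-aware reweighting for multi-class AdaBoost.
    w:            current example weights (sums to 1)
    miss:         bool array, True where this round's classifier erred
    miss_history: per-example count of past rounds with errors
    n_rounds:     number of boosting rounds completed so far
    alpha:        SAMME classifier weight, ln((1-err)/err) + ln(K-1)
    Examples misclassified in most past rounds are treated as suspected
    label noise and damped rather than boosted."""
    suspected = miss_history > 0.5 * n_rounds          # naive noise detector
    factor = np.where(miss, np.exp(alpha), 1.0)        # standard boosting step
    factor = np.where(miss & suspected, damp, factor)  # damp suspected noise
    w = w * factor
    return w / w.sum()                                 # renormalize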
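Contribution (4)'s EUS-Bag relies on an evolutionary search with a three-factor fitness function that the abstract does not detail. As a hedged stand-in that shows only the overall structure, the sketch below replaces the evolutionary under-sampling with plain random under-sampling of the majority class inside each Bagging round; binary labels 0/1 and all names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersampled_bagging(X, y, n_estimators=20, seed=0):
    """Bagging with per-round under-sampling of the majority class.
    Stand-in for EUS-Bag's evolutionary search: each round draws a
    balanced sample, so every base learner sees both classes equally."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assume 1 labels the minority class
    majority = np.flatnonzero(y == 0)
    ensemble = []
    for _ in range(n_estimators):
        maj = rng.choice(majority, size=len(minority), replace=False)
        mino = rng.choice(minority, size=len(minority), replace=True)  # bootstrap
        idx = np.concatenate([maj, mino])
        clf = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        ensemble.append(clf.fit(X[idx], y[idx]))
    return ensemble

def predict_majority(ensemble, X):
    """Plain majority vote over the under-sampled base classifiers."""
    votes = np.stack([clf.predict(X) for clf in ensemble])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Under-sampling inside each round, rather than once up front, preserves Bagging's diversity: every base learner sees a different balanced subset of the majority class.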
Keywords/Search Tags: Ensemble Learning, Bagging, AdaBoost, DECORATE, Margin Theory, Ensemble Pruning, Robustness, Class Imbalance Problem