Font Size: a A A

The Research On Ensemble Pruning And Its Application In On-Line Machine Learning

Posted on:2011-02-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:Q L ZhaoFull Text:PDF
GTID:1118330341451746Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Machine learning has been a key technique in many applications. It analyzes data and obtains information and knowledge, which will be used to direct the future decisions. With the rapid development of the Internet, it becomes very easy to acquire data in many applications. But how to get useful information from the increasing data becomes a hard problem to triditionoal machine learning techniques. For many the Internet-based applications such as on-line business, on-line advertising, computational finance and search engines, it is very important to develop new methods of machine learning, which can process numerous and continuously accumulating data rapidly and efficiently.On-line machine learning is an effective way to analyze enormous data on-line, and the prediction performance and prediction efficiency of the classifiers are the main metrics to evaluate an on-line algorithm. As one of the most important on-line learning strategies, incremental learning can be organized into two categories: single-classifier based and ensemble based incremental learning. In recent literature, the methods of single-classifier based incremental learning usually get low generalization because it inclines to overfitting the training data. But on the other hand, the methods of ensemble based incremental learning lead to increasing size for the target ensemble and to more and more spatial and temporal overhead for predicting as running of the system.Ensemble pruning has been proved to be a promising way to improve the prediction performance and efficiency for ensembles. This thesis aims at supervised classification problems, and proposes a new idea that applying ensemble pruning techniques to on-line learning. In the thesis, a new on-line learning model is presented at first. Then, several key techniques upon this model are discussed in depth. The main contributions of the thesis are as follows:1. A new on-line learning model, EPIL, is proposed.Focusing on the requirements of user applications and the limitations of current on-line learning methods, this thesis presents a new on-line learning model, EPIL (Ensemble Pruning based Incremental Learning), and discusses the technical challenges for this model. In EPIL, when a new incremental dataset arrives, a number of base classifiers are generated at first. Next, by a local ensemble pruning only the base classifiers with higher prediction performance are added to the target ensemble. Then if the condition is satisfied, the target ensemble is pruned by a global ensemble pruning. The goal of EPIL model is to acquire a small target ensemble with good on-line learning capability, that is, higher prediction performance and lower prediction cost.2. The strategy and framework of pattern-mining based ensemble pruning are explored.For the performance and efficiency challenges of ensemble pruning in EPIL, the strategy of pattern-mining based ensemble pruning is proposed. In the proposed scheme, ensemble pruning is represented as a pattern mining problem, so the techniques of transaction process and pattern mining can be used in base classifiers selection. This strategy initiates a new direction for the research of ensemble pruning. Finally the framework for the algorithms of the strategy is presented, and the related techniques are discussed in detail.3. Two ensemble pruning algorithms based on the idea of pattern mining are presented.At first the thesis introduces a new concept: coverage-based pattern mining. Then based on this concept, two ensemble pruning algorithms, CPM-EP and PMEP are put forward. Both the algorithms use majority voting principle to get candidate sub-patterns, and applies coverage pattern mining and a greedy strategy to acquire candidate target ensemble But to get the candidate sub-patterns PMEP builds a FP-Tree on the transaction database, then from the tree the candidate sub-patterns for all possible ensemble size are acquired, without accessing to the original transaction database any more, so much time saved. Experimental results show that both CPM-EP and PMEP achieves good prediction accuracy, low pruning time and small ensemble size. Additionally, the pruning time of PMEP is even better than CPM-EP. The results proved that pattern mining is an effective strategy for ensemble pruning.4. An algorithm of Bagging based incremental learning is given.Base classifiers generating is the foundation of EPIL model. In order to work with heterogeneous base classifiers, this thesis presents a new ensemble based incremental learning method, Bagging++, which is inspired by the famous ensemble method, Bagging. Then to improve the diversity of base classifiers, a simple algorithm to generating heterogeneous base classifiers is provided. Experimental results show that Bagging++ has better generalization than all other related methods. Additionally, the utilization of heterogeneous base classifiers improves the prediction performance of target ensemble further.5. An ensemble pruning based incremental learning method is put forward.The thesis proposes the idea that applies ensemble pruning to ensemble incremental learning. Then two related algorithms are derived from it and applied to Bagging++. LP-Bagging++ only prunes local base classifiers produced during each incremental learning step, while MP-Bagging++ adds a phase of global pruning for all base classifiers of the target ensemble. Experimental results show that LP-Bagging++ and MP-Bagging++ have similar prediction performance, but since MP-Bagging++ can eliminate redundant base classifiers from the global aspect, it reduces the prediction cost sufficiently. The experimental results show that combination of the local and the global pruning is an effective way to improve both the performance and the efficiency for ensemble-based incremental learning.6. A development platform for ensemble learning is designed and implemented Based on the works mentioned above, an open source platform called LibEP is designed and implemented. This platform hosts the common algorithms from all facets of ensemble learning, including functions of data manipulating, base classifier generating, ensemble pruning, incremental learning and performance evaluating etc. LibEP has simple user interfaces, which is easy to use and to be integrated into user applications. The platform was implemented by standard C++ language, so it is OS-independent and extensible with high performance.From the facets of modeling, algorithm designing and experimenting, this thesis presented a new on-line machine learning method which combines the techniques of ensemble pruning and incremental learning. In the future research, the author will try to apply this technique to the real world applications which need to process data with high performance and high efficiency.
Keywords/Search Tags:On-line learning, ensemble pruning, incremental learning, pattern mining, transaction database, coverage-based pattern mining
PDF Full Text Request
Related items