
Study On Feature Selection And Ensemble Learning Based On Feature Selection For High-Dimensional Datasets

Posted on: 2005-06-22    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L X Zhang    Full Text: PDF
GTID: 1118360152468081    Subject: Computer Science and Technology
Abstract/Summary:
The emergence of high-dimensional machine learning fields such as image processing, information retrieval and bioinformatics poses severe challenges to existing feature selection and machine learning algorithms. This dissertation studies feature selection, and ensemble learning based on feature selection, for high-dimensional datasets. Its main contributions are:

(1) Two two-phase combined feature selection algorithms are designed around the Relief evaluation algorithm: one follows a filter-filter model, the other a filter-wrapper model. In the filter-filter model, the first phase uses Relief to filter out irrelevant features, and the second phase uses correlation analysis to remove redundant features. In the filter-wrapper model, the first phase is the same, while the second phase removes redundant features by backward sequential search, using the performance of the induction algorithm that will be applied after feature selection as the evaluation of candidate feature subsets. Experiments on artificial and real datasets show that the filter-wrapper model outperforms the filter-filter model in accuracy but is much slower, and experiments on artificial datasets show that the filter-filter model removes all or nearly all redundant features.

(2) Based on the respective merits and demerits of Relief and the genetic algorithm within the wrapper model, a coupling of the two is proposed in which the Relief feature weights guide the initialization of the genetic population. The coupling aims to improve the efficiency of the genetic algorithm, which uses classifier performance as the evaluation of feature subsets. Experiments on 17 relatively high-dimensional datasets show that the algorithm achieves good overall performance in terms of accuracy, feature-subset size, and efficiency.

(3) Taking into account both the accuracy of individual classifiers and the diversity among them, the dissertation proposes ReFeatEn, an ensemble learning algorithm based on two-phase feature selection for high-dimensional datasets. Experiments confirm that on high-dimensional datasets the accuracy of ReFeatEn is always higher than or comparable to that of Bagging, Boosting, and the random-subspace ensemble algorithm RandFeatEn. ReFeatEn is also far more efficient than Bagging and Boosting and can be run in parallel, making it well suited to high-dimensional problems.

(4) A scheme for embedding feature selection into the Boosting algorithm is proposed, together with a general algorithmic structure, and corresponding ensemble learning algorithms are designed for the naïve Bayes classifier and the nearest-mean classifier. Experimental results and analysis show that this coupled algorithm alleviates Boosting's sensitivity to noisy features and samples, attains accuracy markedly higher than standard Boosting, and is robust and easy to extend to other classifiers.
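To make the filter-filter model of contribution (1) concrete, the following is a minimal sketch in Python, not code from the dissertation: it assumes a numeric, binary-class dataset held in NumPy arrays, and the helper names relief_weights and drop_redundant, the sampling count, and the relevance and correlation thresholds are illustrative choices rather than values taken from the thesis.

```python
# Sketch of the two-phase filter-filter idea: Relief weights to drop
# irrelevant features, then correlation analysis to drop redundant ones.
import numpy as np

def relief_weights(X, y, n_samples=100, seed=None):
    """Phase 1: estimate feature relevance with the basic Relief update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    # scale features so per-feature differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(Xs - Xs[i]).sum(axis=1)
        dist[i] = np.inf                      # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class
        # reward features that separate the classes, penalise those that do not
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n_samples

def drop_redundant(X, weights, keep, corr_threshold=0.9):
    """Phase 2: among retained features, remove highly correlated ones,
    keeping the higher-weighted feature of each correlated pair."""
    selected = []
    for f in sorted(keep, key=lambda f: -weights[f]):
        corr = [abs(np.corrcoef(X[:, f], X[:, g])[0, 1]) for g in selected]
        if all(c < corr_threshold for c in corr):
            selected.append(f)
    return selected

# Usage on synthetic data: two informative features, one redundant copy, noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
informative = np.column_stack([y + rng.normal(0, .3, 300),
                               -y + rng.normal(0, .3, 300)])
redundant = informative[:, :1] + rng.normal(0, .05, (300, 1))  # copy of feature 0
noise = rng.normal(size=(300, 7))
X = np.hstack([informative, redundant, noise])

w = relief_weights(X, y, seed=0)
relevant = [f for f in range(X.shape[1]) if w[f] > 0.05]  # relevance threshold
print("after phase 1 (Relief filter):", relevant)
print("after phase 2 (redundancy removal):", drop_redundant(X, w, relevant))
```

The filter-wrapper variant described in the abstract would replace the correlation-based second phase with a backward sequential search that scores each candidate subset by the accuracy of the induction algorithm to be used afterwards, which is why it is reported as more accurate but much slower.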
Keywords/Search Tags:Feature selection, high-dimensional, ensemble learning, Relief, genetic algorithm