
Research On Feature Selection For Machine Learning

Posted on: 2014-02-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Sun    Full Text: PDF
GTID: 1228330395496534    Subject: Computer application technology
Abstract/Summary:
Machine learning is the process of analyzing data from different perspectives and distilling it into useful information. The performance of a learning model built from training samples depends largely on the quality of the dataset. With the emergence of new computer applications such as social network clustering, gene expression array analysis and combinatorial chemistry, datasets are growing larger and larger. Nevertheless, most of the features in such huge datasets are irrelevant or redundant, which leads traditional mining and learning algorithms to low efficiency and over-fitting. One effective way to mitigate this problem is to reduce the dimensionality of the feature space with feature selection techniques.

Feature selection, also known as variable selection, is one of the fundamental problems in machine learning, pattern recognition and statistics. It aims at finding a good feature subset that yields higher classification accuracy, and it brings many benefits to learning algorithms: reducing measurement cost and storage requirements, coping with the degradation of classification performance caused by the finiteness of training sample sets, reducing training and utilization time, and facilitating data visualization and understanding. Feature selection has attracted great attention, many selection algorithms have been developed over the past years, and previous reviews can be found in the literature. Generally, these algorithms fall into three categories: embedded, wrapper and filter methods. Filter methods are independent of the learning algorithm and assess the relevance of features by looking only at the intrinsic properties of the data. In practice, filter methods have much lower computational complexity than the others while achieving comparable classification accuracy for most classifiers, so they are very popular for high-dimensional datasets. Among the various feature selection algorithms, information-theoretic methods achieve excellent performance and have drawn more and more attention. However, most of these selectors discard features that are highly correlated with the already selected ones even when they are relevant to the target class, and are therefore likely to ignore features that have strong discriminatory power as a group but are weak as individuals. The main reason for this disadvantage is that information-theoretic measurements disregard the intrinsic structure among features.

To untie this knot, this work focuses on how to select a feature subset with maximal relevance, maximal interdependence and minimal redundancy for machine learning. The thesis proposes two different kinds of feature selection algorithms and one optimization method for information-theoretic feature selection algorithms, and it also introduces a gene selection algorithm for cancer diagnosis. The main contributions and innovative points are as follows:

(1) First, a comprehensive overview of the state of the art in feature selection algorithms is given, followed by an analysis and discussion of the problems faced by current filter selection algorithms. This discussion and analysis lay a solid foundation for the subsequent research work.

(2) The thesis designs a feature evaluation and selection framework based on the Banzhaf power index. The framework first introduces a cooperative-game-theoretic scheme to evaluate the power of each feature, in order to overcome the disadvantage that traditional information-theoretic selectors ignore features that have strong discriminatory power as a group but are weak as individuals. A filter selection process with the mRMR criterion is then applied to carry out the actual feature selection. Experimental results show that the proposed method works well; its efficiency and effectiveness compared with other algorithms, evaluated with four classifiers, suggest that it is practical for feature selection on high-dimensional data.
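To make the filter idea concrete, the following is a minimal sketch of an mRMR-style greedy selector in Python. It assumes discretized features and uses scikit-learn's mutual_info_score; the optional power argument stands in for a game-theoretic importance score such as a Banzhaf-style index, which the thesis derives from a cooperative game rather than in the simplified way shown here. All names in the sketch are illustrative.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # I(X; Y) for two discrete vectors

def mrmr_select(X, y, k, power=None):
    """Greedy mRMR-style filter: pick features maximizing relevance I(f; y)
    minus mean redundancy I(f; s) with the already selected features s.

    X     : (n_samples, n_features) array of *discretized* features
    y     : (n_samples,) class labels
    k     : number of features to select
    power : optional per-feature weights (e.g. a game-theoretic power index)
            used to rescale the relevance term; uniform if None.
    """
    n_features = X.shape[1]
    power = np.ones(n_features) if power is None else np.asarray(power, dtype=float)

    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    relevance *= power

    selected, remaining = [], list(range(n_features))
    for _ in range(k):
        best_j, best_score = None, -np.inf
        for j in remaining:
            redundancy = (np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
                          if selected else 0.0)
            score = relevance[j] - redundancy  # mRMR "difference" criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

In this sketch the power weight merely rescales the relevance term; its only purpose is to show where a game-theoretic feature score can enter an information-theoretic filter.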
(3) Considering that many outstanding feature selection algorithms already exist, the thesis introduces an optimization method for feature selection algorithms based on cooperative game theory. A feature evaluation algorithm based on the Shapley value is proposed to weight each feature according to its influence on the intricate, intrinsic interrelations among features. Compared with the Banzhaf power index, the Shapley value favors smaller winning coalitions, which is helpful when small feature subsets are desired. Moreover, approximate joint mutual information and joint conditional mutual information are introduced to evaluate the interdependence and redundancy among features (an illustrative sketch of Shapley-value weighting is given after this summary). Experimental results suggest that the proposed framework is practical for optimizing feature selection algorithms.

(4) The thesis also presents a feature selection algorithm based on dynamic weights. It first introduces a new scheme for feature relevance, interdependence and redundancy analysis using information-theoretic criteria. A dynamic-weighting feature selection algorithm is then presented, which not only selects the most relevant features and eliminates redundant ones, but also tries to retain useful intrinsic groups of interdependent features. The primary characteristic of the method is that each feature is weighted according to its interaction with the already selected features, and the weights are dynamically updated after each candidate feature is selected (a simplified sketch of such a weighting loop also follows the summary). The experimental results indicate that the proposed method achieves promising improvements in feature selection and classification accuracy.

(5) Microarray analysis is widely accepted for human cancer diagnosis and classification; however, the high dimensionality of microarray data poses a great challenge to classification. The thesis therefore introduces a new gene selection method for cancer diagnosis and classification that retains useful intrinsic groups of interdependent genes. The primary characteristic of this method is that the relevance between each gene and the target is dynamically updated whenever a new gene is selected. The effectiveness of the method is validated by experiments on six publicly available microarray datasets: excellent classification accuracies are achieved by selecting key genes with the proposed algorithm, and the gene subset selected by the DRGS method is much more enriched in gene sets related to cancer.

These studies not only promote the further development and application of feature selection algorithms, but also suggest a new point of view for improving classification performance by selecting independent feature subsets. They therefore have important theoretical significance and application value.
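As a rough illustration of how a Shapley-style feature weight can be estimated (this abstract does not specify the game or the characteristic function used in the thesis), the sketch below uses standard Monte Carlo permutation sampling. The value_fn argument is a hypothetical, user-supplied function that scores a feature subset, for example by cross-validated accuracy or an estimate of joint mutual information with the class.

```python
import random

def shapley_weight(feature, all_features, value_fn, n_samples=200, seed=0):
    """Monte Carlo estimate of the Shapley value of `feature`.

    value_fn(subset) -> float is a user-supplied characteristic function
    (e.g. how well the feature subset predicts the class); it is a
    placeholder here, not the measure defined in the thesis.
    """
    rng = random.Random(seed)
    others = [f for f in all_features if f != feature]
    total = 0.0
    for _ in range(n_samples):
        perm = others[:]
        rng.shuffle(perm)
        cut = rng.randint(0, len(perm))  # size of the coalition preceding `feature`
        coalition = perm[:cut]
        # marginal contribution of `feature` to this random coalition
        total += value_fn(coalition + [feature]) - value_fn(coalition)
    return total / n_samples
```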
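The dynamic-weighting idea of contributions (4) and (5) can likewise be sketched as a greedy loop in which the weight of every remaining feature is adjusted after each selection. The cmi helper and the particular update rule below are simplified assumptions for illustration only, not the formulas defined in the thesis.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def cmi(x, z, y):
    """Conditional mutual information I(x; y | z) for discrete vectors."""
    total = 0.0
    for v in np.unique(z):
        mask = (z == v)
        total += mask.mean() * mutual_info_score(x[mask], y[mask])
    return total

def dynamic_weight_select(X, y, k):
    """Greedy selection with dynamically updated feature weights.

    After each pick s, every remaining feature j is re-scored by adding
    I(j; y | s) - I(j; s), so features that stay informative given the
    selected set (interdependent but non-redundant) keep high weights.
    """
    y = np.asarray(y)
    n_features = X.shape[1]
    weights = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected, remaining = [], set(range(n_features))
    for _ in range(k):
        j_best = max(remaining, key=lambda j: weights[j])
        selected.append(j_best)
        remaining.remove(j_best)
        for j in remaining:  # dynamic weight update against the newly selected feature
            weights[j] += cmi(X[:, j], X[:, j_best], y) - mutual_info_score(X[:, j], X[:, j_best])
    return selected
```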
Keywords/Search Tags:Machine learning, Pattern recognition, Feature selection, Information theory, Cooperative game theory, Banzhaf power index, Shapley value, Gene selection