An empirical study of feature selection in binary classification with DNA microarray data

Posted on:2006-04-06

Degree:Ph.D

Type:Dissertation

University:Rice University

Candidate:Lecocke, Michael Louis

GTID:1458390008466477

Subject:Statistics

Abstract/Summary:

Motivation. Binary classification is a common problem in many types of research, including clinical applications of gene expression microarrays. This research comprises a large-scale empirical study involving a rigorous and systematic comparison of classifiers, in terms of supervised learning methods and both univariate and multivariate feature selection approaches. Other principal areas of investigation are the use of cross-validation (CV) and how to guard against the effects of optimism and selection bias when assessing candidate classifiers via CV. These effects are taken into account by ensuring that feature selection is performed during training of the classification rule at each stage of the CV process ("external CV"), rather than once on all samples before CV begins ("internal CV"). To date, external CV has not been the traditional approach to performing cross-validation.

Results. A large-scale empirical comparison study is presented in which a 10-fold CV procedure is applied internally and externally to a univariate feature selection process and to two genetic algorithm (GA)-based feature selection processes. These procedures are used in conjunction with six supervised learning algorithms across six published two-class clinical microarray datasets. External CV generally provided more realistic and honest misclassification error rates than internal CV. Also, although the more sophisticated multivariate feature subset selection (FSS) approaches were able to select gene subsets that went undetected via combinations of genes from even the top 100 univariately ranked genes, neither of the two GA-based methods led to significantly better 10-fold internal or external CV error rates. Considering the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average bias estimates from each of the GA-based methods were roughly 2.5 times those of the univariate-based method.

Ultimately, this research has put to the test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection, and has provided a solid foundation on which these issues can and should be further investigated when performing limited-sample classification studies using high-dimensional gene expression data.
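
The distinction between internal and external CV can be illustrated with a minimal sketch. It is not the dissertation's code: the synthetic pure-noise "expression" data, the t-statistic ranking, the nearest-centroid classifier, and all function names below are assumptions chosen only to keep the example short and dependency-free. Because the labels are unrelated to the data, any error rate well below 50% reflects selection bias rather than real signal.

```python
import random

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def t_score(g0, g1):
    """Absolute two-sample t-like statistic for one gene."""
    se = (var(g0) / len(g0) + var(g1) / len(g1)) ** 0.5
    return abs(mean(g0) - mean(g1)) / (se + 1e-12)

def select_top_genes(X, y, k):
    """Univariate ranking: indices of the k genes with the largest score."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append((t_score(g0, g1), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def centroid_predict(X_tr, y_tr, x, genes):
    """Nearest-centroid rule restricted to the selected genes."""
    dists = []
    for lab in (0, 1):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        centroid = [mean([r[j] for r in rows]) for j in genes]
        dists.append(sum((x[j] - c) ** 2 for j, c in zip(genes, centroid)))
    return 0 if dists[0] <= dists[1] else 1

def cv_error(X, y, k_genes, n_folds, external):
    """Misclassification rate from n_folds-fold CV.

    external=False: genes are selected once on ALL samples (internal CV).
    external=True:  genes are re-selected on each training fold (external CV).
    """
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    genes_all = None if external else select_top_genes(X, y, k_genes)
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        genes = select_top_genes(X_tr, y_tr, k_genes) if external else genes_all
        errors += sum(centroid_predict(X_tr, y_tr, X[i], genes) != y[i]
                      for i in test_idx)
    return errors / n

# Hypothetical data: 40 samples, 1000 noise "genes", labels carry no signal.
rng = random.Random(0)
n_samples, n_genes = 40, 1000
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [0] * 20 + [1] * 20  # with 10 folds, each fold holds 2 samples per class

internal_err = cv_error(X, y, k_genes=10, n_folds=10, external=False)
external_err = cv_error(X, y, k_genes=10, n_folds=10, external=True)
selection_bias = external_err - internal_err
print(f"internal CV error: {internal_err:.3f}")
print(f"external CV error: {external_err:.3f}")
print(f"estimated selection bias: {selection_bias:.3f}")
```

The difference between the two error rates is one simple way to estimate selection bias. On data like this, the internal estimate would be expected to look optimistic while the external estimate stays near chance, which is the pattern the abstract describes.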

Keywords/Search Tags:

Classification, Feature selection, Gene, External CV, Empirical

Related items

1	Selected Based On The Gene Expression Profiles Of Tumor Characteristic Gene Studies
2	Study On Feature Selection Method For Classification Of Gene Expression Data
3	The Application Of Feature Selection In Gene Expression Data Analysis
4	Research On Feature Gene Selection Method Based On Genetic Algorithm
5	Research On Feature Gene Selection Method Based On Sample Weighting
6	Application Of AdaBoost In Gene Expression Data Classification
7	Research On Gene Selection Based On Max-Relevance And Min-Redundancy Feature Selection Algorithm
8	Research On Feature Selection For Classification In Microarray Gene Expression Data
9	Gene Name Recognition Feature Selection Methods In Biomedical Research Text
10	Study Of Efficient Feature Selection And Classification Methods For Gene Expression Microarray Datasets