Integrated feature subset selection/extraction with applications in bioinformatics

Posted on:2007-01-09

Degree:Ph.D

Type:Thesis

University:State University of New York at Buffalo

Candidate:Xu, Xian

GTID:2448390005970991

Subject:Computer Science

Abstract/Summary:

Feature subset selection and extraction algorithms are studied extensively in the machine learning literature as a means of reducing the dimensionality of the feature space, since many machine learning and pattern recognition algorithms handle high-dimensional data sets neither efficiently nor effectively. In the analysis of large-scale bioinformatics data sets, such as microarray gene expression data sets, the high dimensionality of the feature space, compounded by the small number of samples, creates even more problems for data analysis algorithms.

Two foremost characteristics of microarray gene expression data sets are (1) the correlation between features (genes) and (2) the availability of domain knowledge in computable format. This dissertation studies effective feature selection and extraction algorithms with applications to the analysis of newly emerging data sets in the bioinformatics domain. Microarray gene expression data, the result of large-scale RNA profiling techniques, are the primary focus of the thesis. Several novel feature (gene) selection and extraction algorithms are proposed to deal with the peculiarities of microarray gene expression data.

To address the first characteristic, we first propose a general feature selection algorithm called Boost Feature Subset Selection (BFSS), based on permutation analysis, to broaden the scope of the selected gene set and thus improve classification performance. In BFSS, each subsequent feature is selected with a focus on those samples where the previously selected features fail. Our experiments show the benefit of BFSS for single-gene scores based on the t-score and S2N (signal-to-noise) on a variety of publicly available microarray gene expression data sets.

We then examine the correlations among features (genes) explicitly to see whether such correlations are informative for sample classification. This results in our gene extraction algorithm, called virtual gene. A virtual gene is a group of genes whose expression levels are combined linearly. The combined expression levels of a virtual gene, instead of the real gene expression levels, are used for sample classification. Our experiments confirm that by taking the correlations between gene pairs into consideration, we can indeed build a better sample classifier.

A microarray gene expression data set represents only one aspect of our knowledge of the underlying biological system. A large amount of biological knowledge is currently available in computable format on the Internet. To address the second characteristic of microarray gene expression data, we investigate the integration of domain knowledge, such as that embedded in Gene Ontology annotations, into gene selection and extraction. (Abstract shortened by UMI.)
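
As a rough illustration of two ideas mentioned in the abstract, the Python sketch below computes a signal-to-noise (S2N) single-gene score and forms a virtual gene as a linear combination of two genes. It is not the thesis's implementation: the S2N form used here is the commonly cited (mean difference) / (sum of standard deviations), the function names `s2n_score` and `virtual_gene` are hypothetical, the weights +1 and -1 are chosen by hand rather than learned, and the expression values are invented toy data.

```python
from statistics import mean, stdev


def s2n_score(class_a, class_b):
    """Signal-to-noise score of one gene across two sample classes.

    Assumed form: (mean_a - mean_b) / (sd_a + sd_b), using sample
    standard deviations. Each class needs at least two samples.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def virtual_gene(expression_rows, weights):
    """Combine several genes linearly, sample by sample.

    expression_rows: one list of expression levels per gene (same samples).
    weights: one coefficient per gene (hand-picked here, an assumption).
    """
    n_samples = len(expression_rows[0])
    return [
        sum(w * row[j] for w, row in zip(weights, expression_rows))
        for j in range(n_samples)
    ]


# Invented toy data: gene 1 carries the class signal plus shared noise,
# gene 2 carries mostly the same noise, so the two genes are correlated.
gene1_a, gene1_b = [3.0, 4.0, 5.0], [1.0, 2.0, 3.0]
gene2_a, gene2_b = [1.0, 2.0, 3.2], [1.0, 2.1, 3.0]

# Virtual gene = gene1 - gene2 (hypothetical weights +1 and -1).
virt_a = virtual_gene([gene1_a, gene2_a], [1.0, -1.0])
virt_b = virtual_gene([gene1_b, gene2_b], [1.0, -1.0])

print("S2N gene 1:      ", s2n_score(gene1_a, gene1_b))
print("S2N gene 2:      ", s2n_score(gene2_a, gene2_b))
print("S2N virtual gene:", s2n_score(virt_a, virt_b))
```

In this toy case the virtual gene separates the two classes more cleanly than either gene alone, because subtracting the correlated gene cancels the shared noise. How the thesis actually chooses gene groups and weights is not stated in the abstract.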

Keywords/Search Tags:

Selection, Extraction, Feature, Gene, Subset

Related items

1	Research On Feature Selection Algorithm Base On Gene Expression Data
2	Relevant gene subset selection: The maximum margin criterion in SVM and genetic algorithm
3	SVM Based Feature Selection Algorithms For Classification
4	Research On Feature Gene Selection Method Based On Genetic Algorithm
5	Research On Feature Gene Selection Method Based On Sample Weighting
6	Research On Extraction Of Feature Gene Subset Based On A Hybrid Between Genetic Arithmetic And Support Vector Machines
7	Research And Application Of Integrated Feature Selection Algorithm Based On Extreme Learning Machine
8	Research On Representative Feature Subset Generation Methods For Pedestrian Detection
9	Study On Selection For Feature Gene Subset In Microarray Expression Profiles Based On A SVM And GA Hybrid Algorithm
10	Research On Gene Selection Based On Max-Relevance And Min-Redundancy Feature Selection Algorithm