
Research On Feature Selection For Classification In Microarray Gene Expression Data

Posted on: 2009-11-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L J Zhang
GTID: 1118360278956589
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, rapid advances in microarray technology have made it possible to measure the expression levels of thousands or tens of thousands of genes simultaneously in a single experiment. (Gene expression data obtained through microarray technology is called microarray gene expression data.) Such high-throughput capability offers great opportunities for collecting gene expression data but also poses great challenges for mining it.
Classification is an important task in microarray gene expression data mining, where the purpose is to classify diseases or predict the diagnostic categories of samples from their gene expression data. The classification process itself follows the conventional procedure. The task is nevertheless more challenging than usual, because the number of genes is large while the number of samples is very small. As a consequence, it is essential to identify the informative genes (or features) that contribute most to classification. Building on previous feature selection techniques and the characteristics of microarray gene expression data, this thesis studies in depth several key problems of feature selection for such data.
In feature selection, feature relevance is a central notion that reflects the contribution of a feature to classification. Many feature selection algorithms are based directly on this notion and use some relevance measure to estimate the goodness of a feature subset in one way or another. Although the notion is widely used, no existing definition of feature relevance is satisfactory, the measures of feature relevance vary, and the relationship between feature relevance and feature selection has not been adequately described.
In this thesis, we focus on feature relevance, relevance measures, and effective feature selection algorithms for microarray gene expression data.
Feature relevance measures are used to estimate the relevance between a feature (or feature subset) and the class label. Many measures exist in machine learning and data mining, and different measures suit different data. Some existing measures require a large number of samples and assume that the data follow a particular distribution; others require discrete data values. Microarray gene expression data clearly do not satisfy these requirements: the number of tissue samples is very limited and the data values are numerical. We therefore propose a method based on Grey Relational Analysis (GRA) to measure feature relevance and develop a gene ranking method based on GRA (called GR-GRA) for microarray gene expression data. GRA comes from grey theory; it requires little data and does not rely on the data distribution, which makes it well suited to microarray gene expression data.
Feature relevance is an important notion in feature selection, and many different definitions of it exist in the machine learning and data mining literature. However, the existing definitions are all qualitative, depend only on the probability distribution of the data, and are independent of both the relevance measure and the classifier. Such definitions lead to the following problems. Different relevance measures are usually based on different theories and have different properties, so under a definition that ignores the relevance measure, a feature that is relevant according to one measure may be irrelevant according to another, and we cannot determine whether it is relevant or irrelevant.
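
The abstract does not give the formulas used by GR-GRA, so the following is only a minimal sketch of gene ranking with the standard grey relational grade from grey theory, not the thesis's method. The function names, the min-max normalization, the use of the numeric class label as the reference sequence, and the distinguishing coefficient rho = 0.5 are all assumptions made for illustration.

```python
def min_max(values):
    """Scale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def grey_relational_grades(genes, labels, rho=0.5):
    """Grey relational grade of each gene against the class label.

    genes  -- list of expression profiles, one list of sample values per gene
    labels -- numeric class label per sample (the reference sequence, assumed)
    rho    -- distinguishing coefficient (0.5 is a common default, assumed here)
    """
    ref = min_max(labels)
    # Absolute difference between the reference and each normalized gene.
    deltas = [[abs(r - g) for r, g in zip(ref, min_max(gene))] for gene in genes]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every gene coincides with the reference after normalization.
        return [1.0 for _ in genes]
    # Grey relational coefficient per sample, averaged into one grade per gene.
    return [
        sum((d_min + rho * d_max) / (d + rho * d_max) for d in row) / len(row)
        for row in deltas
    ]


def rank_genes(genes, labels, rho=0.5):
    """Return gene indices ordered from highest to lowest grade."""
    grades = grey_relational_grades(genes, labels, rho)
    return sorted(range(len(genes)), key=lambda i: grades[i], reverse=True)


# Toy data (hypothetical): gene 0 follows the class label, gene 1 does not.
labels = [0, 0, 0, 1, 1, 1]
genes = [
    [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],
    [2.0, 0.5, 3.0, 1.0, 2.5, 0.7],
]
print(rank_genes(genes, labels))
```

In this plain form a gene that falls as the label rises receives a low grade, so a practical ranking would also need to account for down-regulated genes, for example by scoring each gene against both label codings.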
To overcome this problem, we propose new definitions of feature relevance that depend on the relevance measure, and develop an effective filter algorithm (called FRADM) for feature selection in microarray gene expression data. Comprehensive experiments show that FRADM is effective and efficient on microarray gene expression data.
Many studies show that when relevance is not tied to a specific classifier, the definition of relevance is in fact of little use: the relevance of a feature does not imply that it is useful for classification, and its irrelevance does not imply that it is useless. Moreover, different learning algorithms have different biases, so a feature that helps one algorithm may hurt another. We therefore study the effect of classifiers on feature relevance, propose new definitions of feature relevance that depend on the classifier, and develop a novel wrapper algorithm (called WR) for feature selection based on these definitions. Extensive experiments show that WR can improve classification accuracy to a very high level.
Finally, we abstract the newly proposed definitions into a generalized definition of feature relevance. Based on this generalized definition, we unify the two newly developed algorithms, FRADM and WR, in a single algorithmic framework.
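
The abstract does not describe how WR searches for features or which classifier it uses. As a rough illustration of the general wrapper idea, in which a feature counts as useful only if it improves the chosen classifier, here is a minimal sketch that assumes a greedy forward search, a nearest-centroid classifier, and leave-one-out accuracy as the score. All names and design choices below are hypothetical and not taken from the thesis.

```python
def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for held in range(len(samples)):
        # Sum the chosen features per class over all samples but the held-out one.
        sums, counts = {}, {}
        for i, (row, y) in enumerate(zip(samples, labels)):
            if i == held:
                continue
            total = sums.setdefault(y, [0.0] * len(features))
            for j, f in enumerate(features):
                total[j] += row[f]
            counts[y] = counts.get(y, 0) + 1
        point = [samples[held][f] for f in features]

        def sq_dist(y):
            # Squared distance from the held-out sample to the centroid of class y.
            return sum((p - s / counts[y]) ** 2 for p, s in zip(point, sums[y]))

        correct += min(sums, key=sq_dist) == labels[held]
    return correct / len(samples)


def forward_wrapper(samples, labels, max_features=5):
    """Greedily add the feature that most improves leave-one-out accuracy."""
    selected, best = [], 0.0
    candidates = set(range(len(samples[0])))
    while candidates and len(selected) < max_features:
        # Ties in accuracy are broken in favour of the lowest feature index.
        scored = [(loo_accuracy(samples, labels, selected + [f]), -f, f)
                  for f in candidates]
        acc, _, f = max(scored)
        if acc <= best:
            break  # no remaining feature helps this classifier
        selected.append(f)
        candidates.remove(f)
        best = acc
    return selected, best


# Toy data (hypothetical): rows are tissue samples, columns are genes.
samples = [
    [2.0, 1.0, 0.3],
    [0.5, 1.2, 0.9],
    [3.0, 0.9, 0.1],
    [1.0, 3.1, 0.8],
    [2.5, 2.9, 0.2],
    [0.7, 3.0, 0.5],
]
labels = [0, 0, 0, 1, 1, 1]
print(forward_wrapper(samples, labels))
```

Because the same leave-one-out score drives the selection, it is an optimistic estimate of accuracy on new samples; an honest evaluation would hold out a separate test set or nest the selection inside an outer cross-validation loop.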
We then analyze the respective strengths and weaknesses of FRADM and WR and propose a novel hybrid strategy that combines them into a hybrid algorithm (called HFW).
In summary, this thesis studies feature relevance and relevance measures in feature selection in depth, presents several comprehensive distinctions and definitions of feature relevance, proposes a new relevance measure suited to microarray gene expression data, and develops several effective feature selection algorithms for such data. The work has academic and practical value for advancing the theory and practice of feature selection in high-dimensional data.
Keywords/Search Tags: microarray gene expression data, classification, feature selection, feature relevance, feature relevance measure