Study On Feature Selection Method For Classification Of Gene Expression Data

Posted on: 2019-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Ye
GTID: 2428330548456586
Subject: Signal and Information Processing
Abstract/Summary:
A gene expression profile is a collection of gene expression information. Studies have shown that, at the molecular level, cancer usually manifests as a change in gene expression levels. Using gene expression data to identify genes closely related to cancer can therefore have a great influence on cancer diagnosis and treatment. Gene expression data are typically high-dimensional with small sample sizes, which poses a challenge to traditional machine learning methods. Consequently, a large number of irrelevant genes must be removed from the thousands measured before a small number of pathogenic genes can be distinguished, and feature selection is an effective way to do this. In this thesis, four public microarray datasets are used as experimental subjects; feature selection algorithms are applied to screen out genes that are differentially expressed in disease, and classification performance is used as the evaluation index for the gene selection algorithms. Focusing on the gene selection problem for microarray datasets, this thesis makes the following contributions:

1) The values in gene expression data represent the expression levels of genes, and adjacent values have no sequential relationship. In addition, noise is often introduced while the data are being produced. For these reasons, a discretization method is introduced for data preprocessing. Comparison with other preprocessing methods verifies that discretizing gene expression data yields better classification accuracy.

2) For high-dimensional, small-sample data, filter-based feature selection algorithms can obtain the leading features quickly and effectively. However, the key features obtained by different filter methods tend to differ considerably and often give low classification stability. This thesis therefore proposes an ensemble feature selection method, GSEF (Gene Selection based on Ensemble Filter), built on the idea of ensemble learning. The experimental results show that the method achieves better classification performance than individual feature selection algorithms, and that classification stability is improved as well.

3) GSEF can quickly remove irrelevant genes, but it cannot eliminate redundant genes. To remove redundant genes, this thesis proposes a multi-stage feature selection algorithm, SC-SVM-RFE, which combines GSEF with spectral clustering and SVM-RFE. The algorithm was applied to four cancer gene expression datasets and compared with classical methods using three classifiers (SVM, KNN, NB). The experimental results show that the feature subsets it selects give better classification performance than those of SVM-RFE and GSEF, especially when the number of selected genes is small.
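The ensemble-filter idea behind GSEF can be illustrated with a minimal sketch. The abstract does not say which filters GSEF combines or how their outputs are aggregated, so everything below is an assumption made for illustration: two simple filter scores (a signal-to-noise ratio and a Fisher-style score), binary class labels 0/1, and aggregation by mean rank. The function names `snr_score`, `fisher_score`, `ranks`, and `ensemble_filter` are hypothetical and are not taken from the thesis.

```python
from statistics import mean, pvariance

EPS = 1e-12  # guards against division by zero for constant genes


def _split(values, labels):
    """Split one gene's values by class (labels are assumed to be 0 or 1)."""
    a = [v for v, c in zip(values, labels) if c == 0]
    b = [v for v, c in zip(values, labels) if c == 1]
    return a, b


def snr_score(values, labels):
    """Signal-to-noise filter: |mean0 - mean1| / (std0 + std1)."""
    a, b = _split(values, labels)
    spread = pvariance(a) ** 0.5 + pvariance(b) ** 0.5
    return abs(mean(a) - mean(b)) / (spread + EPS)


def fisher_score(values, labels):
    """Fisher-style filter: (mean0 - mean1)^2 / (var0 + var1)."""
    a, b = _split(values, labels)
    return (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b) + EPS)


def ranks(scores):
    """Convert scores to ranks, where rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, gene in enumerate(order):
        result[gene] = position
    return result


def ensemble_filter(X, y, filters, k):
    """Return indices of the k genes with the best mean rank across filters.

    X is a list of samples, each a list of gene expression values;
    y holds the class label of each sample.
    """
    n_genes = len(X[0])
    columns = [[row[g] for row in X] for g in range(n_genes)]
    per_filter_ranks = [ranks([f(col, y) for col in columns]) for f in filters]
    mean_rank = [mean(r[g] for r in per_filter_ranks) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: mean_rank[g])[:k]


if __name__ == "__main__":
    # Toy data (made up): 6 samples, 4 genes, 3 samples per class.
    X_demo = [
        [1.0, 5.0, 2.0, 3.0],
        [1.2, 3.0, 2.5, 3.1],
        [0.9, 4.0, 1.5, 2.9],
        [5.0, 4.5, 3.0, 3.0],
        [5.1, 3.5, 3.5, 3.2],
        [4.8, 4.2, 2.8, 2.8],
    ]
    y_demo = [0, 0, 0, 1, 1, 1]
    print(ensemble_filter(X_demo, y_demo, [snr_score, fisher_score], k=2))
```

Averaging ranks rather than raw scores keeps filters with different scales comparable, which is one plausible way to reduce the disagreement between individual filters described in item 2). Note that a rank-based ensemble like this one still scores each gene independently, so it does not address redundancy between genes; that is the gap the abstract says SC-SVM-RFE is meant to fill.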
Keywords/Search Tags:Feature selection, High-dimensional small sample, Gene expression profile