SVM-Based Research on Feature Selection Methods for Gene Expression Data

Posted on: 2009-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: X D Zhang
GTID: 2178360245463592
Subject: Signal and Information Processing
Abstract/Summary:
Microarray gene expression data typically consist of a small number of samples and a very large number of genes. This extreme dimensionality makes effective feature selection necessary: selecting the few genes most relevant to a disease reduces computation cost and improves classification accuracy. Several feature selection methods have already been applied to gene expression data, including Sequential Forward Selection, genetic algorithms (GA), and the signal-to-noise ratio (S2N).
Support vector machine (SVM) is a machine learning method based on statistical learning theory. By applying structural risk minimization (SRM), SVM handles small-sample problems well. In addition, a kernel function maps the low-dimensional original space to a high-dimensional feature space, so that nonlinear problems become linear ones, which makes the algorithm easy to implement.
Feature gene selection and tumor sample classification remain among the main challenges in analyzing microarray gene expression data. This thesis improves tumor sample classification in two respects: the classification algorithm and the feature gene selection method. The theory and application of S2N (Signal to Noise Ratio), the K-means clustering algorithm, support vector machines, k-fold cross-validation, and SMO (Sequential Minimal Optimization) are studied on one typical gene expression data set. Two feature selection methods are used to select feature genes: an improved S2N method, and a combination of K-means with improved S2N. An SVM is then used as the classifier for tumor samples. The SVM is trained with SMO, and two additional kernel functions, erbf and kmod, are applied. k-fold cross-validation is used to evaluate the classifier, with the aim of improving both the accuracy and the speed of classification. The LIBSVM software is also used to classify the samples.
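The abstract names the erbf and kmod kernels and k-fold cross-validation without giving formulas. The following is a minimal sketch in plain Python, assuming the commonly cited forms of these kernels: an exponential RBF that uses the unsquared Euclidean distance, and KMOD with a normalizing constant chosen so that k(x, x) = 1. The function names, default parameters, and the contiguous fold layout are illustrative assumptions and are not taken from the thesis.

```python
import math

def sq_dist(x, z):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def erbf_kernel(x, z, sigma=1.0):
    """Exponential RBF (assumed form): exp(-||x - z|| / (2 * sigma^2))."""
    return math.exp(-math.sqrt(sq_dist(x, z)) / (2.0 * sigma ** 2))

def kmod_kernel(x, z, gamma=1.0, sigma=1.0):
    """KMOD (assumed form): a * (exp(gamma / (||x - z||^2 + sigma^2)) - 1).

    The constant a is chosen here so that k(x, x) = 1.
    """
    a = 1.0 / (math.exp(gamma / sigma ** 2) - 1.0)
    return a * (math.exp(gamma / (sq_dist(x, z) + sigma ** 2)) - 1.0)

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size

# Each sample lands in exactly one test fold.
folds = list(k_fold_indices(7, 3))
```

In practice the sample order would usually be shuffled (or stratified by class) before splitting, since microarray data sets are small and class proportions matter.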
The classifier model is built in three steps. First, feature genes are extracted from the original leukaemia gene expression data, using either improved S2N or the combination of K-means and improved S2N, to reduce the dimension of the samples. Next, the selected data set is normalized by the min-max method. Finally, an SVM classifier is built and the classification results are analyzed. The experimental results show that the methods proposed in this thesis are feasible and have some practical value.
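The first two steps of this pipeline can be sketched as follows. This is an illustration only: it uses the standard S2N score, (mean of class 1 minus mean of class 2) divided by the sum of the two class standard deviations, and not the thesis's improved variant, which the abstract does not define. The function names, the use of population standard deviation, and the toy data are assumptions. The SVM training step is omitted; the selected and normalized matrix would be passed to a library such as LIBSVM.

```python
from statistics import mean, pstdev

def s2n_scores(samples, labels):
    """Standard S2N per gene: (mu_pos - mu_neg) / (sd_pos + sd_neg)."""
    pos = [row for row, y in zip(samples, labels) if y == 1]
    neg = [row for row, y in zip(samples, labels) if y != 1]
    scores = []
    for g in range(len(samples[0])):
        a = [row[g] for row in pos]
        b = [row[g] for row in neg]
        denom = pstdev(a) + pstdev(b)
        # A gene that is constant within both classes gets score 0 here.
        scores.append((mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores

def select_top_genes(scores, k):
    """Indices of the k genes with the largest absolute S2N score."""
    ranked = sorted(range(len(scores)),
                    key=lambda g: abs(scores[g]), reverse=True)
    return sorted(ranked[:k])

def min_max_normalize(samples):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    cols = list(zip(*samples))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp > 0 else 0.0
             for v, lo, sp in zip(row, lows, spans)]
            for row in samples]

# Hypothetical toy data: 4 samples x 3 genes, two classes.
X = [[1.0, 5.0, 2.0],
     [2.0, 5.5, 9.0],
     [8.0, 5.2, 3.0],
     [9.0, 5.1, 8.0]]
y = [1, 1, -1, -1]

scores = s2n_scores(X, y)
keep = select_top_genes(scores, 1)            # gene 0 separates the classes
reduced = [[row[g] for g in keep] for row in X]
normalized = min_max_normalize(reduced)
```

When this is combined with cross-validation, computing the S2N scores and the min-max bounds on the training folds only avoids leaking information from the held-out samples.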
Keywords/Search Tags: Gene Expression Data, Support Vector Machine, Feature Selection, S2N (Signal to Noise Ratio), SMO (Sequential Minimal Optimization)