SVM Based Research On Feature Selection Method For Gene Expression Data

Posted on:2009-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:X D Zhang

GTID:2178360245463592

Subject:Signal and Information Processing

Abstract/Summary:

Microarray gene expression data typically consists of a small number of samples and a large number of genes. This very high dimensionality makes effective feature selection necessary: selecting the few genes most relevant to a disease reduces computation cost and improves classification accuracy. Several feature selection methods have already been applied to gene expression data, such as sequential forward selection, genetic algorithms (GA), and S2N (signal-to-noise ratio).

The support vector machine (SVM) is a relatively new machine learning method based on statistical learning theory. By applying structural risk minimization (SRM), SVM copes well with small-sample problems. Furthermore, a kernel function maps the low-dimensional original space to a high-dimensional feature space, which turns nonlinear problems into linear ones and makes the algorithm easy to implement.

Feature gene selection and tumor sample classification are among the main challenges of microarray gene expression analysis. This thesis improves tumor sample classification of gene expression data in two respects: the classification algorithm and the feature gene selection method. The theory and application of S2N, the K-means clustering algorithm, the support vector machine, k-fold cross-validation, and SMO (Sequential Minimal Optimization) are studied on one typical gene expression data set. Two feature selection methods are used to select feature genes: an improved S2N method, and a combination of K-means and improved S2N. An SVM is then used as the classifier for tumor samples. In SVM training, SMO and the new kernel functions ERBF and KMOD are applied. k-fold cross-validation is used to evaluate the classifier, with the aim of improving the accuracy and speed of classification. The LIBSVM software is also used to classify the samples.

The classification model has three steps. First, feature genes are extracted from the original leukaemia gene expression data by improved S2N, or by the combination of K-means and improved S2N, to reduce the dimension of the samples. Then the selected data set is normalized by the min-max method. Finally, an SVM classifier is built and the classification results are analyzed. The experimental results show that the proposed methods are feasible and have some practical value.
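
The first two steps of this model can be sketched as follows. This is a minimal illustration only: it assumes the standard two-class signal-to-noise score, |mean1 - mean0| / (std1 + std0), because the abstract does not define the improved S2N variant. Ranking by absolute value, the function names, and the toy data are all hypothetical choices made for the example. The SVM step is left out; it would be handled by a library such as LIBSVM.

```python
from statistics import mean, stdev


def s2n_scores(samples, labels):
    """Return the absolute signal-to-noise score of every gene.

    samples: list of rows, one row of expression values per sample.
    labels:  list of 0/1 class labels, one per sample.
    Each class needs at least two samples so that stdev() is defined.
    """
    pos = [row for row, y in zip(samples, labels) if y == 1]
    neg = [row for row, y in zip(samples, labels) if y == 0]
    scores = []
    for g in range(len(samples[0])):
        a = [row[g] for row in pos]
        b = [row[g] for row in neg]
        spread = stdev(a) + stdev(b)
        # A gene that is constant in both classes carries no signal;
        # score it 0 instead of dividing by zero.
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def select_top_genes(samples, labels, k):
    """Return the column indices of the k highest-scoring genes."""
    scores = s2n_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:k])


def min_max_normalize(samples):
    """Rescale every column (gene) to the range [0, 1]."""
    columns = list(zip(*samples))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]


# Toy data (assumed): 4 samples x 3 genes. Gene 0 separates the classes,
# gene 1 does not, and gene 2 is constant.
X = [
    [1.0, 5.0, 2.0],
    [2.0, 4.0, 2.0],
    [8.0, 5.0, 2.0],
    [9.0, 4.0, 2.0],
]
y = [0, 0, 1, 1]

keep = select_top_genes(X, y, k=1)
X_small = [[row[g] for g in keep] for row in X]
X_scaled = min_max_normalize(X_small)
```

On this toy data only gene 0 is kept, and its values are rescaled to [0, 1] before they would be passed to the classifier.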

Keywords/Search Tags:

Gene Expression Data, Support Vector Machine, Feature Selection, S2N (Signal to Noise Ratio), SMO (Sequential Minimal Optimization)

Related items

1	Support Vector Machine And Its Application In Gene Expression Data
2	Gene Selection And Cancer Classification Based On Optimization Algorithm And Support Vector Machine
3	The Application Research Of Support Vector Machine In Non-spherical Distribution Data Set And Tumor Gene
4	Tumor Classification Based On Gene Expression Studies
5	Data Analysis Of Cancer Gene Expression Based On SVM-RFE Algorithm
6	Study On SVMs-based Classification Of Gene Expression Data
7	Mining Method Based On Gene Expression Profiling Data
8	The Research And Optimization On Support Vector Machines Algorithm
9	Acceleration And Application Of Support Vector Machines
10	Research On Significant Genes Selection Method Based On PSO Algorithm