Research On Feature Selection And Classification For DNA Microarray

Posted on: 2010-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: M K Tan
GTID: 2178360275982053
Subject: Control Science and Engineering
Abstract/Summary:
This thesis is supported by the Excellent Youth Foundation of Hunan Province (06JJ1010): Research on the gene selection and tumor detection based on DNA Microarray. In brief, the thesis concentrates on two closely related, challenging tasks in DNA microarray data analysis: (1) the selection of the optimal classifier; and (2) the identification of the optimal feature (gene) subset.

Classification plays a key role in both pattern recognition and data mining. Many classifiers have been developed in recent years, such as the support vector machine (SVM), k-nearest neighbor (k-NN), C4.5, and the multi-layer perceptron. Because microarray data usually have a small sample size, SVM, which is designed to solve classification problems with few training samples while retaining good generalization ability, may be a good choice. However, the performance of SVM is strongly affected by certain parameters, including the kernel parameters and the soft margin parameter C. To obtain good generalization, it is important to choose sufficiently good parameter values for the particular learning problem, a task known as model selection in pattern recognition. This thesis presents a hybrid strategy that combines a comprehensive learning particle swarm optimizer (CLPSO) with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method to tune the parameters of SVM effectively. The hybrid CLPSO-BFGS algorithm, which combines the global search ability of CLPSO with the local search ability of BFGS, can effectively compute the multiple global optima of multimodal functions. Chapter 2 introduces the general framework of the hybrid PSO-BFGS strategy together with numerical experiments. Chapter 3 presents the details of SVM model selection based on the CLPSO-BFGS algorithm.
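The global-then-local idea behind the hybrid strategy can be illustrated with a minimal sketch. It is not the thesis's CLPSO-BFGS implementation: it substitutes a plain global-best PSO for CLPSO, uses a toy multimodal function (Rastrigin) in place of an SVM validation error, and approximates gradients by finite differences. All function names and constants below are hypothetical choices made for illustration.

```python
import math
import random

def rastrigin(x):
    """Toy multimodal objective (stand-in for a validation error); minimum 0 at the origin."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)

def num_grad(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

def pso(f, dim, bounds, n_particles=30, iters=100, seed=0):
    """Plain global-best PSO (assumption: simpler than CLPSO) for the global search stage."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def bfgs(f, x0, iters=100, tol=1e-8):
    """BFGS with a backtracking (Armijo) line search for the local refinement stage."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # inverse Hessian estimate
    g = num_grad(f, x)
    for _ in range(iters):
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * pi for gi, pi in zip(g, p))
        t, fx = 1.0, f(x)
        while t > 1e-12 and f([xi + t * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * pi for xi, pi in zip(x, p)]
        g_new = num_grad(f, x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # update only when the curvature condition holds
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += (-rho * (s[i] * Hy[j] + Hy[i] * s[j])
                                + (rho * rho * yHy + rho) * s[i] * s[j])
        x, g = x_new, g_new
    return x, f(x)

def hybrid_minimize(f, dim, bounds, seed=0):
    """Global search first, then local refinement; return whichever point is better."""
    x_pso, f_pso = pso(f, dim, bounds, seed=seed)
    x_loc, f_loc = bfgs(f, x_pso)
    return (x_loc, f_loc) if f_loc <= f_pso else (x_pso, f_pso)
```

Calling `hybrid_minimize(rastrigin, 2, (-5.12, 5.12))` runs both stages on the toy problem. For SVM model selection, the objective would instead map candidate parameters (for example the soft margin parameter C and a kernel parameter) to an estimate of the generalization error; the abstract does not specify which estimate the thesis uses.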
The experimental results show that the proposed method can efficiently tune the parameters of both L1-SVM and L2-SVM and achieve performance competitive with other optimized classifiers.

Apart from the classifier, high dimensionality is another factor that strongly influences classification performance. Many real datasets in pattern recognition applications contain a large number of noisy and redundant features that are irrelevant to the intrinsic characteristics of the dataset, and these irrelevant features may seriously degrade learning performance. Hence feature selection, which aims to select the most informative features from the original dataset, plays an important role in data mining tasks such as image recognition and microarray data analysis. A microarray dataset usually contains thousands of genes (most of which have been shown to be redundant) but only a small number of samples. Feature selection, here also known as gene selection, is therefore particularly important for microarray data analysis.

In this thesis, we first present a new gene selection and tissue classification method based on SVM and the genetic algorithm (GA). First, the Wilcoxon test is used as a coarse gene selection step to remove most of the irrelevant genes. Then a fine selection, based on the classification ability of each single gene with SVM, is conducted to obtain the final gene subset. Finally, the genetic algorithm is used to optimize the SVM model on the final gene subset. The experimental results on the Leukemia and Breast Cancer datasets show that the proposed strategy is effective and competitive with previous methods.

Furthermore, this thesis proposes a general feature score for measuring the importance of features (or genes), based on the recently developed graph embedding framework for manifold learning.
We first prove that the recently developed feature scores can be seen as direct applications of the graph preserving criterion. Then, based on Marginal Fisher Analysis (MFA), a new feature score for supervised feature selection, named the MFA score, is developed. By using manifold learning and the maximum margin idea, the MFA score can successfully identify nonlinear discriminative features. In addition, the proposed general feature score can be easily applied to unsupervised, semi-supervised, and multiclass feature selection problems. The experimental results on both toy datasets and real-world datasets verify the good performance of the proposed method.
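The coarse gene selection step described above (a Wilcoxon test used to discard most irrelevant genes before the SVM-based fine selection) might look roughly like the following sketch. It assumes a two-class problem with labels 0 and 1, uses the normal approximation of the rank-sum statistic without tie correction, and keeps a fixed number of top-ranked genes; the abstract does not state the thesis's actual threshold or variant, and all names here are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(expr, labels):
    """Normal-approximation z statistic of the Wilcoxon rank-sum test for one gene."""
    ranks = average_ranks(expr)
    n1 = sum(1 for y in labels if y == 1)
    n2 = len(labels) - n1
    w = sum(r for r, y in zip(ranks, labels) if y == 1)  # rank sum of class 1
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd

def coarse_select(data, labels, top_k):
    """data[g] holds the expression of gene g across samples; keep the top_k genes by |z|."""
    scores = [abs(rank_sum_z(row, labels)) for row in data]
    order = sorted(range(len(data)), key=lambda g: -scores[g])
    return order[:top_k]
```

Given an expression matrix stored gene by gene, `coarse_select(data, labels, top_k)` returns the indices of the genes whose expression ranks differ most between the two classes; those genes would then go on to the fine selection stage.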
Keywords/Search Tags:Particle swarm optimization (PSO), BFGS method, Support vector machine, Kernel method, Model selection, Dimension reduction, Gene selection, Graph embedding