Study On Selection For Feature Gene Subset In Microarray Expression Profiles Based On A SVM And GA Hybrid Algorithm

Posted on: 2007-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: W Xiong
Full Text: PDF
GTID: 2178360182996346
Subject: Computer application technology
Abstract/Summary:
Microarray technology emerged from advances in the life sciences and information technology and has become an efficient means of obtaining information about biological molecules. It is one of the most widely used technologies in bioinformatics. One important application is measuring the activity of different genes in a cell sample, and the advent of the microarray makes gene-based diagnosis and treatment possible. Microarray applications rest on analyzing the gene data produced by the experiment.

The invention of the microarray is one technological breakthrough behind this trend. A microarray is a small chip (about one and a half inches wide) that contains an array of chemical reaction spots. The chemicals in each spot are designed to react with particular chemicals in the test sample. With suitable dyes, the amount of reacted chemical in each spot can be quantified by measuring the light intensity at specific frequencies. The set of quantities obtained from the whole array is called the expression profile of the sample. Depending on the technology employed, each quantity in a gene expression profile represents either an absolute expression level or a relative expression ratio. Because the experimental procedure is complex and has many steps, gene expression profiles may contain missing and noisy values, so proper data analysis and data selection are essential.

Current applications of microarrays focus on the precise classification or discovery of biological types, for example tumor versus normal phenotypes in cancer research.
Several challenging scientific tasks of the post-genomic era, such as identifying the genes behind complex diseases from genome-wide gene expression profiles and then building the corresponding gene networks, have been largely overlooked for lack of an efficient analysis approach. To address these problems, this thesis proposes a feature selection method, based on a hybrid of the Support Vector Machine (SVM) and the genetic algorithm (GA), for finding a feature gene subset. The major contents are summarized as follows:

(1) The background and data characteristics of the microarray, current research, and practical applications are reviewed.
Microarray technology is a new technique for gene analysis. Owing to its low cost, high sensitivity and high throughput, the microarray is one of the important tools for functional genomics and is clearly superior to the earlier research model of studying a single gene at a time. With microarrays, the network mechanisms of gene expression regulation can be studied at the level of the whole genome.

(2) The main feature selection algorithms are described, the Support Vector Machine algorithms are derived, and the characteristics of the genetic algorithm are introduced.
Chapter 2 introduces the relevant definitions and the main algorithms for the feature selection problem, such as the simulated annealing algorithm. It also introduces the VC dimension, the bounds on the generalization ability of a learning machine, and the structural risk minimization principle from statistical learning theory, and derives the Support Vector Machine algorithms in detail.
In addition, we derive the computational process of the genetic algorithm and introduce its key concepts, for example the fitness function, the selection operator, the crossover operator and the mutation operator.

(3) A hybrid algorithm is proposed for selecting the feature gene subset in microarray expression profiles.
Traditional statistical methods do not perform well when the samples are few and the data are high-dimensional. In chapter 3, we first use a Least Squares Support Vector Machine to find an initial feature gene subset, filtering out most of the genes unrelated to the disease according to a chosen significance level, gene importance and classification efficiency. On that basis, we apply an improved genetic algorithm that selects features according to their contribution to classification. In this method, the crossover and mutation operators of the genetic algorithm are modified so that the number of feature genes in the subset can be controlled during the genetic operations, information entropy is used as the separability criterion, and the selected feature subset is then evaluated with a Support Vector Machine and leave-one-out cross-validation.

(4) The proposed feature selection algorithm is applied to microarray expression data.
In chapter 4, we apply the proposed feature selection algorithm to microarray expression data (the colon dataset and the leukemia dataset) and evaluate the feature gene subsets obtained under different conditions. The results of the hybrid algorithm are compared with those of other algorithms. Numerical results show that the proposed hybrid algorithm is effective. The Support Vector Machine has a simple structure and good performance, in particular good generalization ability, while the GA offers implicit parallelism, global search for the optimum and ease of operation.
The algorithm proposed in this thesis therefore not only achieves good classification efficiency but also identifies important genes that are related to disease.
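
The abstract says that the crossover and mutation operators are modified so that the number of feature genes in a subset can be controlled, but it does not give the operators themselves. The following is only a minimal sketch of one way such size-preserving operators could look, assuming each chromosome is a set of exactly k gene indices. The names `crossover`, `mutate`, `n_genes`, `k` and `rate`, and the values used in the demo, are illustrative assumptions and are not taken from the thesis.

```python
import random


def crossover(parent_a, parent_b, k, rng):
    """Size-preserving crossover (illustrative, not the thesis operator).

    The child keeps every gene shared by both parents, then is filled up
    to exactly k genes with genes that occur in only one parent.
    Both parents are assumed to contain exactly k genes.
    """
    a, b = set(parent_a), set(parent_b)
    child = a & b
    pool = sorted((a | b) - child)  # sorted so results are reproducible
    rng.shuffle(pool)
    child.update(pool[: k - len(child)])
    return frozenset(child)


def mutate(subset, n_genes, rate, rng):
    """Swap mutation (illustrative, not the thesis operator).

    Each selected gene is, with probability `rate`, replaced by a gene
    that is not currently selected, so the subset size never changes.
    """
    selected = set(subset)
    unselected = [g for g in range(n_genes) if g not in selected]
    rng.shuffle(unselected)
    for gene in sorted(subset):
        if unselected and rng.random() < rate:
            selected.remove(gene)
            selected.add(unselected.pop())
    return frozenset(selected)


# Demo with arbitrary values: 2000 candidate genes, subsets of 10 genes.
rng = random.Random(0)
n_genes, k = 2000, 10
a = frozenset(rng.sample(range(n_genes), k))
b = frozenset(rng.sample(range(n_genes), k))
child = mutate(crossover(a, b, k, rng), n_genes, rate=0.1, rng=rng)
print(len(child))  # prints 10: the subset size stays at k
```

In a full genetic algorithm, the fitness of each subset would be computed separately, for instance from the leave-one-out accuracy of an SVM trained on the selected genes as the abstract describes. That step is left out here because the abstract does not specify the fitness function.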
Keywords/Search Tags:Microarray