Research On Gene Expression Data Analysis

Posted on: 2007-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: W G Zhou
Full Text: PDF
GTID: 2178360182996033
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is a new interdisciplinary field that integrates biology, applied mathematics and computer science. It is one of the current frontier fields of life science and natural science, and it will also be one of the core fields of natural science in the twenty-first century. Using computers as tools, it tries to understand and reveal the biological meaning behind data by obtaining, storing, processing, retrieving and analyzing biological data. The main current research topics include sequence alignment, single nucleotide polymorphisms, gene finding, prediction of protein structures and functions, analysis of gene expression data, and recognition of various signal sites, conserved sequences and structural features.

Computational intelligence covers broad research fields. It mainly consists of neural networks, evolutionary computation, fuzzy systems, swarm intelligence, immune algorithms and so on. DNA computing, artificial life, rough set theory and support vector machines can also be regarded as part of the research domain of computational intelligence. This paper introduces relevant models and algorithms from computational intelligence into bioinformatics research, proposes some new methods for solving hot and hard problems in gene expression data analysis, and obtains improved results.

In order to extract knowledge from the huge amount of gene expression data produced in microarray experiments, clustering analysis of microarray data has become an important task. The purpose is to identify groups of genes with similar expression patterns and hence similar functions. Several clustering algorithms have been proposed for microarray data analysis, such as hierarchical clustering, self-organizing maps, and k-means. There are also some graph-theoretic approaches such as HCS, CAST, CLICK, and MST. However, these conventional methods often converge slowly and fail to find the optimum solution.
In this paper, we first introduce a novel evolutionary computation method, the so-called quantum-inspired evolutionary algorithm (QEA), then improve it and apply it for the first time to the minimum sum-of-squares clustering problem. The improved algorithm uses a new representation form and adds an additional mutation operation. We present experimental evidence that our method is highly effective and produces better solutions than the conventional k-means and self-organizing maps clustering algorithms, even with a small population. In order to observe the convergence rate, curves of fitness change are given for the HL-60 and Yeast data sets. As the figures show, QEA always converges faster than the other two algorithms. Although the solutions have similar fitness values, the clusters produced by different algorithms may differ drastically, so the distance between the best solutions found by the two conventional algorithms and by the QEA algorithm is also given. As the results indicate, the proposed algorithm is able to find the global optimum solution.

After clustering analysis, genes with similar expression patterns have been allocated to the same cluster. These genes are often called coexpressed genes. A widely used strategy for identifying regulatory binding sites rests on the assumption that coexpressed genes may share common regulatory elements. Many computational methods, such as Consensus, the Gibbs motif sampler, BioProspector and ANN-SPEC, have successfully applied this strategy to find regulatory elements in lower organisms such as bacteria and yeast. However, even the general principles governing the locations of DNA regulatory elements in higher eukaryotic genomes remain unknown, and existing methods often converge on sequence motifs that are not biologically relevant. A richer set of computational methods is therefore necessary for an efficient search. In this paper, we first formulate the identification of transcription factor binding site (TFBS) motifs as a combinatorial optimization problem.
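To make the combinatorial view concrete, the following is a minimal sketch, not the thesis's own formulation. It assumes a candidate solution is one motif start position per upstream sequence and that solutions are scored by the information content of the aligned sites against a uniform background. The abstract does not state the actual fitness function, so the function names, the scoring choice and the toy sequences are all illustrative assumptions. The exhaustive search only shows how the search space is defined; it stands in for, and is not, the hybrid particle swarm optimization.

```python
from collections import Counter
from itertools import product
from math import log2


def information_content(seqs, starts, width, background=0.25):
    """Score a candidate solution (one start position per sequence).

    Assumed objective: sum over motif columns of the relative entropy
    between observed base frequencies and a uniform background.
    """
    sites = [seq[pos:pos + width] for seq, pos in zip(seqs, starts)]
    n = len(sites)
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        total += sum((c / n) * log2((c / n) / background)
                     for c in counts.values())
    return total


def best_starts_exhaustive(seqs, width):
    """Enumerate every combination of start positions (toy sizes only).

    The number of combinations grows exponentially with the number of
    sequences, which is why a heuristic search is attractive in practice.
    """
    ranges = [range(len(seq) - width + 1) for seq in seqs]
    return max(product(*ranges),
               key=lambda starts: information_content(seqs, starts, width))
```

On three hypothetical sequences such as `["GGATGCTT", "ATGCGGTT", "TTGGATGC"]` with `width=4`, the only 4-mer shared by all three is `ATGC`, so the search selects the start positions at which it occurs.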
A hybrid particle swarm optimization algorithm is then proposed to solve this problem. We suggest two operators for intensifying the local search and one recombination mutation operator for increasing the variance of the population. The simulation results show that the proposed algorithm can precisely identify the TFBS motif from the upstream regions of coexpressed genes regulated by Oct, without pre-alignment and without using a Gibbs sampling process. Two near-optimum solutions are also given; they may well represent novel TFBSs that have not yet been discovered. The advantage of this algorithm is that it runs very fast and can obtain better solutions with a small population size and a small search space. This study demonstrates the feasibility of heuristic methods and provides a new perspective for TFBS motif identification.

In addition, mining gene expression data can also yield information about cancer and identify many cancer-related genes. Cancer classification has therefore become an important issue in microarray data analysis. However, microarray data often consist of a small number of samples and a large number of genes. The ultra-high dimensionality of gene expression data makes it necessary to develop effective feature selection methods that select a few genes most relevant to the disease, thereby reducing the computational cost and improving the classification accuracy. Mutual information has recently been proposed for feature selection, but the feature set selected by this method often contains redundancy. Attribute reduction in rough set theory provides a feasible way to deal with redundancy without reducing the information contained. In this paper, we integrate mutual information and rough set theory and propose a novel feature selection method called MIRS. First, mutual information is used to select the top-ranked genes, those with the highest mutual information, from each data set. Then rough set theory is applied to remove the redundancy among these selected genes.
Finally, the effectiveness of MIRS is evaluated by the classification accuracy of two SVM classifiers. Binary particle swarm optimization is proposed, for the first time, as an attribute reduction algorithm for rough sets. Experimental results on two cancer microarray data sets show that the proposed method is superior to some classical feature selection methods, such as principal components and correlation coefficients, and consistently achieves higher classification accuracy with fewer features.
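The first MIRS step, ranking genes by mutual information with the class label, can be sketched as follows. This is only an illustration under stated assumptions: expression values are taken to be already discretized (the abstract does not say how), the function names and the `top_k` parameter are hypothetical, and the rough set reduction and SVM evaluation steps are not shown.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences
    of equal length, estimated from empirical frequencies."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def rank_genes_by_mi(expression, labels, top_k):
    """Return the top_k gene names with the highest mutual information.

    expression: dict mapping gene name -> list of discretized
                expression levels, one per sample (assumed input format).
    labels:     list of class labels, one per sample.
    """
    scores = {gene: mutual_information(levels, labels)
              for gene, levels in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A gene whose discretized levels match the class labels exactly attains the maximum score (the label entropy), while a gene independent of the labels scores zero. Note that two highly correlated genes would both rank high, which is the redundancy the rough set step is meant to remove.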
Keywords/Search Tags:Expression