Research On Orthologs Cluster In Genomic Sequences Using Biclustering Algorithm

Posted on: 2007-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: L P Wang
Full Text: PDF
GTID: 2178360182996153
Subject: Computer application technology
Abstract/Summary:
Biclustering is a new direction in cluster analysis for bioinformatics, and its development has opened a new research perspective. The background part of this thesis briefly introduces the analysis of gene expression profiles, genomic sequences and phylogeny to lay the groundwork for the related research. The thesis then gives a comprehensive overview of the preliminary theoretical framework of biclustering, starting with a brief explanatory introduction to clustering, and introduces some common biclustering algorithms and models. Next, it gives a detailed introduction to homologous genes in genomic sequences, especially the concept of orthologs, how they arise, their characteristics, and the common prediction methods that will be used.

The thesis is concerned not only with the technical implementation of biclustering but also with its specific application to the characteristics of orthologs in genomic sequences. Based on an analysis of the characteristics of the biclustering algorithm and of orthologs, some practical applications are discussed, which give deeper insight into the research of this thesis.

Clustering is a popular data mining technique for extracting information from gene expression profiles. Each row of a gene expression profile corresponds to a gene, each column corresponds to a sample, and each entry is called an expression value. Biclustering combines clustering along these two dimensions: it clusters rows and columns simultaneously, so that each resulting cluster consists of a subset of rows and a subset of columns. The concept of biclustering was first applied to gene expression profile analysis by Cheng and Church in 2000.

The first approach assumes that each value in a cluster is the sum of three components: the background level, the row effect and the column effect. The algorithms try to identify biclusters that deviate only slightly from this perfect cluster model.
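The additive model, and the deviation of a submatrix from it, can be illustrated with a minimal Python sketch. It assumes the standard Cheng and Church definition of the mean squared residue, H(I, J) = (1 / (|I||J|)) * sum over i in I, j in J of (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ, a_Ij and a_IJ are the row mean, column mean and overall mean of the submatrix. The function name `mean_squared_residue` and the toy numbers are invented for illustration and are not taken from the thesis.

```python
def mean_squared_residue(matrix, rows, cols):
    """H score of the submatrix picked out by the index lists `rows` and `cols`."""
    n_r, n_c = len(rows), len(cols)
    # Row means a_iJ, column means a_Ij and overall mean a_IJ of the submatrix.
    row_mean = {i: sum(matrix[i][j] for j in cols) / n_c for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n_r for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / n_r
    total = 0.0
    for i in rows:
        for j in cols:
            residue = matrix[i][j] - row_mean[i] - col_mean[j] + all_mean
            total += residue ** 2
    return total / (n_r * n_c)


# Toy data (invented): each entry = background + row effect + column effect.
background = 5.0
row_effect = [0.0, 1.0, -2.0]
col_effect = [0.0, 3.0, 0.5, -1.0]
perfect = [[background + r + c for c in col_effect] for r in row_effect]

# Copy the perfect matrix and break the additive pattern in a single cell.
noisy = [row[:] for row in perfect]
noisy[1][2] += 4.0

h_perfect = mean_squared_residue(perfect, [0, 1, 2], [0, 1, 2, 3])
h_noisy = mean_squared_residue(noisy, [0, 1, 2], [0, 1, 2, 3])
print(h_perfect, h_noisy)
```

In this sketch the purely additive matrix scores (numerically) zero, while the perturbed copy scores strictly above zero.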
The deviation is measured by the mean squared residue score (the H score). For a perfect cluster that has zero deviation from the model, the H score is zero. In terms of gene expression profiles, this occurs when all the genes of a cluster have exactly the same rise-and-fall pattern of expression across the relevant samples. In general, the smaller the H score, the more similar the expression patterns.

Orthologs are genes in different species that originate from a single gene in the last common ancestor of these species. Orthologous genes are thought to share similar functions, to be regulated by similar biochemical pathways and to play similar roles in different species. Orthologous genes are therefore the preferred choice when annotating newly discovered genes. There are two main categories of algorithms for predicting orthologs: phylogenetic algorithms and sequence comparison algorithms. Both are based on sequence similarity, but each has its own characteristics. Phylogenetic methods predict orthologs by reconstructing phylogenetic trees. As a result, they are conceptually accurate but hard to automate, and they demand a huge amount of computational resources. In contrast, sequence comparison methods are conceptually less accurate but less complex and require fewer computational resources, and are therefore widely used.

The ortholog information differs from one COG (Cluster of Orthologous Groups) to another. The distance between species can be judged from the COG information: if two species share many orthologs, they can be considered closely related. From this, the topology of the phylogenetic tree of species evolution can be inferred.
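The idea that species sharing many orthologs are close can be made concrete with a small sketch. It is only an illustration under stated assumptions: the thesis does not say which distance it uses, so the Jaccard distance below is a stand-in, and the species names and COG labels are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical data: the set of COGs that have a member gene in each species.
cog_sets = {
    "species_A": {"cog1", "cog2", "cog3", "cog4"},
    "species_B": {"cog1", "cog2", "cog3", "cog5"},
    "species_C": {"cog4", "cog6", "cog7"},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|: 0 for identical COG content, 1 if nothing is shared."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Pairwise distances between all species (an assumed stand-in measure).
distances = {
    (s, t): jaccard_distance(cog_sets[s], cog_sets[t])
    for s, t in combinations(sorted(cog_sets), 2)
}

# The pair that shares the largest fraction of COGs is treated as the closest.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A distance table of this kind is one possible input to a tree-building step. The thesis itself derives the species relationships from biclusters of the COG data.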
We use the Cheng & Church algorithm for biclustering, analyze the COG information of the species and adjust the threshold to obtain different results, from which we can judge the evolutionary relationships among species and construct the phylogenetic tree of species evolution.

The main work of the thesis can be divided into two parts. First, based on an analysis of the Cheng & Church algorithm and its applications, the algorithm is implemented. Second, the biclustering algorithm is applied to orthologs in genomic sequences to design a new method for constructing phylogenetic trees. The method is then tested on data sets from common databases, and the results are compared with existing phylogenetic trees. The results show that the new method is subject to less interference than methods based on whole genomic sequences or other methods of constructing phylogenetic trees. The new method basically meets the requirements of the theoretical design.
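The role of the threshold can be illustrated with a simplified sketch of the single node deletion step of the Cheng & Church algorithm: the row or column with the largest mean squared residue is removed repeatedly until the H score falls to the threshold `delta` or below. This is not the thesis implementation. The published algorithm also includes multiple node deletion, node addition and masking of discovered biclusters, all omitted here, and the function names and toy matrix are invented for illustration.

```python
def residues(matrix, rows, cols):
    """Squared residue of every cell of the submatrix, plus its H score."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    sq = {
        (i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    }
    h = sum(sq.values()) / (len(rows) * len(cols))
    return sq, h


def single_node_deletion(matrix, delta):
    """Greedily drop one row or column at a time until H <= delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        sq, h = residues(matrix, rows, cols)
        # Stop at the threshold, or when a single row or column remains.
        if h <= delta or len(rows) == 1 or len(cols) == 1:
            return rows, cols, h
        # Mean squared residue contributed by each row and each column.
        row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Toy matrix (invented): rows 0-2 follow the additive model, row 3 does not.
data = [
    [1.0, 3.0, 5.0, 2.0],
    [2.0, 4.0, 6.0, 3.0],
    [3.0, 5.0, 7.0, 4.0],
    [9.0, 0.0, 8.0, 1.0],
]
kept_rows, kept_cols, h_score = single_node_deletion(data, delta=0.01)
print(kept_rows, kept_cols, h_score)
```

Raising `delta` lets larger, less coherent biclusters through, and lowering it yields smaller, tighter ones. This is the kind of threshold adjustment the abstract refers to.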
Keywords/Search Tags: Biclustering