Eukaryotic Promoter Recognition Based on Latent Semantic Analysis

Posted on: 2010-12-21  Degree: Master  Type: Thesis
Country: China  Candidate: Y Qin  Full Text: PDF
GTID: 2208360302958732  Subject: Computer application technology
Abstract/Summary:
Gene identification is an important branch of bioinformatics. It employs biological experiments or computational methods to identify DNA fragments with particular biological features within a DNA sequence. A promoter is a regulatory DNA sequence that indicates the location of the transcription start site and can therefore be used to locate a gene. The promoter prediction algorithm proposed in this thesis is a gene identification tool that finds the approximate location of a gene and provides a reference for biological experiments.

Latent Semantic Indexing (LSI) is widely applied in text mining. The author applies LSI to the prediction of eukaryotic promoters and proposes promoter prediction algorithms based on Global LSI (GLSI) and Difference LSI (DLSI). The experimental results are compared with results reported in the literature. In addition, the eigenvalues of Laplacian matrices are used to analyze the similarity between sets of DNA sequences, and are applied to select and evaluate samples for the promoter prediction algorithm.

The dissertation consists of three parts.

First, a promoter prediction algorithm based on Global LSI is proposed. The experiments show that LSI is effective for dimensionality reduction and improves classification. We also analyze the influence of several factors, such as the representation of DNA sequences, the choice of samples, and the filtering threshold, and summarize the disadvantages of the GLSI model.

Second, a new DLSI promoter prediction algorithm is proposed, building on the GLSI promoter prediction algorithm. The experimental results show the effectiveness of the DLSI model, and a similar analysis of the same factors establishes its advantage over the GLSI model.

Finally, a new divergence based on Laplacian matrix eigenvalues is defined to measure the similarity between different kinds of DNA sequence sets. Experiments on synthetic data and on real DNA sequence sets show that this divergence can measure the similarity between sequence sets. The measure is used to select and evaluate samples for promoter identification, and the experiments find that the lower the similarity, the better the identification result.

The innovations of the thesis are as follows:
(1) A promoter prediction algorithm based on the GLSI model is proposed; its results are better than those reported in the literature.
(2) A second promoter prediction algorithm based on the DLSI model is proposed. It avoids the sample selection and threshold setting required by the GLSI model, and further improves promoter identification results.
(3) A new divergence is proposed that effectively measures the degree of similarity between sequence sets.
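
The abstract does not give the details of the GLSI or DLSI algorithms, so the following is only a minimal sketch of the general LSI idea as it might be applied to DNA: sequences are represented as k-mer count vectors (an assumed representation), a truncated SVD of the k-mer-by-sequence matrix gives a low-dimensional latent space, and sequences are compared by cosine similarity in that space. The function names, the choice k = 3, the rank, and the toy sequences are all hypothetical and are not taken from the thesis.

```python
from itertools import product

import numpy as np


def kmer_counts(seq, k=3):
    """Count overlapping k-mers of a DNA sequence over the alphabet ACGT."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            vec[j] += 1
    return vec


def fit_lsi(seqs, k=3, rank=2):
    """Build a k-mer-by-sequence matrix and keep its top `rank` left singular vectors."""
    A = np.column_stack([kmer_counts(s, k) for s in seqs])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank], s[:rank]


def project(seq, U, k=3):
    """Map a sequence into the latent space spanned by the columns of U."""
    return U.T @ kmer_counts(seq, k)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy data (hypothetical, for illustration only): two TATA-rich sequences
# and one unrelated sequence.
train = ["TATAAAAGGCGCTATAAA", "GGGCGGTATAAAAGC", "ACGTTGCAACGTTGCA"]
U, singular_values = fit_lsi(train, k=3, rank=2)

query = project("TATAAAAG", U)
for s in train:
    print(s, round(cosine(query, project(s, U)), 3))
```

A classifier in this spirit could assign a query to whichever class (promoter or non-promoter) contains its most similar training sequences in the latent space; how the thesis actually makes that decision, and how the difference model departs from the global one, is not stated in the abstract.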
Keywords/Search Tags: LSI, promoter identification, global model, difference model, Laplacian matrix