
Genetic Algorithm Search For Transcript Factor Binding Site And Its Improved Algorithms

Posted on: 2009-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: M Niu
Full Text: PDF
GTID: 2178360242981561
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is an interdisciplinary field that formed during the 1980s, along with the growing volume of data produced by the Human Genome Project. With the development of biology and medical science, and especially the progress of the Human Genome Project, massive amounts of data have been produced, and the accumulation is accelerating day by day. These data contain abundant biologically relevant information. Analyzing and processing these data, revealing what they mean, and fully exploiting them is a great challenge for mathematicians and biologists.

Bioinformatics research uses statistics, pattern recognition, dynamic programming, artificial neural networks (ANN), genetic algorithms, and other methods to analyze sequence and structure data; to acquire knowledge about gene coding and gene regulation; to explain the relationships among cells and organs and the regular patterns of an individual's birth, growth, illness, and aging; and to explore the origin and evolution of life, eventually forming a periodic table of biology.

All the cells of an organism contain the same genetic information, but a gene is expressed differently in different organs and cells. This is due to the gene regulation mechanism. The key regulation of gene expression occurs at the transcription level. A transcription factor is a specific protein that binds to the promoter and inhibits or promotes the initiation of transcription by RNA polymerase. The upstream sequence to which a transcription factor binds is called a transcription factor binding site (TFBS). In prokaryotic cells, a TFBS is normally a conserved upstream sequence about 9-20 bases long. There are currently two ways to locate TFBSs: biological (experimental) methods and computational methods. Because the biological methods cost much more, researchers tend to focus on the computational approach.

A genetic algorithm (GA) is a stochastic optimization algorithm based on Darwin's theory of evolution and Mendelian genetics.
It is used in many areas because of its simplicity and efficiency. The GA provides a stochastic optimization framework whose implementation includes encoding the solution space, constructing the fitness function, selecting the operators, and choosing the parameters. The choice of encoding method, the construction of the fitness function, and the crossover and mutation operators depend directly on the specific problem. The performance of a GA is also influenced by parameters such as the population size and the mutation rate, so how to choose them properly concerns many researchers.

In the third chapter of this thesis, a novel method is proposed that combines matrix encoding with the GA. We use a frequency matrix to build the model, which avoids losing too much information at weakly conserved positions, and we use the GA to predict TFBSs. Transcription factor data from Escherichia coli and yeast were used as test data. The results show that the method has a strong ability to recognize conserved sequences and performs better than the Gibbs Sampler software used for comparison.

In the fourth chapter, we improve the original GA with a self-adapting mutation probability and a self-adapting population size. To address the problem of local optima, we use the fitness refreshing rate as a parameter to adjust the mutation probability dynamically. Two functions, an exponential function and a sigmoid function, were used for this adjustment, and their parameters were tuned to suit the experiments. Using the fitness refreshing rate to decide whether to change the population size also makes it possible to detect genetic drift, which arises when the population is small and causes convergence to fail.
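
The abstract does not give the formulas used for the self-adapting mutation probability, so the following is only an illustrative sketch under stated assumptions. It assumes that the fitness refreshing rate is the fraction of recent generations in which the best fitness improved, and that the mutation probability should rise as that rate falls. The function names, the window length, and the constants `p_min`, `p_max`, `k`, and `midpoint` are all hypothetical and are not taken from the thesis.

```python
import math


def refresh_rate(best_history, window=10):
    """Assumed definition: fraction of the last `window` generations
    in which the best fitness strictly improved."""
    recent = best_history[-(window + 1):]
    if len(recent) < 2:
        return 1.0  # no history yet: treat the search as still progressing
    improved = sum(1 for prev, cur in zip(recent, recent[1:]) if cur > prev)
    return improved / (len(recent) - 1)


def mutation_prob_exp(rate, p_min=0.01, p_max=0.2, k=5.0):
    """Exponential mapping (hypothetical constants): the probability decays
    from p_max toward p_min as the refreshing rate grows."""
    return p_min + (p_max - p_min) * math.exp(-k * rate)


def mutation_prob_sigmoid(rate, p_min=0.01, p_max=0.2, k=10.0, midpoint=0.3):
    """Sigmoid mapping (hypothetical constants): the probability switches
    smoothly from near p_max to near p_min around `midpoint`."""
    return p_min + (p_max - p_min) / (1.0 + math.exp(k * (rate - midpoint)))
```

In a GA loop, one would append the best fitness of each generation to `best_history`, call `refresh_rate(best_history)`, and pass the result to one of the two mapping functions before applying mutation. A stagnating search (rate near 0) then receives a high mutation probability, which is one plausible way to escape local optima.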
The experimental results show that the GA with the self-adapting population size operator converges faster in the early stage of the search.

At the end of the thesis, we summarize the work and discuss directions for future work. TFBS prediction with traditional methods is comparatively mature, and using novel, more advanced methods to surpass them is a focus of bioinformatics research. How to tune the parameters so that the improved algorithm better suits real data, how to make the algorithm work with large populations, and how to identify multiple binding sites in one sequence more accurately remain to be explored.
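
The frequency-matrix model mentioned for the third chapter can be illustrated with a small sketch. The abstract does not state the fitness function, so this example assumes a common choice: the information content of the frequency matrix relative to a uniform background. It also assumes that a GA individual is a list of start positions, one per sequence. The names `frequency_matrix`, `information_content`, and `fitness` are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter

BASES = "ACGT"


def frequency_matrix(sites):
    """Build one {base: frequency} dict per position from equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all candidate sites must have the same length")
    matrix = []
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in BASES})
    return matrix


def information_content(matrix, background=0.25):
    """Sum over positions of f * log2(f / background), skipping zero entries.
    Higher values mean the candidate sites are more strongly conserved."""
    total = 0.0
    for column in matrix:
        for freq in column.values():
            if freq > 0:
                total += freq * math.log2(freq / background)
    return total


def fitness(sequences, starts, width):
    """Assumed GA fitness: score the sites cut out at the given start positions."""
    sites = [seq[s:s + width] for seq, s in zip(sequences, starts)]
    return information_content(frequency_matrix(sites))
```

Because every position keeps its full base distribution rather than a single consensus letter, weakly conserved positions still contribute to the score in proportion to how skewed they are, which matches the motivation given in the abstract for using a frequency matrix.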
Keywords/Search Tags:Transcript