Intelligent Algorithms For Gene Regulator Binding Sites Identification

Posted on:2008-12-13

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Chang

Full Text:PDF

GTID:2178360212996793

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

The 21th century is the era of biology science rapidly developing. With the implementation of the"human genome project", the data of the nucleic acid sequence and the data of protein's sequence and structure are accumulate at the speed of exponential level. These data include plenty of information about the essence of life. These data include the information stored in the genetic and protein code as well as the experimental results from a variety of sources such as the gene expression and patient statistics. With these so many and so complex data, bioinformatics is set up and become more and more prosperous. Bioinformatics derives from using computer to analyze and store biological data. It is a rapidly developing subject and is highly interdisciplinary. Bioinformatics uses the techniques and concepts of informatics, statistics, computer science, mathematics and linguistics. It has many applications in the areas of biology and medicine. Research in bioinformatics includes developing methods for storage, retrieval, and analysis of the data. There are three very important sub disciplines in bioinformatics: 1 the development of new algorithms and statistics methods with which to assess relationships among members of large data sets 2 the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures 3 the development implementation of tools that enable efficient access and management of different types of information.One of key issues in modern molecular biology is to research and explore the principle and the process of the gene expression. The regulator and its binding sites mainly lead this mechanism. The regulator is a kind of protein which can regulate the expression of genes. It is also called transcription factor. During the process of gene expression, transcription factor interact with a DNA segment which locate in the promoter in the upstream of the gene. It can enhances or decrease the level of gene express. The outside circumstance dominates whether the regulator can interact with DNA. The DNA segment which can interact with regulator and so regulates the expression of gene is called transcription factor binding sites. Generally speaking, binding sites is a conserved DNA segment in the upstream of genes. Their length is about 8-20 bps.The existent methods which are used to identifying Regulator binding sites include biochemistry methods and some computational methods. The disadvantages of the biochemistry methods include long timespending and high cost. The existent computational methods include greedy algorithm, Expectation & Maximization (EM) algorithm and Gibbs Sampling methods. This paper introduces the most popular methods: EM algorithm and Gibbs Sampling method during the second chapter. All these algorithms are local searching algorithm. These algorithms often converge to a local optimal result and get a motif without biology sense.Evolutionary computing is a new computational thinking. It imitates the evolutionary method of biology, and continually optimizes the current result (local optimal), which can get the more optimal result than the current one. Evolutionary algorithm includes methods of Genetic Algorithm (GA), evolutionary programming (EP), Particle Swarm Optimization (PSO) etc. The main work of this paper is to research how to use computational method to identify regulator binding sites from a set of coexpression gene sequences.In the third chapter, a new framework of using evolutionary computing to find binding sites is proposed. Different from the existent method, under this framework, the first step is to identify which model is used to represent the binding sites. For example, consensus model and PWM mode can be used. During the second step, parameters of the model are coded as the individuals for evolutionary computing. Then, result is got according to the evolutionary computing rules. Under this framework, genetic algorithm and PSO algorithm– two most popular algorithms are implemented. The model used to represent binding sites motif in this paper is the position weight matrix (PWM)。For the different characteristic of genetic algorithm and PSO algorithm, this paper proposed two effective coding methods for these two different algorithms. To use these methods, any individual that is not adapt to the restriction will not be generated, which make these algorithms more efficient. The data used to do experiment are downloaded from network. These are the most useful data. One is the 5 different yeast Saccharomyces Cerevisiae transcription factor binding sites. The two algorithms can get the results which are the same to the existent consensus sequences. This can prove that this algorithm is effective. Another data is the E.coli CRP 18 genes. The results show that these two algorithms can get the results more accurate than EM algorithm and Gibbs Sampler.The fourth chapter of this paper proposed the second method: using greedy methods to search conserved segments in the sequence set. The proposed greedy algorithm in this paper is an iterative progress. During each step, only one segment of the current optimal motif is renewed and abetter motif is got after this step. In order to solve the problem of high computational complexity, this paper analysis the information content function. Some unchanged data for the iterative progress is calculated firstly and preserved in memory. During the process of calculation, it only need do sum. The result is that reduce the calculation time of multiplication, division and logarithm very much. So, only very small number of memory cost is spent, the time complexity is reduced very much. Results of experiment show that: the advanced algorithm is swifter 5-8 times than the orient original algorithm.The advanced greedy algorithm can get a result which is much better than the initial one. In chapter 4, this greedy algorithm is combined in the classical genetic algorithm to solve this problem. After decoding an individual, greedy algorithm is used to find a better result than the current one. The better result's information content is the fitness of this individual. Experimental result shows that this method is effective. It can get better result than MEME (using EM algorithm), AlignACE, and Gibbs Sampler (using Gibbs Sampling method).The last part of this paper gives some discussions. It also gives some hope for the future work. The problem of a sequence contains zero or more than one binding sites can be solved in future. Algorithms proposed by this paper are based on an assumption: each gene sequence contain only one Motif segment. Another suggestion is that sequential pattern mining can be applied in this field. In the future work, we can focus on how to apply sequential pattern mining to identify binding sites.

Keywords/Search Tags:

Identification

PDF Full Text Request

Related items

1	Research On P2P Traffic Identification In Asymmetric Routing
2	Research On Finger Multimode Identification Technology
3	Subspace Methods For System Identification And Predictive Control Design
4	The Research Of Tag Identification Performance In Passive UHF RFID System
5	Research And Reality Of Process Identification
6	Research On RFID Tag Identification Protocols For Internet Of Things
7	Research On Tag Identification Technology In Mobile RFID Systems
8	Research On The Key Technologies Of Service Identification In IP Networks
9	Dynamic Parameter Identification Of Industrial Robots
10	Iterative Identification For Multivariable Systems Based On Equation-Error Models