
Analysis On The Characteristics Of Biological Sequences Based On Time Series Theory Methods

Posted on: 2010-11-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Gao
Full Text: PDF
GTID: 1100360278474883
Subject: Light Industry Information Technology and Engineering
Abstract/Summary:
DNA, RNA and protein sequences are of fundamental importance in understanding living organisms, since all hereditary and evolutionary information is contained in these macromolecules. Once DNA and proteins have been sequenced, extracting more biological information from the sequences is a challenging problem. The number of nucleotides and amino acids stored in GenBank has been growing exponentially, so it has become important to develop new theoretical methods for DNA and protein sequence analysis. Many biologists, physicists, mathematicians and computer specialists are attracted to this research field.
After introducing the background of bioinformatics, this dissertation first introduces the time series methods applied to studying the characteristics of biological sequences. We introduce the short-memory ARMA model and the long-memory ARFIMA model, both of which are applied to biological sequence analysis in this work.
Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find coordinates for their positions in a continuous space. A CGR-walk model for DNA sequences is proposed based on CGR coordinates. The CGR coordinates are converted into a time series, and a long-memory ARFIMA(p, d, q) model is introduced into DNA sequence analysis. This model is applied to real CGR-walk sequence data from ten genomic sequences. Remarkable long-range correlations are uncovered in the data, and the data are fitted well by ARFIMA(p, d, q) models.
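
The CGR mapping described above can be illustrated with a minimal sketch. It assumes the commonly used convention for DNA, which the abstract does not state: the four nucleotides sit at the corners of the unit square, the walk starts at the centre, and each new point lies halfway between the previous point and the corner of the current nucleotide. The corner assignment and the function name `cgr_coordinates` are illustrative choices. The abstract does not say how the coordinates are turned into the CGR-walk time series, so the sketch stops at the coordinates.

```python
# Corner assignment is an assumed convention, not taken from the abstract.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the list of CGR points (x, y) for a DNA sequence.

    Each point is the midpoint between the previous point and the corner
    assigned to the current nucleotide. Symbols without a corner (for
    example the ambiguity code N) are skipped.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous or unknown symbols
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_coordinates("ACGT"):
        print(point)
```

Because every step halves the distance to a corner, each point encodes the whole preceding subsequence, which is why the resulting coordinate sequence can be studied for correlations.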
As a classical time series model with well-developed algorithms, the ARFIMA model can help reveal previously unknown characteristics of DNA sequences.
Because the success rate in selecting the correct ARFIMA model is low, and maximum likelihood estimation of its parameters is computationally complicated, approximating an ARFIMA model by a short-memory process for prediction is a topic of interest in the literature. We analyze the approximation of a general long-memory ARFIMA(p, d, q) process by a short-memory ARMA(1, 1) process. To validate this approximation, we prove a mean square error forecast criterion. The performance of the ARMA(1, 1) approximation to an ARFIMA model is illustrated with an application to ten DNA sequences. This yields an approximating model with a simpler algorithm.
We also study the approximation of a long-memory fractionally differenced ARFIMA(0, d, 0) model by a short-memory ARMA(2, 2) process. Comparing the efficiency loss ratios of the ARMA(2, 2) and ARMA(1, 1) models shows that ARMA(2, 2) approximates the ARFIMA(0, d, 0) model better than ARMA(1, 1) does. To validate this conclusion, the two approximating models are applied to a CGR-walk sequence that follows an ARFIMA(0, d, 0) model. Judged by the standard deviation of the prediction error, the ARMA(2, 2) approximation is again better than the ARMA(1, 1) approximation.
We also propose a method that, by modifying the Kalman filter recursive equations, allows efficient estimation of a long-memory ARFIMA process with missing values. To illustrate its application and effectiveness, we analyze a CGR-walk sequence of a DNA sequence and conclude that the proposed approach is efficient.
Building on the CGR-walk model of DNA sequences, a new CGR-walk model for the linked protein sequences from complete genomes is proposed, based on the detailed HP model. A long-memory ARFIMA(p, d, q) model is introduced into protein sequence analysis. This model is applied to real CGR-walk sequence data from twelve linked protein sequences, taken from twelve complete bacterial genomes. Remarkable long-range correlations are uncovered in the data, and the data are reasonably well fitted by ARFIMA(p, d, q) models.
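
The contrast between the long-memory and short-memory models discussed above can be made concrete by comparing their theoretical autocorrelation functions. The sketch below is an illustration, not the dissertation's procedure. It uses the standard closed-form recursions: for ARFIMA(0, d, 0) with -0.5 < d < 0.5, rho(k) = rho(k-1)(k-1+d)/(k-d), and for the ARMA(1, 1) process X_t = phi X_{t-1} + e_t + theta e_{t-1}, rho(1) = (1 + phi theta)(phi + theta)/(1 + 2 phi theta + theta^2) with rho(k) = phi rho(k-1) for k >= 2. The function names and the parameter values in the usage lines are hypothetical and are not fitted values from the dissertation.

```python
def arfima_0d0_acf(d, max_lag):
    """Autocorrelations rho(0..max_lag) of ARFIMA(0, d, 0), -0.5 < d < 0.5.

    Uses rho(k) = rho(k-1) * (k - 1 + d) / (k - d), which decays
    hyperbolically (long memory) when d > 0.
    """
    rho = [1.0]
    for k in range(1, max_lag + 1):
        rho.append(rho[-1] * (k - 1 + d) / (k - d))
    return rho


def arma11_acf(phi, theta, max_lag):
    """Autocorrelations rho(0..max_lag) of a stationary ARMA(1, 1) process.

    Model: X_t = phi * X_{t-1} + e_t + theta * e_{t-1}, with |phi| < 1.
    After lag 1 the autocorrelation decays geometrically (short memory).
    """
    rho = [1.0]
    if max_lag >= 1:
        numerator = (1 + phi * theta) * (phi + theta)
        denominator = 1 + 2 * phi * theta + theta ** 2
        rho.append(numerator / denominator)
    for _ in range(2, max_lag + 1):
        rho.append(phi * rho[-1])
    return rho


if __name__ == "__main__":
    # Hypothetical parameters chosen only to show the two decay patterns.
    long_memory = arfima_0d0_acf(d=0.3, max_lag=50)
    short_memory = arma11_acf(phi=0.9, theta=-0.5, max_lag=50)
    for lag in (1, 5, 10, 50):
        print(lag, long_memory[lag], short_memory[lag])
```

A short-memory approximation can match the first few autocorrelations of a long-memory process but not its slow decay at large lags. This is one way to see why forecast-based criteria such as the mean square error are used to judge the quality of the approximation.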
Keywords/Search Tags: Chaos Game Representation (CGR)-walk model, DNA sequence, protein sequence, short-memory ARMA model, long-memory ARFIMA(p, d, q) model, mean square error (MSE) criterion, maximum likelihood estimation (MLE), state-space model