
Research On Several Models Based On K-word For The Analysis Of DNA Sequences And Applications

Posted on: 2013-01-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Huang
GTID: 1110330371996671
Subject: Applied Mathematics
Abstract/Summary:
In the 20th century, science and technology developed rapidly, which promoted the development of the life sciences. A huge volume of biomolecular data has been produced since the Human Genome Project was launched in the 1990s. To manage these data, which are rich in biological information, and to extract useful information from them, many biologists, mathematicians and computer scientists have been drawn to this new area. Computational molecular biology is an interdisciplinary subject born of these challenging studies. Its core is the analysis of biological sequences, which aims to obtain biological information about both the structure and the function that the sequences express. In recent decades, the models and methodologies for analyzing biological sequences have fallen into two main types: alignment-based and alignment-free models. It has been recognized that multiple sequence alignment models are not suitable for very large volumes of sequence data because of their computational limitations, and many scholars are paying more attention to alignment-free methods. This dissertation puts forward alignment-free models for the analysis of DNA sequences based on k-words. The main achievements can be summarized as follows.

In Chapter 2, a new geometrical representation model of DNA sequences is established. By considering ordered dinucleotides (2-words), a DNA sequence is mapped into a 3D curve. The utility of the proposed curve is illustrated by mutation analysis, similarity analysis and evolution analysis. In the similarity and evolution analyses, we propose a new, simple and effective numerical descriptor characterizing a DNA sequence. By reconstructing the phylogenetic tree of 11 species and comparing the result with other methods, we find that the proposed model carries richer biological information. The model is an effective supplement to the existing geometrical representation models.

In Chapter 3, we construct a new model by generalizing the idea of pseudo-amino acid composition (PseAA) to the analysis of DNA sequences. This model takes the dinucleotide as the research object. We revise the PseAA method by replacing the occurrence frequencies of the 20 amino acids with the frequencies of the 16 dinucleotides. Eight important dinucleotides are then chosen from the sixteen, and the eight LZ complexity factors of the logical sequences of these eight dinucleotides in a DNA primary sequence are selected as the PseAA components. Finally, we characterize a DNA sequence with a 24-dimensional vector (16 dinucleotide frequencies plus 8 LZ complexity factors). The Euclidean distance is used to construct the similarity matrix, and the PHYLIP software is used to reconstruct the phylogenetic trees of two data sets, which illustrates the validity of this model.

In Chapter 4, we propose a probabilistic model of a DNA sequence. First, we define a new probability distribution of a k-word in a DNA sequence, which includes not only the frequency of a k-word but also its positions. To account for the effect of nucleotide mutation, we subtract the background probability from the new probability of each k-word. Finally, a novel characteristic vector composed of these relative differences is derived to characterize a DNA sequence. As an application, we reconstruct the phylogenetic trees of two data sets and use the INDELible software to illustrate the reliability and robustness of the proposed method. Comparison with other methods shows that this characteristic vector contains abundant biological information and is a convincing tool for the analysis of DNA sequences.
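As a rough illustration of the k-word idea, the following Python sketch builds a plain dinucleotide (2-word) frequency vector and compares sequences by Euclidean distance. It is a minimal baseline under stated assumptions, not the dissertation's models: the function names (kword_frequencies, euclidean_distance, distance_matrix) are hypothetical, and the LZ complexity factors, the position-aware probabilities and the background correction mentioned in the abstract are not included.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kword_frequencies(seq, k=2):
    """Relative frequencies of all 4**k overlapping k-words, in lexicographic order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs, k=2):
    """Pairwise Euclidean distances between the k-word frequency vectors."""
    vectors = [kword_frequencies(s, k) for s in seqs]
    return [[euclidean_distance(u, v) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    for row in distance_matrix(["ACGTACGT", "ACGTTCGT", "AAAAAAAA"]):
        print(["%.3f" % d for d in row])
```

A matrix of this kind could then be passed to a tree-building program such as PHYLIP, which the abstract names as the tool used for reconstructing phylogenetic trees.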
Keywords/Search Tags: DNA sequences, k-word, Geometrical representation model, Similarity analysis, Evolution analysis, LZ complexity, Probabilistic model