Analysis Of Biological Sequences Similarity And Research On κ-Word Model

Posted on: 2016-01-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Deng
Full Text: PDF
GTID: 1220330461985478
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the rapid development of science and technology and the full operation of the Human Genome Project, the volume of biological sequence data grows daily, and the focus of biology has shifted from accumulating biological data to analyzing and interpreting it. These large molecular data sets contain abundant information, and managing them and extracting as much information as possible is meaningful work; it has attracted many mathematicians, biologists, and computer scientists to the new interdisciplinary field of bioinformatics. Biological sequence comparison is one of the most important and basic topics in the field, because it plays a critical role in many other research problems, such as molecular evolution, protein structure prediction, and gene identification.

Sequence alignment is the traditional method for biological sequence analysis. Because of its inherent disadvantages, alignment-free methods emerged as a supplement to and development of sequence alignment, and they have rapidly become a hot topic in computational molecular biology. In this dissertation, we take DNA and protein sequences as research objects. Based on graphical representation and the κ-word model, we propose several new alignment-free models, study the similarity analysis of biological sequences, and construct evolutionary trees. The main content is as follows.

First, considering the classification of nucleotides by chemical structure, we improve the existing chaos game representation (CGR) model. We build three kinds of CGR-space for the first time, obtain the corresponding CGR-walk digital sequences, and extract feature invariants of DNA sequences. As an application, we examine the similarity/dissimilarity among the exon of the β-globin gene of different species and obtain better results. On the one hand, our method adds to the diversity of graphical representations of DNA sequences. On the other hand, our work can be treated as an improvement of CGR results. To the best of our knowledge, this model is the first to take the biochemical properties of the bases into account; the graphical representation is intuitive, and the invariants are easier to calculate. Compared with other models, our results are also closer to known biological facts, so our model contains richer biological information.

Next, based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose the dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. By establishing a mathematical model, we obtain the correspondence between DV-curves and protein sequences. This graphical representation not only avoids the degeneracy problem but also visualizes well regardless of sequence length, and it reflects the length of the protein sequence. The utility of the proposed curve is illustrated in two ways: first, we analyze the similarity of ND6 protein sequences from different species using their intuitive DV-curves; second, we build a 24-dimensional characteristic vector for quantitative comparison of protein sequences. The Euclidean distance is used to construct the similarity matrix and to reconstruct the phylogenetic tree of 35 coronaviruses based on their spike proteins.

In Chapter 5, we propose a new κ-word model to analyze biological sequences. Considering the effect of base mutation, we subtract the background probability when defining a new κ-word probability distribution. We then obtain a 4^κ-dimensional feature vector for a DNA sequence and apply it to two examples, 48 HEV gene sequences and the mitochondrial genome sequences of 26 placental mammals, with satisfactory results. Finally, we discuss the optimal value of κ.
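The abstract does not give the exact definition of the background-corrected κ-word distribution, so the following Python sketch is only an illustration of the general idea, not the dissertation's formula. It assumes the background probability of a word is the product of the single-base frequencies of the sequence (an i.i.d. model), and it assumes that feature vectors are compared with the Euclidean distance, which the abstract reports only for the DV-curve vectors. The function names `kword_feature_vector`, `euclidean_distance`, and `distance_matrix` are hypothetical.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kword_feature_vector(seq, k):
    """Return a 4**k-dimensional vector of (observed k-word frequency
    minus background probability), one entry per word in lexicographic order.

    ASSUMPTION: the background is an i.i.d. model built from the
    single-base frequencies of `seq`; the source does not specify it.
    """
    seq = seq.upper()
    # Collect overlapping windows of length k made only of A, C, G, T.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    windows = [w for w in windows if all(c in ALPHABET for c in w)]
    if not windows:
        raise ValueError("sequence contains no valid k-words")
    counts = Counter(windows)
    total = len(windows)

    # Single-base frequencies used for the assumed background model.
    base_counts = Counter(c for c in seq if c in ALPHABET)
    n_bases = sum(base_counts.values())
    base_freq = {b: base_counts[b] / n_bases for b in ALPHABET}

    vector = []
    for word in map("".join, product(ALPHABET, repeat=k)):
        observed = counts[word] / total
        background = math.prod(base_freq[c] for c in word)
        vector.append(observed - background)
    return vector


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs, k):
    """Pairwise Euclidean distances between the k-word vectors of `seqs`."""
    vectors = [kword_feature_vector(s, k) for s in seqs]
    return [[euclidean_distance(u, v) for v in vectors] for u in vectors]
```

A distance matrix of this kind could then be passed to a standard tree-building method to obtain a phylogenetic tree; the choice of κ controls the dimension 4^κ of each vector.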
Keywords/Search Tags: DNA sequences, Chaos game representation, Similarity analysis, Sequence alignment, Free alignment, Protein sequences, DV-Curve representation, Graphical representation model, κ-word, Probabilistic model, Phylogenetic tree