Distance Measures Of Biological Molecular Data And Their Applications

Posted on: 2010-12-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Q Zheng
Full Text: PDF
GTID: 1100360275458046
Subject: Basic mathematics
Abstract/Summary:
Comparison of biological molecular data is one of the most fundamental and important tasks in bioinformatics. By comparing molecular sequences, one can obtain functional, structural and evolutionary information about them. Many other research areas in bioinformatics, such as database search, phylogenetic tree construction, prediction of protein structure and function, and DNA sequence assembly, first need to estimate the similarity between sequences. The traditional approach to this problem is sequence alignment, but it suffers from a high computational load and from the inherent ambiguity of the alignment cost criteria. There is therefore a great need to develop new alignment-free methods of sequence comparison and to investigate their use in other areas of bioinformatics, especially whole-genome phylogenetic analysis. In this dissertation, we focus on two main categories of alignment-free sequence comparison. The main contents are arranged as follows.

In Chapter 2, we study two distance measures based on the frequency statistics of short strings (words) in biological sequences. The first can be considered a revision of the classical Relative Entropy (RE). It avoids the degeneracy that arises when RE is used as a distance and some words are absent from a sequence. In the second approach, under the Poisson model of word occurrences, we define the "expression level" of an individual word. The distance between two sequences is then evaluated from the discrepancy of each word between the two sequences. The validity of both approaches is shown by constructing the phylogenetic tree of 25 viruses, including SARS-CoVs.

In Chapter 3, we investigate a distance metric based on the complexity of symbol sequences. This metric uses the saving in joint compression as a measure of the distance between two sequences and makes few assumptions about the evolutionary model. Therefore, it is not greatly affected by some evolutionary events, e.g., large rearrangements and transposon activity. As applications, we construct the evolutionary tree of 24 protein structures and predict protein subcellular location on 3 widely used data sets. Additionally, for the comparison of protein structures, we propose a "symbol assignment" approach, which translates protein structures into symbol sequences.

A characteristic sequence is a coarse-grained description of the primary DNA sequence. Obviously, an individual characteristic sequence keeps some biological information and loses the rest. But what kind of information, and how much, does an individual binary sequence carry? In the last chapter, we give an answer from the evolutionary perspective by constructing the phylogenetic trees of three data sets.
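
As a rough illustration of the two families of alignment-free distances summarized in the abstract, the following minimal sketch may help. It is not the dissertation's method: the abstract does not give the revised RE formula or the compressor used, so the pseudocount smoothing, the symmetrized form of RE, the normalized compression formula, the use of `zlib`, and all function names and parameters here are assumptions made only for illustration.

```python
import math
import zlib
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT", pseudocount=1.0):
    """Smoothed frequencies of all words of length k over the alphabet.

    The pseudocount is an assumed fix for absent words; the thesis's
    own revision of relative entropy is not specified in the abstract.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def symmetric_relative_entropy(seq_a, seq_b, k=3):
    """Symmetrized relative entropy between two k-mer distributions.

    Without smoothing, a word missing from one sequence would make
    log(p/q) undefined or infinite: the degeneracy noted in the text.
    """
    p = kmer_frequencies(seq_a, k)
    q = kmer_frequencies(seq_b, k)
    return sum(
        p[w] * math.log(p[w] / q[w]) + q[w] * math.log(q[w] / p[w])
        for w in p
    )


def compression_distance(seq_a, seq_b):
    """Distance from the saving in joint compression (NCD-style).

    zlib stands in for whatever compressor or complexity estimate
    the thesis actually uses; this is an assumption.
    """
    size_a = len(zlib.compress(seq_a.encode(), 9))
    size_b = len(zlib.compress(seq_b.encode(), 9))
    size_ab = len(zlib.compress((seq_a + seq_b).encode(), 9))
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

Under these assumptions, a pairwise matrix of either distance over a set of genomes could then be passed to a standard tree-building method to obtain a phylogeny, which is the kind of validation the abstract describes.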
Keywords/Search Tags: Bioinformatics, Distance measure, Phylogenetic analysis, Protein subcellular location prediction, Sequence complexity