Font Size: a A A

Study On Computational Optimizational Problems In Computational Biology

Posted on:2006-09-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:X L WangFull Text:PDF
GTID:1100360155467063Subject:Operational Research and Cybernetics
Abstract/Summary:PDF Full Text Request
Computer scientists and mathematicians working in computational molecular biology have focused largely on the analysis of DNA and amino acid sequences since its inception in the early 1970s. Problems that have received the most attention include similarity searching, genome rearrangement and phylogeny reconstruction. The researchers analyze data coming from distinct species and learn from the similarities and differences in related genomes. Many studies convincingly proved that genome rearrangements represent a common mode of molecular evolution in comparative ge-nomics, it can provide meaningful insights into biology. This important realization is now apparent to researchers in diverse fields.The genome rearrangement problem is to find an optimal scenario transforming one genome to another via minimum number of rearrangement events. This has been a fundamental problem for mathematicians and computer scientists. We call the minimum number of rearrangements transforming one genome to another as the evolution distance. And we call the problem to find an optimal rearrangement sequence of transforming one genome to another as the problem of genomic sorting.Genome rearrangement and protein similarities search are hotspots and difficulty of research in comparative genomics recently. In this dissertation the problems associated with multi-chromosomal genomes rearrangement are considered. The complexities of some algorithms are reduced and approximate algorithms are provided. And a problem associated with protein similarities search is discussed in this dissertation, the complexity and approximate algorithms are discussed. The dissertation is split into seven chapters.In Chapter 1, we introduce the related combinatorial optimization problem considered in this dissertation, and review the existing literature on the algorithmic aspectsof these problems, and summarise the fundamental results and notations.In Chapter 2, we study the problem of multi-chromosomal genome rearrangement by (reciprocal) translocation. In 1996, S. Hannenanlli showed that the problem of genome rearrangement evolving by reciprocal translocations is in P, and gave aO(n3) algorithm for the problem.At the start of the chapter, we analyze the structure of the so-called cycle graph of a signed permutation in the context of genome rearrangement problem. We give the linear algorithm, Algorithm 2-1, for computing the connected component in the overlap graph of the signed permutation, which is the base in Chapter 2, Chapter 3 and Chapter 4.Then we introduce the concept of minimum companion subpermutation(co-MSP), and analyze the properties of minimum subpermutations (MSP) by analyzing those of co-MSP. We define a separating number(a) on chromosomes and give the relation of a and the structure of the cycle graph. A linear algorithm, Algorithm 2-2, is devised, to compute some parameters needed for computing the translocation distance. The main conclusion goes as follows.Theorem 6 The translocation distance between two signed genomes can be computed in O(n), where n is the number of genes in one genome.Chapter 3 contains a number of contributions that help to improve our understanding of the algorithmic issues involved in the problem of genome rearrangement evolving by reversals, translocations, fissions and fusions. We begin the discussion from special connected components in the interleaving graph which leads to a better understanding of how these components affect the general problem. We define some functions, reflecting the properties of the special connected components.In Section 3.3, we first design a linear algorithm, Algorithm 3-1, that will be used to decide the existence of a special property for an connected component, which is important in the latter.Secondly we analyze the labels on a special concatenation of the given genomes. We execute Algorithm 2-1 to get the positions' labels by components. This leads to linear time algorithms, Algorithm 3-2 and Algorithm 3-3 for computing the genomic distance between signed multi-chromosomal genomes. Hence we have the following conclusion.Theorem 12 The genomic distance can be computed in the general case in O(n) time, i.e., we can compute the length of a shortest sequence of reversals, translocations, fissions and fusions required to transform one genome into another in linear time.In Chapter 4, we go on to study the problem of multi-chromosomal genomes rearrangement. We study the algorithm of computing the optimal rearrangement sequence of multi-chromosomal genomes, which evolving by reversals, translocations, fissions and fusions, where each gene occurs exactly once in a genome. We manage to transform it to the problem of sorting by reversals. In order to get an optimal evolution sequence, we manage to mimick multi-chromosomal rearrangements by reversals. How to reduce or avoid the number of nonsense reversals which appeared in the transformation is a key problem. We convert it into the problem of finding an optimal capped concatenation of the two genomes with time less than O(ny/nlogn) to meet our purpose. The optimal capped concatenation depends on an optimal orientation (flipping) of the cycle graph and an optimal cappings of the two genomes.After the implement of Algorithm 3-2 on the cycle graph, we can have a linear algorithm, Algorithm 4-1, which could find an optimal flipping of the cycle graph. We will find an optimal capping by Algorithm 4-2 in linear time. Then we devise an linear algorithm, Algorithm 4-3, to find an optimal orientation of the cycle graph after flipping a segment of some chromosomes.Finally, we devise a linear algorithm, Algorithm 4-4, to find the optimal concatenation. Therefore, we can get the final algorithm, Algorithm 4-5, for the optimal rearrangement sequence of multi-chromosomal genomes, and get the following conclusion.Theorem 18 The optimal rearrangement sequence of multi-chromosomal genomes can be obtained by Algorithm 4-5 in O(ny/nlogn).In Chapter 5, we consider the translocation media problem. This is a simple version of the problem phylogeny reconstruction. We focus on the Translocation Median Problem(TMP), which is a problem of finding a genome which is closest to a given set of genomes with respect to the translocation distance.We study the TMP in signed and unsigned models respectively. In Section 5.2, we study the unsigned model. We study a median problem for breakpoints, which can be transformed to the Travelling Salesman Problem (TSP), to approximate TMP. We giveAlgorithm 5-1 to approximate TMP. For the signed model, we devise Algorithm 5-3 to approximate TMP by computing a maximum weighted matching in a constructed graph. We also consider to approximate TMP by transforming the signed model to a unsigned model, and devise Algorithm 5-4 for this purpose. In particular, we show that there exist optimal solutions in some special cases in signed model, and give the following conclusions.Theorem 21 For m = 2, any genome in an optimal evolution genomes sequence from III to IT2 is an optimal solution for the TMP. And the optimal value is d(IIi, n2).Theorem 23 For m = 3, if there exist two genomes UiJlj, such that the translo-cation distance between them is no more than k, where k is a, constant integer. Then we can get an optimal solution for the TMP in O(2hn2+2k).In Chapter 6, we consider the problems of sorting by reversals and sorting by transpositions in a simple case that the strings to be compared are drawn from a binary alphabet . Christie and Irving proved that the problem of finding reversal distance between binary strings, and therefore between strings over an arbitrary fixed alphabet, is NP-hard. But the problem of finding transposition distance between binary strings is still open. We focus on binary strings and design two approximation algorithms, Algorithm 6-1 and Algorithm 6-2, for sorting binary strings by reversals and for sorting binary strings by transpositions, respectively. Each algorithm produces a solution which is within an expected factor 2 of the optimal solution.In Chapter 7, we consider a special combinational problem on protein similarities, i.e., for a given mRNA and a protein, we try to determine a sequence of nucleotides which has maximal similarity to the given mRNA as well as its induced protein sequence is maximally similar to the given protein, and additionally obeys some specified secondary structure constrains. Backofen, Narayanaswamy, and Swidan modelled this problem in terms of an optimization problem called MRSO (MRna Structure Optimization). They proved the decision version of MRSO is NP-complete and a 2-approximation algorithm has been proposed in the general case, and they proved that there exists a linear time algorithm if the considered secondary structure is an outer-planar graph. B. Dirk investigated its complexity further, and gave a 4-approximation for a restricted version of the problem of MRSO.We consider a restricted problem of MRSO, which satisfies S-conditions (not in-eluding stop-codon). We and call it as the MRSOS problem. The complexity will be discussed in Section 7.3. In Section 7.4, we present the approximation algorithms, Algorithm 7-1 and Algorithm 7-2, for a restricted version of the problem of MRSOS. We consider a special version of MPSOS with property A.Theorem 31 The problem MRSOS is APX-hard and not approximable within | — e, for an arbitrary e > 0.Theorem 32 MRSDS-dl is APX-hardness, and not approximable within j||| — e.Theorem 34 Algorithm 7-2 is a polynomial 7-approximation for MRSOS-dl.Theorem 35 For each graph G with property A, Algorithm 7-1 guarantees a 7-approximation for MRSOS-dl, and Algorithm 7-2 guarantees a 4-approximation for MRSOS-dl.The major new results contained in this dissertation are as follows:1. A linear-time algorithm to compute the translocation distance between two multi-chromosomal signed genomes is presented. This is the fastest known algorithm for the problem, and improves upon the O(n2logn) algorithm given by D. Zhu and S. Ma in 2002, and the O(n2) algorithm given by L. Wang et al in 2004.2. A linear-time algorithm that computes the length of a shortest sequence of reversals, translocations, fissions and fusions required to transform one genome into another is given. And a O(ny/nlogn) algorithm to compute the rearrangement sequence between two multi-chromosomal signed genomes is obtained. Both problems' previous bound were O{n2). Hence our algorithms are the fastest algorithms for the problems so far.3. The translocation median problem in the unsigned and signed models are explored respectively, approximation algorithms for the two situation are discussed.4. Sequence comparison on the binary strings is studied, approximation algorithms for sorting binary strings by reversals and sorting binary strings by transpositions are designed.5. A problem of protein similarity search under mRNA structural constraints is studied, the complexity and approximation are acquired.
Keywords/Search Tags:genome rearrangement, signed genome, translocation distance, ge-nomic distance, median problem, sorting, protein similarity, approximation algorithm
PDF Full Text Request
Related items