Font Size: a A A

Research On Algorithms Of Sorting Unsigned Genomes By Cut-And-Paste Operations

Posted on:2011-02-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:X W LouFull Text:PDF
GTID:1118360305450550Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The genetic information is recorded by chromosomes of the species. A chromosome consists of a serial of genes with different genes determining different characteristics of the species, and a genome is a set of chromosomes. With an increasing number of sequenced genomes, it is now possible to compare whole genomes. Different species often share similar genes that were inherited from common ancestors, and large scale mutations are observed in more distantly related genomes. It is widely accepted that the number of genome rearrangement needed to transform one genome into another is a measure of evolutionary distance between two species. Rearrangement of gene sequences is an important mode of biological evolution as well as the main reason for the diversity of plants, mammals and bacteria.As an important tool to study gene sequence and gene family evolution, many mathematical models based on genome rearrangement are proposed. And the most common approach is to infer a sequence of rearrangements under the assumption of parsimony, motivating the following combinatorial problem:Given two genomes that have the same gene content, compute a shortest sequence of rearrangements that are needed to transform one genome into the other. The length of such a sequence is called the genomic distance between these two genomes. Common rearrangement operations include reversals, transpositions and transreversals etc. These operations frequently happen on a single chromosome during evolution. They have a common feature of first cutting a segment off the chromosome, possibly reversing it and then pasting it back to the remaining chromosome. If the positions of cut and paste are the same and the segment cut is reversed, call this operation a reversal. If the positions of cut and paste are different, call it a transreversal when the segment cut is reversed or a transposition when the segment cut is not reversed. Therefore, a general term called cut-and-paste is proposed to represent all these operations, i.e., a cut-and-paste operation can be a reversal, a transposition, or a transreversal. Using a permutation to express a chromosome or a genome, the problem of genome rearrangement can be viewed as the problem of sorting permutations by cut-and-paste operations. Specifically speaking, giving two permutations A and B, find a shortest sequence of cut-and-paste operations that transforms A to B. And the number of operations used during this process is called the distance between A and B.Biologists derive gene orders either by sequencing entire genomes or by comparative physical mapping. Sequencing provides information about directions of genes and allows one to represent a genome by a signed permutation. However, sequencing of entire genomes is somewhat expensive and, at the moment, complete genomic sequence are known only for relatively short viral and organelle DNA. As a result, many experimental data on gene orders are based on comparative physical maps. Physical maps usually do not provide information about directions of genes and therefore lead to representation of a genome as an unsigned permutation. Thus two categories of combinatorial problem are derived:sorting signed or unsigned permutations by shortest rearrangement operations.Recently, the problem of genome rearrangement has gain much attention from both computer scientists and biologists, and many research results emerged. For the problem of sorting by reversals, Hannenhalli and Pevzner solved the signed version using breakpoint graph. They designed an exact polynomial time algorithm with time complexity O(n4). And the breakpoint graph has become a useful tool for solving this kind of problems. Currently the reversal distance can be computed in linear time. However, the problem of sorting unsigned permutations by reversals is hard to solve. It is proved to be NP-hard, and the best performance ratio known is 1.375.For the problem of sorting by transpositions, only unsigned permutation is discussed since this operation never changes gene's directions. The complexity of this problem is still open, several 1.5-approximation algorithms are proposed. And currently the best performance ratio is 1.375.All the operations mentioned above are single operations. In fact, several kinds of rearrangement operations may be involved during evolution. And most research results on this problem are for signed genome. Gu et al. gave a 2-approximation algorithm for sorting by transpositions and transreversals. Lin and Xue introduced the operation revrev, and gave a 1.75-approximation algorithm for sorting by reversals, transpositions, transreversals and revrevs. Hartman and Sharan gave a 1.5-approximation algorithm for sorting by transpositions and transreversals. Eriksen assigned weight 1 to reversal and 2 to transposition/transreversal, and designed a (1+ε)-approximation algorithm for sorting by weighted operations. However, for unsigned genome, only a 3-approximation algorithm and an improved version with performance ratio 2.8386+δare available, both allow reversals and transpositions.Cranston et al. first studied the bounds for cut-and-paste sorting of permutations. They showed that every unsigned linear permutation of n elements can be sorted by at most(?)2n/3(?) cut-and-paste operations. This result can be extended to unsigned circular permutations. But their algorithm does not give any approximation guarantee.This thesis mainly focuses on unsigned genome. The major results contained in this thesis are as follows:(1) For the problem of sorting unsigned permutations by cut-and-paste operations, we design a 2.25-approximation algorithm for circular permutations. This is the best known performance ratio for this problem. We identify in the breakpoint graph a structure called tie to denote the alternating path of length 5. Ties can be classified into 6 types and each type has its way of removing breakpoints so that every cut-and-paste operation performed can remove at least 4/3 breakpoints averagely. Since the upper and lower bound we used only involve one variable b(π), traditional discussions on the number of cycle decomposition is avoided, and a better ratio is available. If the operation revrev is allowed, the algorithm can also be used to sort linear permutations with the same performance ratio.(2) We rectify the polynomial time algorithm for the reversal sorting problem on unsigned permutations with limited singletons. After carefully analysis on the effect of flipping a canonical 2-strip, we restrict the 2-strips that are flipped in the original algorithm to a subset and accordingly update the definition of optimal spin. Thus the failed case in the original algorithm is solved. (3) We study the problem of sorting unsigned permutations by weighted cut-and-paste operations, where the weight proportion between reversals and transpositions/transreversals is 1:2. Using the weighted distance formula for signed permutations, we discover that for those singleton-free permutations, the optimal spin for reversal sorting is also the optimal spin for weighted sorting. Therefore, for the weighted problem of sorting unsigned permutations with limited singletons, we get the same performance ratio with signed permutations.The thesis has the following structure:Chapter 1 introduces several genome rearrangement problems, including the basic definitions, notations, fundamental results and research situation for each problem and its sub-problems.Chapter 2 studies the problem of sorting unsigned circular permutations by cut-and-paste operations. We first identify in the breakpoint graph two important structures called ties and folds. Based on detailed classification and careful study of the crossing relationship between ties and folds, we design polynomial time approximation algorithm with performance ratio 2.25. At the end of the chapter, we show how to use our algorithm to sort linear permutations with the same performance ratio if another operation revrev is allowed.Chapter 3 rectifies the polynomial time algorithm for the reversal sorting problem on unsigned permutations with limited singletons, which was designed by Hannenhalli and Pevzner. We point out a failed case in their algorithm and then give the new algorithm and proofs.Based on Chapter 3, Chapter 4 studies the weighted sorting problem on unsigned permutations, where the weight proportion between reversals and transpositions/ transreversals is 1:2. Using Eriksen's (1+ε)-approximation algorithm on signed permutations and the weighted distance formula, we design a (1+ε)-approximation algorithm for the weighted sorting problem on unsigned permutations with limited singletons.
Keywords/Search Tags:Genome Rearrangement, Reversal, Transposition, Transreversal, Cut-and-Paste, Permutation, Approximation Algorithm
PDF Full Text Request
Related items