Font Size: a A A

Studies On Algorithms For Genome Comparision

Posted on:2012-11-08Degree:DoctorType:Dissertation
Country:ChinaCandidate:H T JiangFull Text:PDF
GTID:1118330335485270Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Computational genomics is an area dealing with the processing of genomic data and inferring useful biological information from it. The typical problems include computing genomic rearrangement distances, reconstructing ancestral genomes and gene orders from homologies between extant species, and handling redundant or missing genetic information. Many of these problems are NP-hard, so naturally people seek for approximation algorithms in the past. However, due to the nature of this kind application, most of the approximation algorithms may result in inaccurate or false biological information which could be disastrous. So applying parameterized (FPT) algorithms to handle problems in computational genomics is more practical.Computing genomic distance on gene order is a fundamental problem in computational biology. Genome rearrangement is a very complicated process, general global rearrangement operations such as reversals, translocations, fusions, fissions, transpositions and block-interchanges have been proposed to handle gene order. As the development on the problems of sorting genomes by single rearrangement operation, plenty of attentions are paid on combinatorial operations since its similarty to the factual genome evolution events. Then sorting genomes by double cut and join becomes more and more popular in computational genomics.Reconstruction of ancestral genomes and gene orders from homologies between extant species is a long-standing problem in comparative genomics. Due to the availability of more and more sequenced genomes, the bioinformatics method is likely to play a more important role in this process. Recently, the most widely used method is to infer ancestral genome segments, called Contiguous Ancestral Regions (CARs), from syntenic information that are conserved in extant species. The possible CARs are eventually stored in the corresponding PQ-trees. So it is critical to comparing two PQ-trees or comparing one PQ-tree and some sequences for inferring ancestral genomes or CARs.Handling noisy and missing genetic information is an important part of research in computational genomics. One easy example is on ancestral genome reconstruction where some genes exist in earlier generations but completely disappear in extant species or vice versa. It is necessary to identify noisy or redundant gene markers in genomic maps. Motivated by the trend of genome sequencing without completing the sequence of the whole genomes, fill an incomplete multichromosomal genome by pieces of genome segments to obtain a similar genome instead of the orginal one, so the expense of whole genome sequencing will be under control.In this paper, we investigate the sorting unsigned linear genomes by DCJ operations problem, the sorting permutations by short block-moves problem, the minimum common string partition problem, the similarity of PQ-trees under breakpoint distance problem, the maximum strip recovery problem and its complement, the scanffold filling problem and the minimum co-path set problem. For these NP-complete problems, we devise approximation algorithms and fixed-parameter tractable (FPT) algorithms; while for the others, we devise polynomial time exact algorithms.The main contributions of this thesis are as follows:I. On theory of algorithm design.1. Formal definition of weak kernel and its application to FPT algorithm designon computational biology problems.Most of the problems on computational biology are NP-complete, which means that it is unlikely to have polynomial time exact algorithm unless P=NP. Meanwhile, approximation algorithms even with constant approximation factor are not so useful for biologist. On the contrary, FPT algorithm can obtain the optimal solution in acceptable time. Kernalization is a tranditional technique in parameterized complexity that compresses the size of the instances of the problems. Unfortunately, a large quantities of computational biology problems seem unlikely or not obviously to admit a tranditional kernel. Weak kernels compress the search space for these seach problems. A search problem admits a weak kernel when its search space is only depended on the size of the optimum solution, not depended on the size of the input instance. We apply weak kernels to the sorting unsigned multi-chromosome genomes by DCJ operations problem, the sorting unsigned permutations by reversals problem, the complement of maximum strip recovery problem and the minimum co-path set problem, then devise FPT algorithms for these problems.Ⅱ. On genome rearrangement.2. Sorting unsigned linear genomes by the DCJ operations.Double cut and join unifies all the tranditional genome rearrangement operations, and is widely used to compute the genome rearrangement distance. Though the DCJ distance of signed genomes can be computed in polynomial time, as for unsigned genomes, the problem is NP-hard. We devise a 3/2-approximation algorithm by seeking half of the maximal 2-cyclys from the breakpoint graph. Since all the 1-cycles in the breakpoint graph can be reserved, we obtain a weak kernel of size 2K, then propose an FPT algorithm which runs in O*(4k) time.3. Sorting permutations by short block-moves.Block-move is one of the popular operations for genome rearrangement. A short block-move is an operation on a permutation that moves an element at most two positions away from its original position. We devise an exact polynomial time algorithm for sorting a special kind of sub-permutations called umbrella. Then we propose a (1-ε)-approximation algorithm for sorting some special permutations. We obtain a new lower bound of the short block-move distance and prove that the approximation factor of the new algorithm is at most 14/11.4. Minimum common string partition problem.Minimum Common String Partition (MCSP) has drawn much attention due to its application in genome rearrangement. Its parametrized complexity is still open. In this paper, we investigate three variants of MCSP:MCSPc,which requires that there are at most c symbols in the alphabet; d-MCSP, which requires the occurrence of each symbol to be bounded by d; and x-balance MCSP, which requires the length of blocks not being x away from the average length. We show that MCSPc is NP-hard when c≥2. As for d-MCSP, we present an FPT algorithm which runs in O*((d!)2k) time. We also devise an FPT algorithm for the special case x-balance MCSP parameterized on both k and x, wich runs in O*((2x)kk!) time.Ⅲ. Handling noisy and missing information in genomes.5. Maximal strip recovery problem.Given two genomic maps G1 and G2 each represented as a sequence of n gene markers, the maximal strip recovery (MSR) problem is to retain the maximum number of markers in both G1 and G2 such that the resultant subsequences, denoted as G*1 and G*2, can be partitioned into the same set of maximal substrings of length greater than or equal to two. Such substrings can occur in the reversal and negated form. The complementary maximal strip recovery (CMSR) problem is to delete the minimum number of markers from both G1 and G2 for the same purpose, with its optimization goal exactly complementary to maximizing the total number of gene markers retained in the final maximal subsequences. Both MSR and CMSR have been shown NP-hard. We compute a weak kernel of size 18K for CMSR, which is the first non-trivial linear weak kernel bound. Moreover, we present an O(3kn2) FPT algorithm and a 3-approximation algorithm for the CMSR problem.6. Scaffold filling problems.Motivated by the trend of genome sequencing without completing the sequences of the whole genomes, we studied the problem of filling an incomplete multichromosomal genome (or scaffold)I with respect to a complete target genome G such that the resulting genomic distance betwee I' and G is minimized, where I' is the corresponding filled scaffold. We investigate the one-side and two-side scaffold filling problem under the breakpoint distance, the double cut and join distance, the minimum common substring partition and the maximum commom adjacencies for genomes with or without repetitions. We show that the two-sided scaffold filling problem is polynomially solvable for permutations under the breakpoint distance and the DCJ distance. However, when the input genomes contain some genes which appear more than twice, even the one-sided scaffold filling problems become NP-complete. Finally, we devise a 4/3-approximation algorithm for the one-side scaffold filling problem to maximinzed the number of commom adjacencies for genomes with repetitions.Ⅳ. On ancestral genome reconstruction.7. Breakpoint distance between PQ-trees and PQ-tree median problem.The breakpoint distance between PQ-trees is computing the minimum breakpoint distance among all pairs of permutations generated respectively by the two considered PQ-trees. The PQ-tree median problem is to seek a permutation from the PQ-tree that minimizes the sum of the breakpoint distances to the input p permutations. We show that the breakpoint distance between PQ-trees problem is NP-hard. For p≥3, the PQ-tree median problem is trivally NP-hard since it generalizes the classical breakpoint median problem, we show that this problem is fixed-parameter tractable with respect to the value of the breakpoint distance by design an algorithm running in O*(3k) time. If the PQ-tree can be represented by a graph, then the PQ-tree median problem has a close relation with the minimum co-path set problem8. Minimum co-path set problem. Given a simple undirected graph G, a co-path set is a set S of edges in G whose removal leaves a graph in which every connected component is a path. In the minimum co-path set problem, we need to compute a minimum co-path set in G. the minimum co-path set problem is NP-complete since it generalized the Hamiltonian path problem. We show that the minimum co-path set problem admits a weak kernel of size 5K, and devise an FPT algorithm running in O*(6k/2) time.To further study the work stated in this thesis, the author proposes the following future directions:1. Applying weak kernel to other problems, especially to those problems unlikely to have small tranditional kernels.2. Computing the DCJ distance on doubled genomes.3. To obtain a more small kernel (15K) for the CMSR problem.4. The complexity of the PQ-tree median problem for p=1.5. The parameterized complexity of the minimum common string partition problem.6. Fast algorithms and small approximations for the minimum co-path set problem.
Keywords/Search Tags:Weak Kernel, Approximation Algorithm, Fixed-Parameter Tractable
PDF Full Text Request
Related items