Font Size: a A A

Studies On Algorithms For Genome Scaffold Filling Problem

Posted on:2014-08-12Degree:DoctorType:Dissertation
Country:ChinaCandidate:N LiuFull Text:PDF
GTID:1268330425462096Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Genome scaffold filling problem is a new combinatorial optimization problem in computational genomics. Computational genomics is a field analyzing, modeling and computing genomic data by computer and information technology to acquire biological information. Combinatorial optimization is a problem about finding the optimal solution from the feasible solution of combinatorial problems. These combinatorial problems often have simple description, but their optimal solution is difficultly computed, such as famous Traveling Salesman Problem, Bin Packing Problem, and Graph Coloring Problem. Genome scaffold filling is one of new combinatorial optimization problems.The typical problems in computational genomics include genome sequencing and assembly, genome comparison, and genome restructuring. Previous conclusions show that many of these problems are NP-hard. Genome scaffold filling containing repeated genes is one of them. For these NP-hard problems, the design and analysis of polynomial-time approximation algorithm is more feasible and practical.Biological sequencing technology rapidly develops resulting in the first generation, next generation and third generation sequencing technology. Large scale sequencing data can be obtained within a shorter run and the length, depth and coverage of the sequencing has greatly improved. Currently, most of the complete genome sequence is not directly obtained by biological sequencing but is assembled by assembly software. Because of the coverage of sequencing and the limitation of assembly software, the final assembled sequence is often not complete but the framework of fragments. The typical frameworks include genome framework scaffold and contig. Using the incomplete genome fragments for biological analysis and research will affect the accuracy of the results about biological research. By using scaffold filling technology, the integrity and accuracy of the final genome sequence can be improved and the cost of sequencing whole genome will be reduced.According to the actual demand for increasing the integrity of final genome sequence, H. Jiang et al. abstract the model of genome scaffold filling problem after Sankoff et al. firstly propose the problem, i.e. given a set of genes∑, scaffold A and B with X=c(B)-c(A), Y=c(A)-c(B), compute the distance between A’ and B’ after inserting genes of X in scaffold A and inserting genes of Y in scaffold B to obtain scaffold A’ and B’ where the distance is minimum or maximum. There are five kinds of general distance to measure the similarity of genomes:genome rearrangements distance (Double Cut and Join, DCJ), breakpoint distance, exemplar genomic distance, common string partition distance and common adjacency distance. Except common adjacency distance needs to be maximized, problems based on other distance need to ensure that the distance is minimum.According to whether input scaffolds containing duplicate genes, there are two versions of scaffold filling problem:based on permutation or sequence. If both input scaffolds are incomplete (including missing genes), the problem is called two-sided scaffold filling. If only one scaffold is complete, the problem is called one-sided scaffold filling.This paper mainly studies on the one-sided and two-sided scaffold filling problems to maximize the number of adjacency with duplicate genes. In addition, the paper also studies on the information reduction in sequence assembly, i.e. data from biologic sequencing is correctly reduced to improve the efficiency of sequence assembly.The main contributions of this thesis are as follows: 1. The paper modifies the standard definition of scaffold filling problem to maximize the number of adjacency.In the standard definition of SF-MNCA problem, there are two input scaffolds in any instance. If the original scaffolds are directly used, some missing genes may not be inserted because insertions cannot bring any new adjacency which is the premise of1.33-approximation algorithm and algorithms in this paper. The symbol ’#’ is added to each end of scaffold because it is unfair that each end-gene of scaffold has only one adjacent gene to generate adjacency or breakpoint. It ensures that insertion of each missing gene can generate at least one new adjacency and two final scaffolds have same numbers of adjacency or breakpoint.2. A polynomial time algorithm for the special instance in SF-MNCA problem is proposed.Unsigned scaffold filling problems with duplicated genes based on breakpoint distance, double cut and join distance and adjacency distance have no polynomial time algorithm because they are shown to be NP-Complete problem. But for some special instances of these problems, the polynomial algorithm can be found. In this paper, a kind of special instance for SF-MNCA problem has been found and a polynomial algorithm is designed. For scaffold filling problem, it is very important that more adjacencies can be generated after missing genes are inserted in two scaffolds. That is, each insertion of genes can eliminate as many as breakpoints between two scaffolds. Therefore, the paper focuses on the type of breakpoints. Breakpoints in each scaffold are divided into three sets according to the type of each bp-gene constituting a breakpoint. When there is no this kind of breakpoint in the instance of SF-MNCA problem, a bp-gene appears in the set of missing genes and another bp-gene appears in some breakpoint in the scaffold that these missing genes will be inserted, a polynomial algorithm for SF-MNCA problem can be designed. Its run time is O(n3). The feasibility of the algorithm and optimality of the solution computed by the algorithm are proved in this paper. And the algorithm can ensure each insertion of one gene will generate one new adjacency. The algorithm is the premise of following approximation algorithms.3. A polynomial time approximation algorithm for Two-Sided SF-MNCA (TSSF-MNCA) problem is designed whose factor is improved to1.5.Comparing with the one-sided SF-MNCA problem, two-sided version is more general and difficult. Until now, there is a travial2-approximation algorithm for the problem devised by H. Jiang et al.. By analyzing the characteristics of the two-sided SF-MNCA instances, a bound of the number of common adjacnecies in any optimal solution for the two-sided SF-MNCA problem is obtained. According to the bound, an approximation algorithm with greedy strategy is designed and the1.5-factor is proved.4. A polynomial time approximation algorithm for One-Sided SF-MNCA (OSSF-MNCA) problem is designed whose factor is improved to1.25.For OSSF-MNCA problem, there is a1.33-approximation algorithm which uses greedy strategy. To obtain more adjacency, maximum matching and local searching are used in the approximation algorithm. The more Type-1substring can be inserted, the more adjacencies can be generated. By maximum matching technology, most1-Type-1substring can be found. And the number of2-Type-1substring is increased by local searching technology. Therefore, the factor can be improved to1.25. The proof about the factor uses the bipartite graph method in graph theory to analyze the relationship between optimal solution and feasible solution. By strict formula derivation, the factor of the approximation algorithm is proved to be1.25.5. A transitive reduction algorithm in the Weighted Bi-directed Overlap (WBO) graph is proposed. This problem is about information reduction in sequence assembly procedure before scaffold filling procedure. With the advances in biological sequencing technology, the number of gene fragments directly obtained by biological instrument is massive. The assembly software will work hard for the vast amount of raw data. In order to improve the efficiency of the assembly software, the raw data can be reduced and the reduction must not affect the final assembled sequence. In this paper, a transitive reduction algorithm in the weighted bi-directed overlap graph is proposed which is a common model of graph theory for sequence assembly. Redundant edges are effectively cut and the scale of the graph is simplified. The efficiency of computing the path in the graph is improved. The software is programmed to verify the effectiveness of the algorithm.The following future work mainly contains:1. Whether there are other special instances which can be solved by polynomial time exact algorithm in SF-MNCA problem.2. Computing the common adjacency distance in SF-MNCA problem where each gene appears in each scaffold for2times or3times.3. Research on the approximation algorithm whose factor is less than1.5for TSSF-MNCA problem.4. Research on the approximation algorithm whose factor is less than1.25for OSSF-MNCA problem.5. Research on the approximation algorithm whose factor is less than2for SF-MNCA problem without symbol’#’.6. Research on the assembly algorithm based on graph theory.
Keywords/Search Tags:Scaffold Filling, Approximation Algorithm, Genome, CommonAdjacency
PDF Full Text Request
Related items