Font Size: a A A

Research On Algorithms Of Scaffold Filling Problems

Posted on:2014-01-12Degree:MasterType:Thesis
Country:ChinaCandidate:L WangFull Text:PDF
GTID:2248330398460015Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Since the end of last century, studies on algorithms for genome comparison have always been a very active research field. The advancement of random sequencing technology makes the cost of genome sequencing decreased significantly and at the same time leads to incomplete forms of the final released genomes, which are called draft genomes.In some situations, the use of draft genomes for genomic analysis makes many analyses and interpretations tentative and would introduce some errors and leads to particular problems in the comparative study of gene order. Now many algorithms for genome rearrangement require whole genome data, and items whose chromosomal location is unknown cannot be part of the input. So even if scaffolds or contigs of organism are sequenced, we still cannot calculate the rearrangement distance between two genomes. This puts many draft genomes outside the scope of currently available comparison technology, even though these data may be suitable to other goals of genomics. In recent years, Scaffold Filling problem have caused more and more attention of algorithm and complexity researchers.Scaffold Filling Problem is to fill the missing genes into scaffolds and get the complete genomes which are close to the original genomes, using combinatorial methods. So we can make use of these filled complete genomes (The filling method must be biologically meaningful). Munoz and D. Sankoff first studied the breakpoint distance and genomic distance (DCJ distance) for Scaffold Filling problem and designed a polynomial time algorithm. In addition, studies have shown that there exists polynomial time algorithm for Scaffold Filling under the Minimum Permutation Breakpoint Distance (SF-PBD). However, Scaffold Filling to Maximize the Number of Sequence Adjacencies (SF-MNSA) with gene duplication is NP-Hard problem.This paper mainly studies the Scaffold Filling problem to maximize the number of sequence adjacencies. That means a genome contains two chromosome sequences and in each chromosome the same gene may appears multiple times. We try to find a filling algorithm which makes the adjacency number between two filled gene sequences as much as possible. In this paper, we first discuss the problems of the existing greedy algorithm for One-sided SF-MNSA, and then we design and implement an improved-greedy algorithm, which can solve these problems. In addition, we propose a backtracking algorithm and a branch and bound algorithm for One-sided SF-MNSA. Finally, we analyze and compare the results of these algorithms by experiments. The main contributions of this paper are as follows:(1) We put forward an improved-greedy algorithm for One-sided SF-MNSA. It effectively solves two types of special instances and prevents the case that some missing gene can’t be inserted into the scaffold. The time complexity of this improved-greedy algorithm is O(nJ). In addition, experiments are carried out to verify the performance of our algorithm by using large sets of randomly generated input gene sequences.(2) We present a backtracking filling method for One-sided SF-MNSA. It can get optimal solution which means the maximum adjacency number and optimal filled sequence. This solution can be compared with solution of approximation algorithm and used to evaluate other approximation algorithm, which has important significance.(3) We design a branch and bound algorithm for One-sided SF-MNSA. It can effectively reduce the number of enumeration, so as to reduce the time complexity. The time complexity of this branch and bound algorithm is O (n2(n-k)n-m) and the space complexity is O(n2). Experiments show that, although the filling time of branch and bound algorithm is longer than the improved-greedy algorithm, but the result is obviously better. When there are more initial adjacencies between randomly generated I and G, the branch and bound algorithm runs faster.
Keywords/Search Tags:Genome Comparison, Scaffold Filling, NP-Hard, Greedy Algorithm, Backtracking, Branch and Bound
PDF Full Text Request
Related items