Font Size: a A A

Algorithms For Sorting By Short Swap And Longest Common Exemplar Subsequence

Posted on:2022-07-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:S ZhangFull Text:PDF
GTID:1488306608472354Subject:Applied Mathematics
Abstract/Summary:PDF Full Text Request
Comparative genomics is an important branch of computational biology.Its core is to establish the connection between different genomes,to determine the genetic relationship between different species,to deduct the evolutionary laws of species,to explore the function of some base sequences,and to help researchers to classify species,etc.The problem of sorting genomes by rearrangements,which asks to compute rearrangement distance between two genomes and find a shortest sequence of rearrangements that transform one genome into the other,is one of the typical problems in comparative genomics.Rearrangements can be formalized as three basic operations:reversal,translocation and transposition.Many studies have shown that genome rearrangement is a general pattern of biological evolution.Dobzhansky and Sturtevant found that the drosophila pseudoobscura genome can be transformed into drosophila miranda genome with 17 reversals.Jeffrey Palmer and his colleagues compared the mitochondrial genomes of Brassica oleracea(cabbage)and Brassica campestris(turnip),which showed that the two genomes just differ in gene order.Due to these,sorting by reversal is verified useful in tracing the evolutionary path between genomes.Reversal frequently occurs in genomes in the form of short swap.Therefore,studying the problem of sorting by short swaps has become one of the hot topics.Finding conserved sets of genes in genomes is another typical problem in comparative genomics.It is an attractive topic in bioinformatics,medical-informatics and computer science and plays an important role in kinds of bioresearch such as seed breeding as well as developing drugs or vaccines.Factors leading to instability of genomes such as rearrangement and mutation occur more often in repetitive regions in a genome,the conservation of a sequence of genes is inconsistent with the occurrence of two or more genes that are repetitive in the sequence.A gene sequence that does not contain duplicate genes is refered to as an exemplar gene sequence.Thus,a set of exemplar genes is usually agreed eligible to stand for a set of conserved genes.Similarity measures of genomes are introduced for measuring how good a conserved gene set is.The number of breakpoints,adjacencies and common subsequences have usually been used as the similarity measure in comparative genomics.Since specific common subsequences imply to highly conserved sets of genes,finding the longest order conserved sequences of exemplar genes in genomes will be helpful for us to find conserved gene sets in genomes and explore the evolution of the organisms.In practice,such as in breeding,sometimes researchers have known that some genes are conserved,this asks effective algorithms to find a longest common exemplar subsequence with genes that have been fixed conserved.Moreover,due to the development of sequencing technology,some genome data have been very perfect.Thus,in practice,the scale of genomes tends to be very large.This requires the algorithm to handle larger data when looking for conserved gene subsequences in genomes.The existence of these issues asks new problem models and algorithms to deal with practical needs.The main contributions and innovation of this thesis are as follows:1.On sorting by rearrangementsSorting by short swaps was first proposed by Heath and Vergara,who also proposed an approximation algorithm that achieves a performance ratio of 2 for this problem,which is the best approximation ratio up to now.The complexity of sorting a permutation by short swaps remains open.In this paper,we study sorting by short swaps with two special kinds of permutations.1.Recognize and sort a happy permutation.We refer to a permutation as happy if it can be transformed into the identity permutation using as many short swaps as one third times the number of its inversions.We present a sufficient and necessary condition to decide for a permutation to be happy or not.Based on this condition,we propose an O(n)time algorithm to verify a permutation for if it is happy,where n is the number of elements in the given permutation.If a permutation is happy,we give an O(n2)time algorithm to find a sequence of as many short swaps as one third times the number of its inversions that transforms it into the identity permutation.Using the condition for a permutation to be happy,we show that there are at least(?)happy permutations of n elements for n? 3.2.Recognize and sort a lucky permutation.A permutation is referred to as lucky if the permutation can be transformed into the identity permutation using as many short swaps as one fourth times the length sum of the permutation's vectors.Heath and Vergara proposed an O(n2)time algorithm to decide whether a permutation is lucky.In this paper,we further propose a sufficient and necessary condition to decide for a permutation to be lucky or not.Based on this condition,we propose an O(n)time algorithm to verify a permutation for if it is lucky.If a permutation is lucky,we give an O(n2)time algorithm to find a sequence of as many short swaps as one fourth times the length sum of the permutation's vectors that transforms the permutation into the identity permutation.Using the condition for a permutation to be lucky,we show that there are at least 2n-4 lucky permutations of n elements for n? 3.II.On finding conserved gene subsequences in genomes3.The longest common exemplar subsequence problemThe longest common exemplar subsequence problem,known as the repetition-free longest common subsequence problem,was first proposed by Adi et al and was proved to be NP-Hard.If in one of the two given genomes,any two genes of the same gene family are in at most s spans,we propose a dynamic programming algorithm to solve it with O(mns2s)space and O(mns4s)time,where m,n stand for the gene numbers of those two given genomes4.The longest common exemplar subsequence with indexed genes problemUnder the situation that some genes are known to be conserved,in motivation of pursuing a longest common exemplar subsequence with genes that have been fixed conserved,we propose a new problem called longest common exemplar subsequence with indexed genes.We propose a dynamic programming algorithm for this problem.Since there is still a lack of effective software that can find conserved subsequences of different genomes,we have developed a Java software based on this dynamic programming algorithm.We run this algorithm on human and gorilla genomes.The results show that the algorithm can reach a solution whose length can always come very close to a real longest common exemplar subsequence of the two given genomes in a short period of time.This shows that our algorithm is very effective for processing real genome data.Moreover,we found some consecutive DNA subsequences that do not overlap with annotated genes in a human chromosome as well as a gorilla chromosome and,which may avail to trace the divergence history of primate species and find new genes or motifs in human as well as gorilla genomes.
Keywords/Search Tags:Algorithm, Comparative genomics, Short Swap, Conserved gene, Exemplar subsequence
PDF Full Text Request
Related items