Font Size: a A A

Algorithm Analysis And Design Of Gene Tandem Duplication

Posted on:2022-10-16Degree:MasterType:Thesis
Country:ChinaCandidate:W Z TianFull Text:PDF
GTID:2480306608471844Subject:Master of Engineering
Abstract/Summary:PDF Full Text Request
The transformation of gene sequences contains basic operations such as duplication,rotation,and flip,which lead to the transformation of gene arrangement order on gene sequences.The gene duplication problem is one of the classical problems in bioinformatics,which is to explore the process of duplication transformations between different gene sequences and calculate their minimum number of duplication.The minimum number of duplication means that the interconversion between two gene sequences can be accomplished by the minimum number of gene duplication operations,which is of great scientific importance for inferring the evolution of biological species,heritability and variation of organisms,and for obtaining the parentage between similar species.The study of the problem of different tandem duplication of genetic sequences started in the 20th century.The natural language problem,the so-called duplication system,emerged in 1984:given a string X,a tandem duplication is used starting from S(any number of them can be used),i.e.,a duplication rule of the form ASB?ASSB,where S can be any substring of X(one such operation will generate a duplicate segment in the new X,i.e.,the substring SS.)The problem of determining the complexity of computing the tandem duplication distance between two given strings,i.e.,whether the tandem duplication distance can be computed in polynomial time,was first raised by Leupold et al.in 2004.Afterwards,tandem duplication was shown to be NP-hard for the case of unbounded symbol sets.Subsequently,Ferdinando Cicalese and Nicolo Pilati significantly improved this result and showed that the tandem duplication distance problem is already NP-hard for strings on symbol sets of size less than or equal to 5,2019 Binhai Zhu et al.obtained a hard result for W[1]-hardness that may be of independent interest with a new problem for cost effective subgraphs and showed that computing the exemplary tandem duplication distance between two strings is fixed-parameter tractable..In this paper,we focus on the generation of the minimum number of duplication for suffix duplication,tandem duplication and the minimum copy number generation problem under tandem duplication and complete the design and program implementation of the relevant algorithms by designing a reasonable algorithm to judge the duplication operation process between two gene sequences,and finally analyze the realized results scientifically.The main research of this paper is as follows.(1)For the suffix duplication problem:an algorithm is designed to perform deletion using an exhaustive strategy under the condition of constraints.We use the concepts and properties of adjacency,breakpoint,and suffix duplication to determine the generation of sequences.Each deletion operation is guaranteed to be optimal based on matching the sequence generation with the value of the breakpoint.The cycle is repeated to find the breakpoints that meet such conditions and remove them,and eventually a sequence without breakpoints will be obtained.By comparing it with the original sequence,the judgment result is obtained and the correctness and optimality of the algorithm is proved.(2)For the problem of tandem duplication with constraints,an algorithm is designed to perform deletion using an exhaustive strategy.By restricting the duplication of certain sequence segments as a way to ensure that a non-zero breakpoint is generated after each duplication and to eliminate all breakpoints between sequences,the deletion process is documented and the feasibility of the algorithm is verified through the program;for genomic sequences containing different number of elements and containing duplicates obtained by tandem duplication,a deletion algorithm based on greedy strategy is proposed to select the best positioned duplicate fragment characters for deletion in different cases:according to the greedy algorithm,the breakpoint with the largest value of the current breakpoint is selected for deletion operation each time,and the deletion is verified by the program.(3)For the problem of The Minimum Copy Number Generation for genome alignment under tandem duplication,we classify the copy number profile(CNP)into two types:single-peaked and multi-peaked.For the single-peaked case,a recursive algorithm is given to solve the problem optimally in linear time.The two numbers with the smallest values at these two ends and the segments of numbers covered by them are selected for the division operation according to the given gene copy number profile(CNP).The recursive loop is performed until a CNP with all values of 1 is obtained.Also for CNPs with multiple peaks,the analysis yields an upper bound on the number of runs of the algorithm.The main innovations of this paper are1.The "value of breakpoints" in gene sequences is discovered and defined,which leads to the design of a deletion algorithm for the suffix duplication problem.The properties related to suffix duplication are proved and a proof of the correctness of the algorithm is given.2.A deletion algorithm based on an exhaustive strategy is designed and implemented to eliminate all non-zero breakpoints in a sequence by adding constraints to tandem duplication.The feasibility of the algorithm is verified using the program.3.A greedy deletion algorithm is designed to remove the duplicate fragment information generated by tandem duplication of gene segments and verified by the program.4.A recursive algorithm is designed to optimally solve the tandem duplication-minimum copy number generation problem in linear time such that all values in the CNP become 1 and its optimality is proved.
Keywords/Search Tags:adjacency, breakpoint, tandem duplication, suffix duplication, CNP
PDF Full Text Request
Related items