Font Size: a A A

Fragment Topological Order Guided Multiple Sequence Alignment

Posted on:2022-11-13Degree:MasterType:Thesis
Country:ChinaCandidate:X LaiFull Text:PDF
GTID:2480306758489474Subject:Biophysics
Abstract/Summary:PDF Full Text Request
Many in silico analyses depend on multiple sequence alignment(MSA)methods.In practical application,it is often necessary to perform multiple sequence alignment on more than two sequences.Due to the high computational complexity,the strict multiple sequence alignment algorithm is difficult to cope with the alignment requirement of more than 7 sequences with a length of more than 300 bp.As a result,new heuristics are constantly emerging.Mainstream heuristic algorithms need to build a guide tree to determine the order of sequence addition,because in these algorithms,the order of sequence addition affects the final alignment.The necessary step of building a guide tree limits the application of these algorithms to large-scale multiple sequence alignments.On the other hand,various genome projects have been launched,generating massive amounts of genomic data.However,existing heuristic algorithms are also difficult to deal with large-scale multiple sequence alignments between whole genomes.Therefore,a new multiple sequence alignment algorithm is needed,which is developed and designed for the data characteristics of multiple sequence alignment between genomes.Here,we propose a novel heuristic algorithm,Fragment topological order guided multiple sequence alignment(FTO?MSA).The algorithm is mainly divided into two steps,the determination of the alignment architecture and the alignment of the unaligned parts.When determining the general structure of the alignment,we take advantage of the data characteristics of similar sequences,and use the equal-length sequence fragments intercepted by the sliding window as the basic unit of the alignment to quickly align the same nodes between sequences,and the accuracy of alignment is judged by the positional relationship of sequence fragments in the original sequence.The positional relationship judges the accuracy of the alignment.This part of the calculation is completed by constructing a directed acyclic graph with equal-length sequence segments as nodes and front and back position relationships as edges.We refer to this graph as a topological orderdependent fragment graph.Since the structure of the graph can be changed according to the newly added sequence,the final structure of the graph is independent of the sequence joining order,and there is no need to build a guide tree.Then,based on this,sequence alignment is performed on the remaining small unaligned parts.The merging of the same nodes in the composition process plays the role of information deredundancy and reduces repeated calculations.From the two aspects of guiding tree construction and multiple sequence alignment construction,the required calculation is greatly reduced,and the efficiency is improved.In order to test the algorithm,we selected 63,964 SARS-Co V-2 genomes with a length of about 30,000 bp in GISAID as the test set and divided them into different subtest sets,and compared the two methods of the MAFFT algorithm to test the efficiency and accuracy of the algorithm.In the case of long sequence length and large number of sequences,our method has great advantages.In the test,when the sequence length is about 30000 bp and the number of sequences is equal to 30000,the CPU time required by this algorithm is only 1/7 of the FFT-NS-2 method in MAFFT.The sequence alignment of 63,964 SARS-Co V-2 whole genomes can be completed in about 28 hours.In multiple sequence alignment accuracy tests with different numbers of sequences,the results show that our algorithm is comparable to MAFFT in terms of accuracy.
Keywords/Search Tags:Fragment topological order, Bioinformatics, Multiple sequence alignment, Sequence analysis
PDF Full Text Request
Related items