Font Size: a A A

Research On Fast Multiple Sequence Alignment Based On Clustering

Posted on:2024-04-16Degree:MasterType:Thesis
Country:ChinaCandidate:J T ChenFull Text:PDF
GTID:2530307079964039Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Multiple Sequence Alignment(MSA)compares the similarities and differences between numerous biological sequences,including RNA,DNA,and protein sequences.Multiple sequence alignment is essential for various sequence analyses,including identifying important sites and phylogenetic analysis.With the development of nextgeneration sequencing technology,the size of biological sequences is exploding exponentially.However,it is difficult for the existing multiple sequence alignment software to handle large-scale sequence data.The problem of multiple sequence alignment based on SP score is an NPC problem,which cannot be solved in polynomial time,so it can only be solved using the approximate algorithm.Currently,most tools adopt the progressive alignment strategy,which first needs to create a guide tree and perform profile-profile alignment;both steps have high time complexity.This thesis mainly focuses on algorithm research and improvement in three aspects of progressive alignment: constructing a guide tree,pairwise alignment,and profile alignment.We also introduce a win-win alignment algorithm combining central and progressive alignment strategies.The main research content of this thesis includes the following aspects:(1)Hierarchical clustering is the common algorithm for constructing an alignment guide tree,but its time complexity is relatively high.In the case of large-scale data,the time of tree construction will be very large.This thesis proposes an algorithm that can quickly construct a guidance tree,significantly reducing the time complexity of constructing alignment guide tree by combining sequence and hierarchical clustering.(2)A novel method of pairwise sequence alignment is proposed to accelerate pairwise sequence alignment with high similarity.First,the algorithm constructs FMindex for the longer sequence in the two sequences,then uses FM-index to find out the common substrings in the two sequences,and then uses the substring length,substring offset,and dynamic programming algorithm to screen out compatible common substrings,and then directly aligns the common substrings.Finally,a dynamic programming alignment algorithm aligns unaligned segments in the sequence.For the profile alignment problem,this thesis first simplifies a profile into a consensus sequence and then applies the algorithm above.(3)This thesis uses clustering to combine the progressive alignment and center star alignment algorithms effectively.It proposes a win-win multiple sequence alignment method,which greatly reduces the alignment’s time complexity and guarantees accuracy.This method solves the problem of the alignment of ultra-large data sets.Finally,this thesis compares the performance of standard alignment software and WMSA2 on different datasets,including alignment accuracy,time,and memory consumption.According to the results,WMSA2 uses less time and memory and guarantees alignment accuracy.
Keywords/Search Tags:Multiple Sequence Alignment, Guide Tree, Progressive Alignment, CenterStar Alignment, FM-index
PDF Full Text Request
Related items