Research On Fast Multiple Sequence Alignment Based On Clustering

Posted on:2024-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:J T Chen

Full Text:PDF

GTID:2530307079964039

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Multiple Sequence Alignment(MSA)compares the similarities and differences between numerous biological sequences,including RNA,DNA,and protein sequences.Multiple sequence alignment is essential for various sequence analyses,including identifying important sites and phylogenetic analysis.With the development of nextgeneration sequencing technology,the size of biological sequences is exploding exponentially.However,it is difficult for the existing multiple sequence alignment software to handle large-scale sequence data.The problem of multiple sequence alignment based on SP score is an NPC problem,which cannot be solved in polynomial time,so it can only be solved using the approximate algorithm.Currently,most tools adopt the progressive alignment strategy,which first needs to create a guide tree and perform profile-profile alignment;both steps have high time complexity.This thesis mainly focuses on algorithm research and improvement in three aspects of progressive alignment: constructing a guide tree,pairwise alignment,and profile alignment.We also introduce a win-win alignment algorithm combining central and progressive alignment strategies.The main research content of this thesis includes the following aspects:(1)Hierarchical clustering is the common algorithm for constructing an alignment guide tree,but its time complexity is relatively high.In the case of large-scale data,the time of tree construction will be very large.This thesis proposes an algorithm that can quickly construct a guidance tree,significantly reducing the time complexity of constructing alignment guide tree by combining sequence and hierarchical clustering.(2)A novel method of pairwise sequence alignment is proposed to accelerate pairwise sequence alignment with high similarity.First,the algorithm constructs FMindex for the longer sequence in the two sequences,then uses FM-index to find out the common substrings in the two sequences,and then uses the substring length,substring offset,and dynamic programming algorithm to screen out compatible common substrings,and then directly aligns the common substrings.Finally,a dynamic programming alignment algorithm aligns unaligned segments in the sequence.For the profile alignment problem,this thesis first simplifies a profile into a consensus sequence and then applies the algorithm above.(3)This thesis uses clustering to combine the progressive alignment and center star alignment algorithms effectively.It proposes a win-win multiple sequence alignment method,which greatly reduces the alignment’s time complexity and guarantees accuracy.This method solves the problem of the alignment of ultra-large data sets.Finally,this thesis compares the performance of standard alignment software and WMSA2 on different datasets,including alignment accuracy,time,and memory consumption.According to the results,WMSA2 uses less time and memory and guarantees alignment accuracy.

Keywords/Search Tags:

Multiple Sequence Alignment, Guide Tree, Progressive Alignment, CenterStar Alignment, FM-index

PDF Full Text Request

Related items

1	Multiple Sequence Alignment. Bioinformatics Algorithm
2	Study Of Several Algorithms For Alignment Problem Of Sequence And Sequence Secondary Structure
3	Research On Multiple Sequence Alignment Method Based On Single Molecule Sequencing Data
4	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm
5	Bioinformatics Multiple Sequence Alignment And Phylogenetic Spanning Tree Of Several Techniques And Algorithms
6	Ultra-large Multiple Sequence Alignment Based On Distributed Computing
7	The Design And Implementation Of A Multiple Sequence Alignment Algorithm Based On Suffix Tree Strategy
8	Study Of Multiple Alignment Algorithm For DNA Sequences Based On Graph
9	Study On Alignment Method Of Biological Multiple Sequences And Key Technologies
10	Alignment And Variant Calling For Third Generation Sequencing Data Based On Alignment Skeleton