Font Size: a A A

Study On Alignment Method Of Biological Multiple Sequences And Key Technologies

Posted on:2023-12-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:H LiuFull Text:PDF
GTID:1520306902453244Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Biological sequence analysis is a crucial task in the field of bioinformatics,and it is the primary way to understand the structure and function of biological macromolecules,as well as the connections and differences in biological evolution.Due to the rapid development of sequencing technology,a large number of biological sequences such as nucleotide sequences have been generated,and it is becoming more urgent to improve the processing capacity of sequence analysis algorithms,especially with the implementation of the 1000 Genomes Projects and Earth BioGenome Projects.Multiple Sequences Alignment(MSA)is an important work in biological sequence analysis,and the results can provide more biological information for the identification and quantification of conserved regions or functional motifs,estimation of evolutionary divergence,and analysis of the ancestral sequence.For large-scale nucleotide sequences,the existing MSA algorithms based on the dynamic programming method have high computational complexity,and the performance of divide-based MSA algorithms are limited to short seeds division resulting in too many and inaccurate positions.Therefore,based on the background of multiple sequence alignment for nucleotide sequences,for the homologous sequence,this dissertation proposes a multiple sequence alignment algorithm based on longer common seed dividing,and improves the key technologies of index of the MSA algorithm.For non-homologous sequence,this dissertation proposes an efficient grouping algorithm,which makes the basic MSA algorithm applicable to more general sequence sets.Specifically,the works and innovations of this dissertation include:1.Multiple sequence alignment method based on longer common seed dividingMultiple sequence alignment needs to align multiple sequences at the same time,and its computational complexity increases with the increase of sequences number and length.However,most classical MSA algorithms are difficult to handle large-scale multiple sequences,especially long sequences.Therefore,some recent aligners adopt an efficient divide-and-conquer strategy to divide long sequences into several short subsequences by vertical or horizontal division,where vertical division can more effectively solve the complexity problem of long sequences.Selecting the common segments(i.e.seeds)is critical to the vertical division of sequences because it directly affects the accuracy and time cost.So we proposed a novel chaining algorithm based on long common seeds,the algorithm is used to improve the performance of multiple sequence alignment method based on vertical division.We use FM-index to extract long common seeds and then find the optimal chain by heuristic method,the sequences can be divided by the chain.Finally,the short sub-sequences are aligned respectively and merged into the final alignment.Experimental results show that our proposed algorithm FMAlign outperforms existing MSA algorithms such as MAFFT,HAlign and FAME when aligning long viral and bacterial genomes and human mitochondrial genomes.2.Long common seeds extraction method for multiple sequencesLong common seeds have fewer occurrences on the sequence and are more likely to be actual matches,so long common seeds extraction is crucial for multiple sequence alignment.At present,the extractions of long common seeds belong to naive algorithms,that is,FM-index or Hash index is used to extract short common seeds of fixed length and then merged them to longer common seeds,and these methods are relatively complex.We propose a new long common seeds extraction algorithm for multiple sequences,bvMEM,to efficiently extract long common seeds with the help of the property that multiple suffixes in lexicographic order have common prefix.The bvMEM is modified and optimized based on FM-index indexed basic data Suffix Array(SA),BurrowsWheeler Transform(BWT),and Longest Common Prefix(LCP).The proposed method quickly determines the common string of multiple sequences according to the SA array and the LCP array,the BWT array is used to verify whether a common string can be continued to extend,thus ensuring that the longest common string(long common seed)is found.To further reduce the index space,the filtering and sampling methods of LCP are designed.The experimental results show that the proposed method has good performance on the multiple sequence datasets,and the calculation speed is at least twice as fast as existing tools.The improved FM-index is applied to the FMAlign,and the performance is improved.3.Multiple sequence grouping method based on common seed metricMultiple sequence alignment algorithms are time-consuming and may not yield biologically plausible alignment when directly processing non-homologous sequences.Therefore,it is necessary to perform homology grouping,and then perform multiple sequence alignment of each sequence group.The existing Multiple sequence grouping can be preliminarily classified into two categories:reference-based and reference-free.The former uses a simple sequences alignment algorithm to group the sequences;the latter uses a greedy algorithm to select the entire sequence of each group,then grouping is performed by sequences alignment,but when applied in multiple sequences with differences is not only time-consuming but can also lead to incorrect groupings.Therefore,this paper proposes a new sequence grouping method,which uses the DBSCAN algorithm to group sequences,it is no longer based on entire sequences,but focuses on the overall closeness of the sequences.To further improve the grouping speed,the number of common seeds is used to measure the similarity,and the bvMEM is modified to quickly count common seeds among multiple sequences.The experimental results show that the proposed algorithm has higher classification accuracy than the popular cd-hit and vserach,and the computation time is comparable with these classical algorithms.
Keywords/Search Tags:Multiple sequences alignment, Multiple sequences grouping, Long common seed, Sequence division, Index technology
PDF Full Text Request
Related items