Research On Multiple Sequence Alignment Based On The Star Alignment Strategy And Suffix Trees

Posted on:2022-11-15

Degree:Master

Type:Thesis

Country:China

Candidate:J N Chao

Full Text:PDF

GTID:2518306764970619

Subject:Software engineering

Abstract/Summary:

PDF Full Text Request

The developments in sequencing technologies have enabled unprecedentedly fast sequencing speed and large-scale sequencing capabilities.There is increasing number of data sets that biologists are facing,which is challenging the automated sequence analysis tools.Multiple sequence alignment is one of the basic tasks in the processing of biological sequences and the accuracy of alignment affects the subsequent analyses.The complexity in this problem makes the multiple sequence alignment practitioners try to propose different algorithms to meet the varied challenges.This work is based on HAlign,which is a piece of software previously developed by the author's laboratory,and tries to carry out multiple improvements in time complexity,space complexity,and alignment quality of HAlign.HAlign adopts the star alignment strategy as its framework and utilizes suffix trees to accelerate pairwise sequence alignment.This thesis presents the work that is aimed to improve HAlign in the following aspects: the data structure of the suffix tree,search algorithms corresponding to suffix trees,the screening of search results,and the storage of temporary information in the process of alignment.In order to confirm the effect of the algorithms proposed,this thesis either gives a simple proof or presents some experiments on the implementation of these algorithms.1.Another coding method for nucleotides in DNA/RNA and a corresponding improvement on the data structure of suffix trees make the program store the sequences more effectively and accelerate some operations on the sequences,especially the operations of suffix trees.This thesis also presents a method that utilizes the suffix links in a suffix tree to decrease the average-case complexity of an unsuccessful search to constant.2.The original method for screening the search results of suffix trees is greedy,which does not guarantee a global optimum.In order to avoid the potential risks of mistaken searches,this thesis presents a new method,which consists of three steps,i.e.,construction of a directed graph based on the search results,topological sorting and dynamic programming.The loop invariant of the dynamic programming guarantees a global optimum of the screening,thus reducing the possibility that an inappropriate search result is adopted.In the end,HAlign before and after the improvements are compared in terms of time consumption,space requirements and alignment quality along with several pieces of multiple sequence alignment software that are popular among biologists.It can be concluded that HAlign has benefited from these measures and surpassed the old version in terms of both the time efficiency and the accuracy.And the alignment quality of HAlign for nucleic acid sequences with high similarities is comparable with those pieces of multiple sequence alignment software.

Keywords/Search Tags:

Multiple Sequence Alignment, String Algorithms, Suffix Trees, Heuristic Algorithms, Algorithm Optimization

PDF Full Text Request

Related items

1	Multiple Sequences Alignment Based On A-Star And DiAlign Algorithms
2	Dna Sequence Alignment Based On Intelligent Computing Research
3	Research On Multiple Sequence Alignment Algorithms In Bioinformatics
4	Research Of Multiple Sequence Alignment Based On Genetic Algorithm
5	Research On Models And Algorithms For Several Key Problems In Sequence Mining
6	The String Pattern Searching Algorithms Based On Suffix Arrays
7	Research On Protein Multiple Sequence Alignment Algorithms And Assessment Of Their Performance
8	Research On Multiple Sequence Alignment Algorithm Of Bioinformatics
9	The Study Of Optimization Algorithm For Multiple Sequence Alignment
10	With Modern Optimization Algorithm Cdma Multiuser Detector