Font Size: a A A

Research On Multiple Sequence Alignment Based On The Star Alignment Strategy And Suffix Trees

Posted on:2022-11-15Degree:MasterType:Thesis
Country:ChinaCandidate:J N ChaoFull Text:PDF
GTID:2518306764970619Subject:Software engineering
Abstract/Summary:PDF Full Text Request
The developments in sequencing technologies have enabled unprecedentedly fast sequencing speed and large-scale sequencing capabilities.There is increasing number of data sets that biologists are facing,which is challenging the automated sequence analysis tools.Multiple sequence alignment is one of the basic tasks in the processing of biological sequences and the accuracy of alignment affects the subsequent analyses.The complexity in this problem makes the multiple sequence alignment practitioners try to propose different algorithms to meet the varied challenges.This work is based on HAlign,which is a piece of software previously developed by the author's laboratory,and tries to carry out multiple improvements in time complexity,space complexity,and alignment quality of HAlign.HAlign adopts the star alignment strategy as its framework and utilizes suffix trees to accelerate pairwise sequence alignment.This thesis presents the work that is aimed to improve HAlign in the following aspects: the data structure of the suffix tree,search algorithms corresponding to suffix trees,the screening of search results,and the storage of temporary information in the process of alignment.In order to confirm the effect of the algorithms proposed,this thesis either gives a simple proof or presents some experiments on the implementation of these algorithms.1.Another coding method for nucleotides in DNA/RNA and a corresponding improvement on the data structure of suffix trees make the program store the sequences more effectively and accelerate some operations on the sequences,especially the operations of suffix trees.This thesis also presents a method that utilizes the suffix links in a suffix tree to decrease the average-case complexity of an unsuccessful search to constant.2.The original method for screening the search results of suffix trees is greedy,which does not guarantee a global optimum.In order to avoid the potential risks of mistaken searches,this thesis presents a new method,which consists of three steps,i.e.,construction of a directed graph based on the search results,topological sorting and dynamic programming.The loop invariant of the dynamic programming guarantees a global optimum of the screening,thus reducing the possibility that an inappropriate search result is adopted.In the end,HAlign before and after the improvements are compared in terms of time consumption,space requirements and alignment quality along with several pieces of multiple sequence alignment software that are popular among biologists.It can be concluded that HAlign has benefited from these measures and surpassed the old version in terms of both the time efficiency and the accuracy.And the alignment quality of HAlign for nucleic acid sequences with high similarities is comparable with those pieces of multiple sequence alignment software.
Keywords/Search Tags:Multiple Sequence Alignment, String Algorithms, Suffix Trees, Heuristic Algorithms, Algorithm Optimization
PDF Full Text Request
Related items