Ultra-large Multiple Sequence Alignment Based On Distributed Computing

Posted on:2019-08-10

Degree:Master

Type:Thesis

Country:China

Candidate:S X Wan

Full Text:PDF

GTID:2370330626452399

Subject:Computer Science and Technology

Abstract/Summary:

Multiple sequence alignment (MSA) plays a key role in analyzing the structure, function, and evolution of biological sequences, and in other fundamental areas of bioinformatics. With the substantial growth in the volume of next-generation sequencing data, existing MSA methods show significant performance bottlenecks and can fail outright on large-scale data. To address this problem, this thesis proposes a series of algorithms that accelerate multiple sequence alignment on the HDFS storage system and the Spark framework. The work consists of the following parts.

The Smith-Waterman algorithm was implemented for large-scale protein MSA, and a suffix tree/suffix array algorithm was implemented for large-scale nucleic acid MSA. Smith-Waterman is an optimal local alignment algorithm that uses dynamic programming to fill a scoring matrix; its alignments are of high quality, but it is memory-intensive. The suffix tree/suffix array algorithm has O(? ? ????) time complexity and reliable alignment quality, and its space complexity can be reduced to O(?) with the prefix doubling algorithm. The space requirements of both algorithms can be reduced further by using the HDFS distributed storage system, which also provides stable high availability.

The algorithms were then optimized in depth for parallel execution on a Spark cluster, making full use of its elastic loading, memory sharing, and distributed storage. For load balancing, the broadcast of large variables is tuned by an adaptive algorithm, and the number of blocks (partitions) in RDD operators is adjusted to improve network throughput. For memory optimization, data structures are optimized, objects are serialized, and memory reclamation (garbage collection) and cache size are tuned. For distributed storage, HDFS holds the large-scale sequences to improve fault tolerance. On the engineering side, the implementation adopts an object-oriented design with loosely coupled packages, which eases future maintenance and algorithm extension.

Protein and nucleic acid sequence sets of different scales were benchmarked in a single-machine multi-threaded environment and in a cluster environment, and our tool was compared with state-of-the-art MSA software such as MUSCLE. The results show that the large-scale sequence alignment algorithms based on the Spark computing platform outperform the other algorithms in time efficiency, memory efficiency, speedup ratio, and quality of results, which demonstrates the value of this work. Finally, the tool is connected to a high-performance distributed cluster and has been deployed on a website for free access by researchers.
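
The two building blocks named in the abstract can be illustrated on a single machine. The following is a minimal Python sketch, not the thesis implementation: the function names, the scoring parameters (match=2, mismatch=-1, gap=-2), and the linear gap penalty are assumptions chosen for illustration, and nothing here involves Spark or HDFS. `smith_waterman` fills the full dynamic programming scoring matrix, which is why memory grows with the product of the sequence lengths. `suffix_array` uses a simple sort-based variant of prefix doubling, which runs in O(n log² n) time with O(n) extra space; that figure describes this sketch only and is not a claim about the complexity reported in the thesis.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not taken from the thesis.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Full scoring matrix: O(len(a) * len(b)) memory.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def suffix_array(s):
    """Build a suffix array by prefix doubling (sort-based variant)."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial rank: the first character
    k = 1
    while True:
        # Compare suffixes by their first 2k characters using two ranks.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for idx in range(1, n):
            prev, cur = sa[idx - 1], sa[idx]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        # Stop once every suffix has a distinct rank.
        if n == 0 or rank[sa[-1]] == n - 1:
            break
        k *= 2
    return sa


if __name__ == "__main__":
    print(smith_waterman("TTACGTAA", "GGACGTCC"))
    print(suffix_array("banana"))
```

In a distributed setting, pairwise alignments such as these are natural units of work to spread across partitions, but how the thesis partitions and schedules them is not specified in the abstract.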

Keywords/Search Tags:

Spark, Smith-Waterman, Suffix Tree, Suffix Array, Multiple Sequence Alignment

Related items

1	Studied On Gene Sequence Alignment Based On Mixed Suffix Tree And Suffix Array
2	The Design And Implementation Of A Multiple Sequence Alignment Algorithm Based On Suffix Tree Strategy
3	Multiple Sequence Alignment. Bioinformatics Algorithm
4	LM-Suffix: Research On Gene Sequence Index Structure Based On Suffix Tree
5	Construction Of DNA Sequence Phylogenetic Tree Based On Suffix Tree
6	Research Of Genome Data Compression Algorithm Based On Reference Sequence And Suffix Array
7	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm
8	Alignment-free Sequence Similarity Analysis And Clustering Algorithms On Biological Sequences
9	Biological Sequence Alignment Algorithm And A Comparative Study
10	Improvement Of Smith Waterman Gene Sequencing Algorithm And Research On Hardware Acceleration Method