Ultra-large Multiple Sequence Alignment Based On Distributed Computing

Posted on: 2019-08-10 | Degree: Master | Type: Thesis
Country: China | Candidate: S X Wan | Full Text: PDF
GTID: 2370330626452399 | Subject: Computer Science and Technology
Abstract/Summary:
Multiple sequence alignment (MSA) plays a key role in the analysis of biological sequences and in the study of structure, function, evolution, and other fundamental areas of bioinformatics. With the substantial growth in the scale of next-generation sequencing data, existing multiple sequence alignment methods show significant performance bottlenecks on large-scale data and in some cases cannot process it at all. To address this problem, this thesis proposes a series of algorithms that accelerate multiple sequence alignment on the HDFS storage system and the Spark framework. The thesis contains the following parts.

Algorithms: the Smith-Waterman algorithm was implemented for large-scale protein multiple sequence alignment, and the suffix tree/suffix array algorithm was implemented for large-scale nucleic acid multiple sequence alignment. Smith-Waterman is an optimal local alignment algorithm that uses dynamic programming to compute a scoring matrix; its alignments are of high quality, but it consumes a large amount of memory. The suffix tree/suffix array algorithm has O(? ? ????) time efficiency and reliable alignment quality, and its space complexity can be reduced to O(?) by the prefix doubling algorithm. The memory footprint of both algorithms can be further reduced by storing data in the HDFS distributed storage system, which also provides stable high availability.

Parallel optimization: the implementation is tuned in depth for the Spark cluster environment to exploit Spark's elastic loading, in-memory sharing, and distributed storage. For load balancing, the broadcast of "large variables" is adjusted by an adaptive algorithm, and the number of blocks used by RDD operators is tuned to improve network throughput. For memory optimization, data structures are optimized, objects are serialized, and memory reclamation and cache size are tuned. For distributed storage, HDFS is used to store large-scale sequences to improve fault tolerance.

Engineering: the implementation adopts an object-oriented design with loosely coupled packages, which eases future maintenance and the addition of new algorithms.

Evaluation: protein and nucleic acid sequence sets of different scales are compared in a single-machine multi-threaded environment and in a cluster environment, and our tool is compared with state-of-the-art multiple sequence alignment software such as MUSCLE. The results show that the large-scale sequence alignment algorithms based on the Spark computing platform outperform the compared algorithms in time efficiency, memory efficiency, speedup ratio, and result quality, which demonstrates the value of this work. Finally, the tool is connected to a high-performance distributed cluster and has been deployed on a website for free access by researchers.
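For readers unfamiliar with the two core techniques named in the abstract, the following is a minimal single-machine Python sketch. It is not the thesis's Spark implementation. The function names `smith_waterman_score` and `suffix_array`, the match/mismatch/gap scores, and the linear gap penalty are all assumptions chosen for illustration; protein alignment in practice normally uses a substitution matrix. The suffix array function uses prefix doubling with a comparison sort, so it runs in O(n log² n) time, which is simpler but slower than radix-sort variants.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (illustrative scoring values).

    Only two rows of the dynamic programming matrix are kept, so memory is
    O(len(b)) instead of O(len(a) * len(b)). No traceback is produced.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def suffix_array(s):
    """Suffix array of s by prefix doubling (sort-based, O(n log^2 n))."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial rank: the first character
    k = 1
    while n > 1:
        # Sort suffixes by their first 2k characters, using the ranks of the
        # first k characters and of the k characters that follow them.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for idx in range(1, n):
            bump = 1 if key(sa[idx]) != key(sa[idx - 1]) else 0
            new_rank[sa[idx]] = new_rank[sa[idx - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACGT"))
    print(suffix_array("banana"))
```

In a distributed setting, independent pairwise scoring calls such as `smith_waterman_score` are the kind of unit of work that can be spread across workers, while an index built over a shared reference sequence is the kind of object that would be broadcast; how the thesis actually partitions the work is not specified in the abstract.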
Keywords/Search Tags:Spark, Smith-Waterman, Suffix Tree, Suffix Array, Multiple Sequence Alignment