Font Size: a A A

Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Generalized Suffix Tree Construction

Posted on:2018-04-15Degree:MasterType:Thesis
Country:ChinaCandidate:C GuoFull Text:PDF
GTID:2428330512998205Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Generalized suffix tree is a data structure for indexing strings.It is not only essential for efficiently solving many string-related algorithmic problems,such as substring matching,finding longest common substrings,etc,but also widely used in a variety of applications such as bioinformatics,code clone detection,language model,sequential pattern mining,and so on.However,generalized suffix tree is hard to build efficiently because its size is much larger than the input strings,and the construction procedure suffers from intensive computation and high I/O cost.Such problems become especially severe in big data era.These facts above have made generalized suffix tree construction problem a hot topic among researchers.In the past few decades,many suffix tree construction algorithms have been proposed.But most existing algorithms have three main drawbacks.First,they don't construct the generalized suffix tree directly.Instead,they build a suffix tree-a special case of generalized suffix tree-then transform it to the generalized suffix tree.This approach takes less consideration of the unique features of generalized suffix tree construction problem,such as the diverse form of input strings.Second,although great progress have been made during the past few decades,there are still several challenges to overcome such as I/O optimization,data scalability and performance stability.Therefore they can't handle today's large-scale input.Third,many algorithms are designed for single-machine or supercomputer.In big data senerios,single-node based algorithms are unable to handle large-scale input while supercomputer-based algorithms are usually too expensive for many users.In the big data era,cluster-based parallel computing platforms have already become the mainstream solution for many large-scale problems since they are easy to build,use,maintain and extend.But there is still a lack of an efficient and scalable algorithm for laege-scale generalized suffix tree construction,and it is challenging to design such an algorithm.On the one hand,the algorithm should be designed for large-scale input and distributed scenarios to take advantage of distributed storage and parallel computing.On the other hand,the algorithm should have good flexibility that can efficiently process different forms of input(such as massive short strings,few very long strings,etc),and have good performance and scalability.This paper presents LMdst,a parallel generalized suffix tree construction algorithm designed for distributed scenarios which can handle input in different forms and scale.The following are the primary contributions of this paper:(1)We propose a novel algorithm LMdst,an efficient scalable distributed parallel generalized suffix tree construction algorithm based on Lcp-range multiway merging.(2)We propose a parallel subtree partitioning method based on trie,which can efficiently count frequency of subsequences in order to partition the whole generalized suffix tree construction task.Furthermore,we design an input-segment-based data partitioning mechanism that can guarantee similar performance with different forms of input.(3)We propose an efficient sequence compare method based on Lcp-range structure,and a fast parallel sequence sorting method based on Lcp-range multiway merging.Based on these two methods,we propose a batched suffix sub-tree construction algorithm optimizing both computation and I/O.(4)We propose an efficient suffix sub-tree construction task assignment method by formulating it as a mix of bin-packing and number-partitioning problems,thus can not only fully exploit each computing unit,but also achieve good load balance between computing units.(5)We analyse computation complexity and I/O cost of LMdst,then implement LMdst on Apache Spark,and finally evaluate it from performance,scalability and flexibility aspects.Results show that LMdst achieves good performance,quasi-linear scalability and good flexibility.Because of its excellent performance,LMdst has won first prize in programming contest section in the third "National Contest on Application and Innovation of Cloud Computing" hosted by Ministry of Education,Science and Technology Development Center in April 2017.
Keywords/Search Tags:Big Data, Generalized Suffix Tree, Distributed and Parallel Algorithm, Multiway Merging
PDF Full Text Request
Related items