Parallel Algorithm For Multiple Longest Common Subsequence And Application Research On Hadoop Platform

Posted on: 2016-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Zong
Full Text: PDF
GTID: 2348330485999988
Subject: Software engineering
Abstract/Summary:
Computing the longest common subsequence (LCS) has important applications in bioinformatics, information retrieval, content matching, and other fields. It is an optimization problem and is quite time-consuming. When the set of sequences is large, finding the longest common subsequence quickly and accurately is an important research topic. This thesis presents the design and implementation of a parallel multiple longest common subsequence algorithm on the Hadoop platform and applies the algorithm to text similarity calculation.

On the Hadoop-based distributed parallel platform, the sequence data is partitioned, the storage and transfer of multi-sequence data in HDFS is designed, and the number of Mapper/Reducer tasks is reduced in order to cut data transfer during parallel computation, so that the load is balanced across the nodes while they compute the longest common subsequence. Using the MapReduce programming model, we designed and implemented a parallel algorithm for the longest common subsequence of multiple sequences. Experimental results show that the proposed parallel algorithm solves the multiple longest common subsequence problem effectively, with an obvious speedup.

On the same platform, the multiple longest common subsequence algorithm is applied in parallel to the text similarity problem. Text is segmented by chapter and the segments are assigned to different Map tasks to reduce the transmission of text data. The Map and Reduce functions and the sequence allocation method are designed to assign the text data set to the compute nodes, which compute text similarity in parallel. Using the MapReduce sorting mechanism, the top-k sub-texts are output in order of similarity, which makes text classification possible. Experimental results show that, compared with an algorithm that computes text similarity from edit distances, the parallel text similarity algorithm is efficient and achieves high text classification accuracy.
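
As a rough illustration of the ideas in the abstract, the following single-machine Python sketch shows the textbook dynamic-programming recurrence for the longest common subsequence, its direct extension to several sequences, and an LCS-based similarity score ranked top-k in a map/reduce style. It is not the thesis's Hadoop implementation: the function names, the normalization `2 * LCS / (len(a) + len(b))`, and the plain-Python `map_phase` / `reduce_top_k` helpers are all assumptions made for the example, and the multi-sequence recurrence is only practical for a few short sequences.

```python
from functools import lru_cache
from heapq import nlargest


def lcs_length(a, b):
    """Two-sequence LCS length via the classic DP, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def mlcs_length(seqs):
    """LCS length of several sequences (same recurrence, one index per sequence).

    The state space grows as the product of the sequence lengths, so this is
    an illustration for small inputs only.
    """
    seqs = tuple(seqs)
    if not seqs:
        return 0

    @lru_cache(maxsize=None)
    def go(idx):
        if any(i == 0 for i in idx):
            return 0
        last = [s[i - 1] for s, i in zip(seqs, idx)]
        if all(c == last[0] for c in last):
            # All sequences end in the same symbol: it extends the LCS.
            return go(tuple(i - 1 for i in idx)) + 1
        # Otherwise drop the last symbol of one sequence at a time.
        return max(
            go(idx[:d] + (idx[d] - 1,) + idx[d + 1:]) for d in range(len(idx))
        )

    return go(tuple(len(s) for s in seqs))


def similarity(a, b):
    """Hypothetical normalization: 1.0 for identical texts, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def map_phase(query, chunks):
    """Stand-in for Map tasks: score each (chunk_id, text) pair independently."""
    return [(similarity(query, text), chunk_id) for chunk_id, text in chunks]


def reduce_top_k(scored, k):
    """Stand-in for the sort/Reduce step: keep the k most similar chunks."""
    return nlargest(k, scored)


if __name__ == "__main__":
    chunks = [("c1", "hadoop mapreduce"), ("c2", "xyz"), ("c3", "hadoop")]
    for score, chunk_id in reduce_top_k(map_phase("hadoop", chunks), k=2):
        print(chunk_id, round(score, 3))
```

Because each chunk is scored independently of the others, the scoring step can be distributed across workers without coordination, which is the property the abstract relies on when it assigns text segments to separate Map tasks.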
Keywords/Search Tags: Multi Sequence, Common Subsequence, Text Similarity, Parallel Computing, Hadoop