Parallel Algorithm For Multiple Longest Common Subsequence And Application Research On Hadoop Platform

Posted on:2016-06-24

Degree:Master

Type:Thesis

Country:China

Candidate:W Z Zong

GTID:2348330485999988

Subject:Software engineering

Abstract/Summary:

Computing the longest common subsequence (LCS) has important applications in bioinformatics, information retrieval, content matching, and other fields. It is an optimization problem and is quite time-consuming. When the set of sequences is large, finding the longest common subsequence quickly and accurately is an important research topic. This paper presents the design and implementation of a parallel algorithm for the multiple longest common subsequence problem on the Hadoop platform, and applies the algorithm to text similarity calculation.

On the Hadoop-based distributed parallel platform, the sequence data are partitioned, the storage and transfer of multi-sequence data in HDFS are designed, and the number of Mapper/Reducer tasks is reduced to cut data transfer during parallel computation, so that the load is balanced across the nodes while the longest common subsequence is computed. Using the MapReduce programming model, we designed and implemented a parallel algorithm for the longest common subsequence of multiple sequences. Experimental results show that the proposed parallel algorithm solves the multiple longest common subsequence problem effectively, with an obvious speedup.

The parallel multiple longest common subsequence algorithm is then applied on the Hadoop platform to the text similarity problem. Text is segmented by chapter, and the segments are assigned to different Map tasks to reduce the transmission of text data. The Map and Reduce functions and the sequence allocation method are designed to distribute the text data set across the compute nodes, which compute text similarity in parallel. Using the MapReduce sorting mechanism, the top-k sub-texts are output in order of similarity, which makes text classification possible.
Experimental results show that, compared with an algorithm that computes text similarity from edit distances, the parallel text similarity algorithm is efficient and achieves high text classification accuracy.
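The abstract does not give the algorithm itself, so the following is only a minimal single-machine sketch of the ideas it describes, not the thesis's Hadoop implementation. It assumes an exact memoized recursion for the multiple-sequence LCS length (practical only for small inputs), a hypothetical similarity score of 2 * LCS / (len(a) + len(b)), and plain Python functions (`map_chunk`, `top_k`) standing in for the Map step and the sorted top-k output. All function names and the normalization are illustrative assumptions.

```python
from functools import lru_cache
import heapq


def mlcs_length(seqs):
    """Length of the longest common subsequence shared by all sequences.

    Exact memoized recursion over index tuples. The state space grows with
    the product of the sequence lengths, so this is for small inputs only.
    """
    seqs = tuple(seqs)
    if not seqs or any(len(s) == 0 for s in seqs):
        return 0

    @lru_cache(maxsize=None)
    def solve(pos):
        # pos[i] is the length of the prefix of seqs[i] still under consideration.
        if any(p == 0 for p in pos):
            return 0
        last = [s[p - 1] for s, p in zip(seqs, pos)]
        if all(c == last[0] for c in last):
            # Every prefix ends in the same symbol: it extends the common subsequence.
            return 1 + solve(tuple(p - 1 for p in pos))
        # Otherwise at least one prefix's last symbol is unused: try dropping each.
        return max(
            solve(pos[:i] + (pos[i] - 1,) + pos[i + 1:])
            for i in range(len(pos))
        )

    return solve(tuple(len(s) for s in seqs))


def lcs_similarity(a, b):
    """Assumed normalization: 2 * LCS(a, b) / (len(a) + len(b)), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2.0 * mlcs_length([a, b]) / (len(a) + len(b))


def map_chunk(query, chunk_id, chunk):
    """Stand-in for a Map task: score one text segment against the query."""
    return (lcs_similarity(query, chunk), chunk_id)


def top_k(query, chunks, k):
    """Stand-in for the sorted output: the k most similar segments, best first."""
    scored = (map_chunk(query, cid, text) for cid, text in chunks.items())
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    print(mlcs_length(["abcde", "ace", "aXcYe"]))
    segments = {"c1": "abcd", "c2": "abxx", "c3": "zzzz"}
    print(top_k("abcd", segments, 2))
```

In a real Hadoop job each segment would be scored by a separate Map task and the framework's shuffle and sort would order the results; here `heapq.nlargest` plays that role for illustration.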

Keywords/Search Tags:

Multi Sequence, Common Subsequence, Text Similarity, Parallel Computing, Hadoop

Related items

1	The Research On Algorithms For The Longest Common Subsequence Problem And Variants
2	Study On Parallel Algorithms For Longest Common Subsequence On Heterogeneous Cluster Computing Systems
3	Approximate Longest Common Subsequence Query Processing And Optimization On Biological Sequence
4	The Research On The Longest Common Subsequence Query Algorithm
5	Research On Chinese Person Name Disambiguation Algorithm
6	Research On Similar Path From Software Execution Traces
7	Sequence Inherent In The Model Theory And Application
8	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
9	The Key Technologies Research Of Web Text Mining Based On Hadoop
10	Explorations on the longest common increasing subsequence problem