Font Size: a A A

The Research On The Longest Common Subsequence Query Algorithm

Posted on:2019-12-15Degree:MasterType:Thesis
Country:ChinaCandidate:L S ZhaiFull Text:PDF
GTID:2428330566989122Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The longest common subsequence problem is a classic problem in the field of computer science,which is used to return the longest common subsequence of multiple sequences,called MLCS.MLCS has important applications in gene detection,sequence similarity comparison,pattern recognition,data mining,code clone detection,document duplication and web clustering.At present,the algorithm based on dynamic programming and the algorithm based on the dominating point are usually used to solve the MLCS problem.With the increase in number and length of character sequences,the existing algorithms can not meet the actual requirements from the perspective of efficiency and scalability.This paper studies the MLCS problem,and the research content is as follows.Firstly,a new algorithm based on single hash is proposed,which is the longest common subsequence algorithm,SF_MLCS,which first generates successive table of each sequence according to the successive relationship between characters,and then represents successive table of all sequences using directed acyclic graph.On this basis,the subsequent relationship between characters is quickly accessed through hash.Compared with existing double hashing methods,the proposed single hash method improves the efficiency while reducing the memory space.through the deeply analysis of the existing algorithms,we find the problem of low efficiency and poor expansibility of the existing algorithms.Secondly,the longest common subsequence algorithm based on divide and conquer strategy,DC_MLCS algorithm,is proposed to solve the problem that the single hash method has high spatial cost and can not deal with long sequence.Representation of directed acyclic graph all subsequent sequence tables based on DC_MLCS algorithm firstly generates k layer vertex subgraph based on the subgraph,then each vertex with zero degree of output is processed one by one.in the treatment of each vertex,a vertex is deleted before treatment in order to reduce the space occupied space,low cost the system can deal with longer sequences.on this basis,the concept of the longest common subsequence of top-k is proposed,and the corresponding algorithm is designed and implemented to solve the longest common subsequence of all k sequences in a given n sequence.Finally,Experiments are carried out based on real data sets to verify the efficiency of the algorithm.
Keywords/Search Tags:The longest common subsequence, divide and conquer, single Hash, top-k
PDF Full Text Request
Related items