The longest common subsequence(LCS)between sequences can be used in data mining and pattern recognition as a measure of similarity between sequences.The early research was mainly based on dynamic programming,which was used to solve the longest common subsequence problem of two sequences.However,with the increase of data size,the number of sequences that need to be compared at the same time is not limited to two.We call the longest common subsequence problem of three or more sequences the multiple longest common subsequence problem(MLCS problem),and the MLCS problem has been proved to be a NP-hard problem.When solving the MLCS problem,current algorithms will transform the problem into finding the longest path in the directed acyclic graph(DAG).Their most serious drawback is that in the face of large-scale MLCS problems,the time and space costs of constructing directed acyclic graph are very large,so it is difficult to solve the problem in a short time and limited memory space.The algorithm proposed earlier is based on matching points.This algorithm builds DAG layer by layer,and after each layer is built,it deletes the nodes by dominant sort technique.However,in the face of large-scale MLCS problems,the number of nodes in each layer increases exponentially.Each layer will store a large number of the same nodes(redundant nodes),and the dominant sort technique will consume a lot of computing resources.The scale of the problem solved by the algorithm is very limited.A recent algorithm proposed a non-redundant directed acyclic graph(NCSG)model.This algorithm avoids the redundant nodes in the graph by using the hash table data structure.However,after the DAG is constructed,the algorithm needs to obtain the final LCS through two topological sorting,and the hash table will occupy a lot of space,so the algorithm is still difficult to efficiently solve large-scale MLCS problems.In addition,the existing algorithms lack the strategy to reduce the search space,so a large number of non-longest paths are saved in DAG,resulting in a waste of computing resources and storage space.In view of the shortcomings of the current algorithm,this paper designs more effective models and strategies,which significantly reduces the time and space costs of the algorithm in solving large-scale MLCS problems.The innovation of this article is:(1)The current algorithm will save all matching points when building the NCSG through the hash table technology,which will consume a lot of memory space in the search process.And after building the NCSG,the algorithm needs twice topological sorting to solve the longest path in the graph,which increases the time cost.Therefore,we propose a new VHT/DGMLCS algorithm,which uses: 1)A real-time matching point cleaning strategy,which can continuously reduces the number of matching points in the hash table during the search process,thus greatly reducing the space cost of the algorithm;2)A new dynamic graph(DG)model.The DG model uses a simple data structure.When building the DG model,the level of a node will be continuously updated,and only the longest path through a node will be recorded.After the DG model is built,all LCS can be directly obtained by backtracking once,thus avoiding the time of two topological sorting.(2)Aiming at the lack of strategies to reduce the search space in existing algorithms,the BigMLCS algorithm based on the branch-and-bound method is proposed.The algorithm will look for a real path as long as possible as the lower bound of the LCS’s length at initialization phase,and then estimate the upper bound of the length of the path through any node when constructing the DAG.If the upper bound is smaller than the lower bound,any path through the node would not be the longest path,hence the algorithm would not continue to expand the node,reducing the search space eventually.Meanwhile,the branches which cannot form the longest path are removed,thus reducing the size of the DAG and saving significant amount of memory resources.(3)At present,algorithms using branch-and-bound strategy are not efficient enough in estimating the lower and upper bounds,which is manifested as: 1)the estimation method of the upper bound has a large time overhead;2)The estimation of the lower bound is only done once during initialization,which is not precise enough.However,the estimation of the upper and lower bounds is the core of branch-and-bound strategy and directly affects the scale of the DAG that the algorithm ultimately constructs.Besides,the current algorithm does not make reasonable use of storage resources when storing data,and all data is stored in memory,which limits the algorithm’s ability to solve large-scale problems.Therefore,we propose an efficient branch-and-bound algorithm based on distributed storage: Step-MLCS algorithm.The algorithm uses a short time upper bound estimation method and an adaptive lower bound update strategy,which improves the efficiency of branch-and-bound strategy.In addition,the algorithm uses distributed storage to store the built DAG on an external storage device and only keep key information in memory,thus reducing the memory overhead and improving the ability of the algorithm to solve large-scale problems. |