Research On Efficient Algorithms For Large-scale MLCS Problems

Posted on:2023-05-02

Degree:Doctor

Type:Dissertation

Country:China

Candidate:S W Wei

Full Text:PDF

GTID:1528306911480974

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

A large amount of information in the real world can be represented in the form of character sequences.Searching the longest common subsequences from multiple character sequences（MLCS）,as a specific technique of data mining,is usually used to analyse the intrinsic re-lationships among them or to discovery useful knowledge from them.The MLCS problems have a wide range of applications,such as in bioinformatics,pattern recognition,document comparison,and information retrieval,to name a few.However,tackling MLCS problem is very challenging,the existing algorithms are not efficient and can only deal with small-scale MLCS problems.With the increase of the scale of the MLCS problems,the running time and memory space consumed by these algorithms will increase rapidly or even ex-ponentially,resulting in that they can not find the final results within the limited time and memory space.The fundamental reasons that affect the efficiency of algorithms are:1)The search space is large,and a large amount of repetitive and useless information needs to be stored and calculated during the execution of the algorithm,resulting in excessive time and space consumption;2)The non-dominated sorting technique for searching the longest path（corresponding to MLCS）is very time consuming,while the topological sorting technology usually requires much more memory space;3)For the graph model based algorithms,there are a great number of useless match points（that is,they contributes nothing for finding the longest paths from the graph）and the non-longest paths passing through them,and the exist-ing technologies cannot identify them in time and delete them for saving running time and memory space.This thesis investigates the fundamental reasons that affect the inefficiency of algorithms,and designs several novel algorithms in order to deal with larger scale MLCS efficiently.The main work and innovations of this thesis are summarized as follows:（1）Aiming at the problems of existing algorithms that a lot of nodes may appear many times in the graph and the non-longest paths cannot be deleted dynamically during the construction of the Directed Acyclic Graph（DAG）,a Path Recorder Directed Acyclic Graph（PRDAG）model is designed and a called Path Recorder Algorithm based on this model（PRA）is pro-posed.When constructing PRDAG,a repeated node checking mechanism is designed,which can quickly detect whether a node has existed in the graph by using hash technology.If a node has been in the graph,it will no longer be put into the graph twice to ensure that there are no duplicate or redundant match points in the DAG.In addition,all nodes in the graph only record their key precursor nodes.Once the node level value is updated,the PRA algorithm will delete the node’s previous predecessor node（now it has become a non-key precursor node）in time,which is equivalent to removing the non-longest path information,so the size（scale）of the DAG becomes smaller,and accordingly a lot of memory space and running time are saved.（2）In order to solve the problem of excessive memory consumption and the problem of ex-cessive time consumption by the algorithms based on non-dominated sorting technology,a Hierarchical Propulsion Directed Acyclic Graph model（HP-DAG）is developed,and an effi-cient algorithm EAHPM based on this the model is proposed.The new algorithm transforms finding MLCS from sequences into searching the longest path in a DAG.A match point dy-namic deletion technology is designed to delete outdated match points in time（these match points will no longer be used in the subsequent construction of DAG）,and at the same time,combined with the longest distance estimation technology,we check out those unreachable match points to reduce the number of match points in the DAG.The total number of match points in the DAG is reduced,so the time and memory space consumed by the new algorithm will be greatly saved accordingly.（3）Aiming at the problem that the scale of the constructed DAG is large and the useless match points in the graph cannot be identified in time,a branch elimination-based algorithm BEST-MLCS,which is very efficient in both space and time consumption,is proposed.The BEST-MLCS algorithm designs a completely new useless match point judging technique and branch elimination strategy based on upper and lower bound estimation,that is,in the pre-processing stage of the algorithm,a lower bound L_mlcsof the length of MLCS is quickly estimated by a heuristic method,then a special method is designed which can quickly cal-culate an upper bound U_{（O,p,∞）}of all paths passing through the match point p.If the upper bound U_{（O,p,∞）}is less than the lower bound L_mlcs,it can be inferred that the match point p must be a useless match point,and all paths passing through the match point are the non-longest paths.During the construction of DAG,we can delete these useless match points and the branch paths passing through them in time（i.e.,the branch elimination strategy）,thus the scale of the constructed DAG is very small,which can greatly save running time and memory space.（4）A new algorithm called Mema-MLCS based on the minimized graph model is proposed to solve the problem that the search space is too large and the number of nodes in the graph is too huge.The algorithm Mema-MLCS designs a scheme,which can quickly calculate the span value of match points,and identifies useless match points by comparing the span value of the node with the estimated lower bound of the length of MLCS,thus,it can delete useless match points and the no-longest paths passing through them timely,resulting in that a much smaller DAG is generated.The Mema-MLCS algorithm consumes much less time and memory space to obtain the final results than the other algorithms,so that it can handle much larger scale MLCS problem.

Keywords/Search Tags:

Multiple Longest Common Subsequences(MLCS), Dynamic Programming(DP), Dominance-based method, Useless match point identification, Branch elimination, Directed Acyclic Graph(DAG)

PDF Full Text Request

Related items

1	Two MLCS Branch-and-Bound Algorithms Based On Memory Saving Or Fast Construction Of Graph
2	Graph Models And Algorithms Of The Longest Common Sub-sequences For Many Long Sequences
3	Research On Modeling Methods For A Kind Of Key Problems In Sequence Mining
4	Research On The Key Techniques Of Graph Similarity Measure
5	Approximate Longest Common Substring Matching And Optimization Techniques With Edit Distance Constraint
6	A Research On Finding The Conjunct Motif Of Series
7	Research On Causal Discovery Method Based On Reinforcement Learning
8	Hierarchical Graph Model Of Visual Programming Components And Its Application
9	The Research On Algorithms For The Longest Common Subsequence Problem And Variants
10	Research On Scheduling Algorithm For Parallel Job Modeled By Directed Acyclic Graph