Research On Modeling Methods For A Kind Of Key Problems In Sequence Mining | Posted on:2021-12-23 | Degree:Master | Type:Thesis | Country:China | Candidate:S Liu | Full Text:PDF | GTID:2518306050471964 | Subject:Computer Science and Technology | Abstract/Summary: | PDF Full Text Request | Searching all the longest common subsequences of multiple character sequences is called a Multiple Longest Common Subsequences(MLCS)problem.MLCS problem is a key problem in sequence mining which has a wide range applications in data mining,bioinformatics,pattern recognition and so on,and is a very challenging problem when the length and number of sequences become large.There has not been an effective exact algorithm(i.e.,the algorithm can find all longest common subsequences)until now for this problem with long and large number of sequences.The time and space complexity of the current best performance algorithms(dominant point based algorithms)is exponentially related to the length of the sequence.We analyze that the disadvantage of dominant point based algorithm are: 1)the too high time consuming on non-dominated sorting;2)the too big of directed acyclic graph(DAG).In order to overcome these defects,the research work in this paper is as follows:(1)By considering the characteristics of DNA sequences with many consecutively repeated characters,we first design a character merging scheme which merges the consecutively repeated characters in the sequences.As a result,it shortens the length of sequences considered,saves the space of storing all sequences and reduces the cost of searching the precursor points.To further reduce the space and time costs,we construct a Weighted Directed Acyclic Graph(WDAG)which is much smaller than widely used DAG for MLCS problems.Based on these techniques,we propose a fast and memory efficient algorithm for MLCS problems.Finally,we analyze the performance of the proposed algorithm and carry out experiment.Compared with several state-of-the-art algorithms,the experimental results show that the proposed algorithm performs better than the compared state-of-the-art algorithms in both time and space costs.Thus,the proposed algorithm is a fast and memory efficient MLCS algorithm for DNA sequences alignment.(2)We propose a new effective exact algorithm for solving large scale MLCS problems including three new strategies: 1)We design a new direct acyclic graph(DAG)construction method which can construct a new DAG called REG.REG is much smaller than DAG constructed by the existing algorithms.Thus,the proposed algorithm can solve more larger scale problems.2)We introduce a new concept called direct successor which can make the REG timely delete non-critical points during the construction of the REG and greatly reduce time and space overhead.3)We design a new data structure called Precursor Table which can make the search of solutions more efficiently and make the proposed algorithm use much fewer time.The time and space complexity of the proposed algorithm is linearly,and thus is very suitable to large scale MLCS problems.Comprehensive experiments on the widely used biological sequence benchmark data sets show that the proposed algorithm outperforms the compared state-of-the-art dominant point based algorithms. | Keywords/Search Tags: | Data Mining, Sequences Alignment, Longest Common Subsequences, Directed Acyclic Graph, Dominant Point Based Algorithm, Character Merging, Direct Successor | PDF Full Text Request | Related items |
| |
|