Research On Modeling Methods For A Kind Of Key Problems In Sequence Mining

Posted on:2021-12-23

Degree:Master

Type:Thesis

Country:China

Candidate:S Liu

Full Text:PDF

GTID:2518306050471964

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Searching all the longest common subsequences of multiple character sequences is called a Multiple Longest Common Subsequences(MLCS)problem.MLCS problem is a key problem in sequence mining which has a wide range applications in data mining,bioinformatics,pattern recognition and so on,and is a very challenging problem when the length and number of sequences become large.There has not been an effective exact algorithm(i.e.,the algorithm can find all longest common subsequences)until now for this problem with long and large number of sequences.The time and space complexity of the current best performance algorithms(dominant point based algorithms)is exponentially related to the length of the sequence.We analyze that the disadvantage of dominant point based algorithm are: 1)the too high time consuming on non-dominated sorting;2)the too big of directed acyclic graph(DAG).In order to overcome these defects,the research work in this paper is as follows:(1)By considering the characteristics of DNA sequences with many consecutively repeated characters,we first design a character merging scheme which merges the consecutively repeated characters in the sequences.As a result,it shortens the length of sequences considered,saves the space of storing all sequences and reduces the cost of searching the precursor points.To further reduce the space and time costs,we construct a Weighted Directed Acyclic Graph(WDAG)which is much smaller than widely used DAG for MLCS problems.Based on these techniques,we propose a fast and memory efficient algorithm for MLCS problems.Finally,we analyze the performance of the proposed algorithm and carry out experiment.Compared with several state-of-the-art algorithms,the experimental results show that the proposed algorithm performs better than the compared state-of-the-art algorithms in both time and space costs.Thus,the proposed algorithm is a fast and memory efficient MLCS algorithm for DNA sequences alignment.(2)We propose a new effective exact algorithm for solving large scale MLCS problems including three new strategies: 1)We design a new direct acyclic graph(DAG)construction method which can construct a new DAG called REG.REG is much smaller than DAG constructed by the existing algorithms.Thus,the proposed algorithm can solve more larger scale problems.2)We introduce a new concept called direct successor which can make the REG timely delete non-critical points during the construction of the REG and greatly reduce time and space overhead.3)We design a new data structure called Precursor Table which can make the search of solutions more efficiently and make the proposed algorithm use much fewer time.The time and space complexity of the proposed algorithm is linearly,and thus is very suitable to large scale MLCS problems.Comprehensive experiments on the widely used biological sequence benchmark data sets show that the proposed algorithm outperforms the compared state-of-the-art dominant point based algorithms.

Keywords/Search Tags:

Data Mining, Sequences Alignment, Longest Common Subsequences, Directed Acyclic Graph, Dominant Point Based Algorithm, Character Merging, Direct Successor

PDF Full Text Request

Related items

1	Graph Models And Algorithms Of The Longest Common Sub-sequences For Many Long Sequences
2	Research On Efficient Algorithms For Large-scale MLCS Problems
3	Two MLCS Branch-and-Bound Algorithms Based On Memory Saving Or Fast Construction Of Graph
4	The Research On Algorithms For The Longest Common Subsequence Problem And Variants
5	Research On The Key Techniques Of Graph Similarity Measure
6	Research On Key Technologies Of Textual Proofreading On Government Websites
7	A Research On Finding The Conjunct Motif Of Series
8	Data Streams Clustering Algorithms Research Based On Spatial Directed Graph And Sequences
9	Approximate Longest Common Substring Matching And Optimization Techniques With Edit Distance Constraint
10	The Study On Biclustering Algorithms For Analysis Of Gene Expression Data