Font Size: a A A

Research On Models And Algorithms For Several Key Problems In Sequence Mining

Posted on:2018-08-29Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z PengFull Text:PDF
GTID:1368330542473058Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
In computer science and many other fields,sequence(string)is one of the most fundamen-tal and important data types,and there are many subjects such as text,source code,and DNA/protein of biological,etc,can be represented by sequences and then processed by com-puter.Therefore,researching the theory and algorithm about sequence is fundamentally important.In this thesis,we focus on three critical problems in sequence mining,i.e.,multi-ple pattern(string)matching problem,suffix sorting problem,and multiple longest common subsequences problem.The multiple pattern(string)matching problem is widely used in text editing and information security;the suffix sorting is the key technique of full text index and data compression;the MLCS problem is mainly used in bioinformatics,and it is the key technique of multiple sequence alignment.The main contributions of this thesis are as follows:(1)For the multiple pattern(string)matching problem,to improve the poor robustness and scalability of the existing methods,a fast matching engine is proposed.The engine includes a filter module and a verification module.The filter module is based on several bitmaps which are response for quickly filtering out the invalid positions in the text,while for each potential matched position,the verification module confirms true pattern occurrence.In particular,we design a compact data structure called Adaptive Matching Tree(AMT)for the verification module,in which each tree node only saves some pattern fragments of the whole pattern set and the inner structure of each tree node is chosen adaptively according to the features of the corresponding pattern fragments.So no matter what the input pattern sets are,the AMT can adjust to the optimal form,thus,achieves a good robustness and scalability.Additionally,an improved multiple pattern matching algorithm is proposed based on the widely used Wu-Manber(WM)algorithm.The proposed algorithm improves the Wu-Manber algorithm in two aspects:firstly,by dynamically selecting the signature of each pattern,the lengths of lists in the HASH table are balanced to reduce the number of candidate patterns;Secondly,a data structure called the“INDEX table”based on binary search is designed to reduce the time for finding candidate patterns.(2)For the suffix sorting problem,a practical suffix sort algorithm called dsufsort is proposed based on the widely used qsufsort algorithm.The qsufsort suffers one critical limitation that it sorts the suffix according to a fixed number of characters in each round,specifically,the suffixes are sorted based on their first 2~kcharacters in the k-th round,so the order of suffixes starting with the same 2~kcharacters can not be determined in the k-th round.To overcome this drawback,the dsufsort algorithm maintains the depth of each unsorted portion of SA,and sorts the suffixes based on the depth in each round.By this means,some suffixes that can not be sorted by qsufsort in each round can be sorted now,as a result,more sorting results in current round can be utilized by the latter rounds and the total number of sorting rounds will be reduced.Moreover,due to the accumulation of depth,the dsufsort algorithm can quickly sort the suffixes with long common prefix.(3)For the multiple longest common subsequence(MLCS)problem,the time and space efficiency of the existing dominated point graph based approaches is unsatisfactory:the dominated point graph constructed by these algorithms consume a huge amount of memory and time during processing,since they need to keep all the generated nodes and search the graph for longest paths.To address this problem,a new time and space efficient graph model called the Leveled-DAG for MLCS problems is proposed.During processing,the Leveled-DAG model can timely construct the MLCS of input sequences and eliminate all nodes in the DAG that can not contribute to the construction of MLCS.At any moment,only the current level and some previously generated nodes with incoming edges in the graph need to be kept in the memory.Also,the number of nodes in the graph will gradually decrease,and the final graph contains only one node in which all of the MLCS are saved,thus,no further operations for searching the MLCS are needed.The experimental results show that the time and space efficiency of the Leveled-DAG approach are better than the existing approaches.
Keywords/Search Tags:Multiple Pattern Matching, Wu-Manber Algorithm, Suffix Sort, Multiple Longest Common Subsequence
PDF Full Text Request
Related items