Font Size: a A A

Efficient Approximate String Matching With Edit Distance Constraint

Posted on:2017-04-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:J Y WangFull Text:PDF
GTID:1318330542477139Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Approximate string matching problem widely exists in many fields such as text retrieval,pattern recognition,signal processing,and bioinformatics.Edit distance is the most common metric of string similarity.To address the approximate string matching problem,many approaches have been proposed.However,with the rapid growth of web data,log data,and bio-sequences,rapid development of computer system,and continuously rising of new issues,existing approximate string matching approaches face enormous challenges.Researching on efficient approximate string matching is important in theory and practice.This dissertation deeply studies several typical problems of approximate string matching,and proposes several efficient indexes and matching methods according to different problems.The contributions of this dissertation can be summarized as follows:(1)Existing methods are highly sensitive to query strings or query parameters.To address the problem,we propose an adaptive approximate string matching approach.The method achieves one order of magnitude faster than existing methods.Firstly,using BWT index and even partition scheme,we develop a basic approximate string matching method.Secondly,a best partition scheme is proposed to select the high quality of the candidate set.It can find the minimum candidate set,which effectively reduce the verification cost.Furthermore,through theoretical analysis,we find that the best partition scheme cannot always outperform the even partition scheme.Thus we propose an adaptive approach for selectively choosing scheme using statistic knowledge.(2)Existing methods all focus on global matching between two strings.To address the problem,we propose a local similarity join approach.The method fills the gap in string local similarity join field.Firstly,if two strings are locally similar to each other,they should share at least one common gram of a certain length.Based on which,we develop a local similarity join framework.Secondly,local similarity verification based on a matching gram pair is not trivial.Existing extension-based method,which conducts exact extension in one string and similar extension in another one,cannot solve the problem.To solve the problem,a skyline based local similarity verification method is proposed.Further,in the join process,there are many matching grams,some of which could not extend to the final results,whereas others could cause duplicate verifications.To solve the problem,several filtering techniques are proposed to carefully choose matching gram pairs.Finally,we extend local similarity checking method to do local similarity locating.(3)Most existing methods only work on single core.To address the problem,we propose a multi-core parallel approximate string matching approach.The method achieves one or two order of magnitude faster than existing methods.Firstly,we propose a multi-core parallel query framework to boost the approximate string matching.It dynamically adjusts the computing resources according to the specific query.Secondly,we improve the traditional BWT backward searching method,and devise efficient filtering and verification methods.We segment and reassemble queries to balance the task assignments,and share computation to avoid redundant computing.Finally,a parallel BWT index method is proposed to do efficient string matching under the environment with limited memory.(4)In order to further improve the scalability of query on bio-sequences,we propose an efficient compressed genomic data oriented query approach.Compared to existing methods,it can support one or two order of magnitude larger biological datasets.We propose an efficient compressed genomic data oriented query approach.Firstly,based on the characteristics that different genomic sequences are highly similar with each other,a compressed genomic data oriented index method is proposed.Using the index,we can find all the matching results by checking a small fraction of the matching grams without decompressing the sequences.Secondly,several improved index methods are proposed to reduce the space cost but still guarantee the correctness.Finally,a distributed index method based on compressed genome data is proposed to effectively balance the space cost among nodes.The method has good scalability.
Keywords/Search Tags:approximate string matching, edit distance, multi-core parallel, adaptive approach, local similarity
PDF Full Text Request
Related items