Font Size: a A A

Research On Similarity Query Over Sequence Data

Posted on:2010-06-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:D B DaiFull Text:PDF
GTID:1118360302979298Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Sequence data is an important and special data type,ubiquitous in many domains such as text,web access log,users' purchase sequences in transaction databases,as well as DNA and protein sequences in biological databases.Intuitively, a sequence is an ordered list of pairs(value,order).Different from the traditional set data,different elements in sequence data bear distinct spatial or time ordering. Element values together with their ordering are indispensible for analyzing and mining various sequence data.String is a common sequence data with spatial ordering.Analyzing and mining on various string data are always common focused issues both in academic and industrial communities.Recently,with rapid increase of various biological sequences in bioinformatics,string databases present two distinct characteristics:the average length of the strings is increasing and the size of these databases is constantly augmented.In addition,with the rapid development of Web technologies and dramatic increase of Internet users,many applications such as search engines based on keyword queries,spell checkers in e-mail systems and duplicate detection in data cleaning and so on,pose a great challenge in the efficient similarity queries on sequence data.Similarity join between two sets of strings returns pairs of strings from each set such that similarity values between the pairs are above a given threshold,which is an extension of similarity query,with wide applications in data cleaning,plagiarism identification,bioinformatics,etc..As important techniques of data analysis and mining,the research on similarity query and similarity joins for string data are always active so far.A core issue in similarity query and similarity joins for string data involves feature extraction in strings,definition and efficient computation of similarity measures.They have been hard research problems due to difficulty in extracting and effectively exhibiting the features of strings and prohibitive time-cost computation of similarity measure.Most existing algorithms on similarity query only use string-feature based filters for speed-up,and totally ignore the effect of filtering order on efficiency.Furthermore,for the coming queries,almost all previous query results are discarded without being used for speeding up the current queries,and edit distances are always directly evaluated to perform post-processing without any optimizations.For similarity joins,most current methods only develop optimizations on static datasets,which are not suitable for highly-dynamic scenarios.Aiming at tackling the above problems,we systematically study the issues of similarity query and similarity joins for string data in this thesis,and make three contributions as follows:(1).A query algorithm SSQ_MF based on optimized multiple filters is proposed.For string data,their own characteristics and metric properties are two kind of important features,which reflect intrinsic property of strings("Internal features") and relationship between different strings("External features") from different aspects.However,existing algorithms are only based on "Internal features" or "External features" to perform filtering and never combine them to further improve the performance.Also,these algorithms do not analyze the effect of filtering order of the multiple filters in use on performance.Algorithm SSQ_MF combines "Internal features" filters and "External features" judiciously,and suggest an optimal model on filtering order of the combined filters.As a result,the overall filtering power is improved and correspondingly the filtering cost is greatly reduced.Comprehensive experimental results demonstrate that SSQ_MF is more efficient than the algorithms without combined filters and those leveraging multiple filters but executing in random orders.(2).An improved query algorithm IRI based on reference-indexing is designed.On the basis of existing reference-index technique,IRi algorithm fully exploits abundant information hidden in the previous query results to speed up filtering and obtain a lower bound on the number of stored previous result that can guarantee the improvement of query performance.In addition,strings' own characteristics are added to tighten the upper and lower bound and hence result in improved filtering power.During the phase of post-processing,IRI only computes a partial dynamic programming table to enhance the efficiency.Experimental results show that IRI algorithm is superior to the existing methods RI in terms of query efficiency on real datasets including NDA and protein sequences.(3).An efficient similarity joins SJ-DASS on large string databases against incremental updates is introduced.To tackle the inability to efficiently perform similarity joins on dynamically-augmented string sets incrementally,an efficient algorithm called SJ-DASS is proposed.The model based on dynamically-augmented string sets is well reflecting the need of real world.By combining the metric properties with strings' own characteristics,we design an distance-based indexing structure that can be updated incrementally,and propose two tighter lower bound on edit distance on the basis of existing filters to improve the filtering power.On dynamically-augmented datasets,SJ-DASS outperforms the existing algorithms in terms of running time and also achieves great space reduction.In a word,two issues related to string similarity:similarity query and similarity joins for string data are well studied in this thesis and efficient algorithms to handle these issues are proposed respectively.Two proposed algorithms:IRI and SSQ_MF achieve effective improvement over the existing methods,and SJ-DASS algorithm effectively extends similarity joins on static string sets to a general case: dynamically-augmented datasets.Theoretical analysis shows our approaches are effective and efficient,and extensive comparing experiments demonstrate that our proposed algorithms are superior in terms of storage cost and processing rate.
Keywords/Search Tags:String, Edit Distance, Similarity Query, Similarity Join, Filter, Index
PDF Full Text Request
Related items