Research On String Similarity Search Algorithms

Posted on:2023-03-27

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z Yang

Full Text:PDF

GTID:1528307172452484

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

String Similarity Search is one of the core functionality in a range of applications,including data cleaning,near-duplicate object detection and data integration in the database system.Given a set of data objects,a similarity function,and a query criterion,string similarity search aims to find all data objects that meet the query criterion under the similarity function.Due to the importance of the string similarity search,it has been intensively studied over the past few decades.However,with the rapid development of Internet and information technology,the amount of data generated in the real world has increased dramatically,string similarity search has ushered in new challenges and opportunities.On large-scale datasets,traditional string similarity search techniques usually suffer from low efficiency and excessive space consumption.While the emergence of new technologies,such as machine learning and spatio-temporal data system,etc.,also brings new prospects to string similarity search technology.Therefore,how to design new algorithms to make string similarity search more efficient,how to use new technologies such as machine learning to improve string similarity search,and how to apply string search technology in new fields such as spatiotemporal data systems,are all important issues to be studied.Based on above the three levels,string similarity search technique is explored.String similarity query and join algorithms under various similarity functions,learned index based string similarity search algorithms,and string similarity search based activity trajectory search algorithms are studied.Specifically:Firstly,the top-k overlap string similarity join problem is studied.A step size(that is,the number of elements accessed by the algorithm in each iteration)based algorithm is proposed to improve the query efficiency by optimizing the traditional algorithm.Theoretically analyzes are first given to demonstrate the positive influence of step size on the algorithm,proving the benefits of large step size,and conclude that there is an optimal fixed step size is obtained.Based on the conclusion,a fixed step size algorithm is proposed.Then,in order to make the algorithm more available in practice,an adaptive step size algorithm which automatically adjusts the step size during the algorithm process is proposed.The algorithm avoids manual step size setting and makes full use of the advantages of large step sizes.Then,extension to other similarity functions such as Jaccard and Cosine is introduced.Extensive experimental study shows that the proposed algorithms outperform the SOTA methods on large-scale real datasets with 4-14 times faster of query time.Secondly,the threshold-based string similarity search problem under edit distance is studied.A sketching method based indexing algorithm is proposed,and a machine learning based learned index technology is introduced into the index structure.The sketching method is used to capture the pivot characters in the string to construct a shorter string sketch representation,which can guarantee a high accuracy of the candidates.Based on the sketch representation,concise index structures are designed to search sketches,which substantially reduces the space consumption,and learned index is used to replace the length filtering structure to speed up searching on the index.Extensive experiments on real-world datasets show that compared with the SOTA methods,the proposed algorithm reduces the space consumption up to 75% reduction and achieves better performance with up to 60 times faster of query time.Finally,the string search based activity trajectory search problem is studied.A new distance measurement and a string encoding based grid tree threshold algorithm are proposed.First,a distance measurement that combines the three dimensions of time,space and keywords is designed for active trajectories,and an inverted index based threshold algorithm is proposed as baseline.Then,a string encoding based active grid tree index is designed to store trajectory points,and a grid tree threshold algorithm is proposed to quickly filter the candidates with large spatial distances and mismatched keywords.Furthermore,the method is extended to support parallel processing.Finally,Extensive experiments indicate the high efficiency of the proposed algorithms.Compared with the baseline,the query efficiency of the proposed algorithm is improved by 3-10 times.

Keywords/Search Tags:

Data Processing, Data Retrieve, Spatio-Temporal Data, String Similarity Search, Trajectory Search, Set Similarity, Edit Distance

PDF Full Text Request

Related items

1	Research On Top-k String Similarity Search Based On Edit Distance
2	Representing And Indexing Trajectories For Efficient Similarity Search With TRI Framework
3	Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment
4	Research On Similarity Search Technique For Big Data
5	Measurement And Application On User Trajectory Similarity Based On Spatio-temporal Data
6	Research And Design On The Similarity Search System For Massive Trajectories
7	Research On The Algorithms For String Similarity Search
8	Research On String Edit Similarity Join
9	Trajectory Similarity Algorithms And Applications Based On Spatial And Temporal Data
10	Multi-source Spatio-temporal Similarity Algorithm Based On Symbolic Representation Of Trajectory Movement Parameters