| String Similarity Search is one of the core functionality in a range of applications,including data cleaning,near-duplicate object detection and data integration in the database system.Given a set of data objects,a similarity function,and a query criterion,string similarity search aims to find all data objects that meet the query criterion under the similarity function.Due to the importance of the string similarity search,it has been intensively studied over the past few decades.However,with the rapid development of Internet and information technology,the amount of data generated in the real world has increased dramatically,string similarity search has ushered in new challenges and opportunities.On large-scale datasets,traditional string similarity search techniques usually suffer from low efficiency and excessive space consumption.While the emergence of new technologies,such as machine learning and spatio-temporal data system,etc.,also brings new prospects to string similarity search technology.Therefore,how to design new algorithms to make string similarity search more efficient,how to use new technologies such as machine learning to improve string similarity search,and how to apply string search technology in new fields such as spatiotemporal data systems,are all important issues to be studied.Based on above the three levels,string similarity search technique is explored.String similarity query and join algorithms under various similarity functions,learned index based string similarity search algorithms,and string similarity search based activity trajectory search algorithms are studied.Specifically:Firstly,the top-k overlap string similarity join problem is studied.A step size(that is,the number of elements accessed by the algorithm in each iteration)based algorithm is proposed to improve the query efficiency by optimizing the traditional algorithm.Theoretically analyzes are first given to demonstrate the positive influence of step size on the algorithm,proving the benefits of large step size,and conclude that there is an optimal fixed step size is obtained.Based on the conclusion,a fixed step size algorithm is proposed.Then,in order to make the algorithm more available in practice,an adaptive step size algorithm which automatically adjusts the step size during the algorithm process is proposed.The algorithm avoids manual step size setting and makes full use of the advantages of large step sizes.Then,extension to other similarity functions such as Jaccard and Cosine is introduced.Extensive experimental study shows that the proposed algorithms outperform the SOTA methods on large-scale real datasets with 4-14 times faster of query time.Secondly,the threshold-based string similarity search problem under edit distance is studied.A sketching method based indexing algorithm is proposed,and a machine learning based learned index technology is introduced into the index structure.The sketching method is used to capture the pivot characters in the string to construct a shorter string sketch representation,which can guarantee a high accuracy of the candidates.Based on the sketch representation,concise index structures are designed to search sketches,which substantially reduces the space consumption,and learned index is used to replace the length filtering structure to speed up searching on the index.Extensive experiments on real-world datasets show that compared with the SOTA methods,the proposed algorithm reduces the space consumption up to 75% reduction and achieves better performance with up to 60 times faster of query time.Finally,the string search based activity trajectory search problem is studied.A new distance measurement and a string encoding based grid tree threshold algorithm are proposed.First,a distance measurement that combines the three dimensions of time,space and keywords is designed for active trajectories,and an inverted index based threshold algorithm is proposed as baseline.Then,a string encoding based active grid tree index is designed to store trajectory points,and a grid tree threshold algorithm is proposed to quickly filter the candidates with large spatial distances and mismatched keywords.Furthermore,the method is extended to support parallel processing.Finally,Extensive experiments indicate the high efficiency of the proposed algorithms.Compared with the baseline,the query efficiency of the proposed algorithm is improved by 3-10 times. |