Font Size: a A A

Similarity Search On Large-scale High-dimensional Data

Posted on:2021-04-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:X K FengFull Text:PDF
GTID:1488306050463734Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
High-dimensional similarity search is a key fundamental component for effective processing and analysis of large-scale unstructured data,and has wide applications in database,information retrieval,machine learning,recommendation system and many other related areas.Given a query,similarity search refers to the process of searching the most similar data from a given data set to the query.In practical applications,this problem can be divided into various forms according to the similarity function and the different requirements on accuracy.This thesis mainly focuses on three important similarity search problems in high-dimensional Euclidean space:exact nearest neighbor(NN)search,approximate nearest neighbor(ANN)search and approximate furthest neighbor(AFN)search.Effective similarity search needs to meet the requirements of both instantaneity and accuracyLimited from the large scale and high dimensionality of massive unstructured data,current research in all the above three areas face great challenges.(1)The key of promoting exact NN search is to construct tighter distance lower bounds to lift the filtration rate of irrelevant data.Currently,the tightest one is a kind of data adaptive lower bound.However,it suffers from a high lower bound computation complexity and inefficient pruning inside clusters during the NN search,which draw strict limits on lifting the search speed.(2)For ANN search,a lot of advanced external solutions have been proposed based on locality sensitive hashing(LSH),however,they still suffer from unsatisfactory I/O efficiency due to large number of random I/O consumption or inadequate improvement on the local distribution of NN candidates.(3)Hashing methods and heuristic methods are the most advanced solutions for AFN search.However,neither of them can well adapt to the distribution of furthest neighbors to effectively improve the accuracy and efficiency of AFN search.In this thesis,we first conduct in-depth explorations on the causes of the above issues,and then design more efficient index structures and search algorithms to further improve the performance of large-scale high-dimensional similarity search.The main contributions of this thesis are summarized as follows:(1)For high-dimensional exact NN search,we design two new methods to address the limitations of the current method.Firstly,we design a fast estimation method to quickly determine the hyperplane bound providing the best lower bound of each cluster.In this way,we can ensure the compactness of the lower bounds as well as accelerating the lower bound computing.Secondly,we construct lower bounds for the member points of each cluster,which can help to filter irrelevant points inside clusters.Experimental results show that our new methods can effectively speedup the exact NN search.Compared with existing solutions,the I/O consumption and the CPU response time can be reduced by 20%and 30%,respectively.(2)In order to improve the I/O efficiency of high-dimensional ANN search,we firstly propose a novel LSH index called Optimal Order LSH(02LSH)to improve the local distribution of NN candidates.By introducing space-filling curves to sort the compound LSH keys and rearrange the original data accordingly,NN candidates can be stored on the same or adjacent disk pages so that we can load them via sequential I/O operations.To enquire the best local distribution,a thorough quantitative analysis is conducted on several common space-filling curves.The results show that the "neighborhood-first-traverse" curves can generate better local distribution then the row-wise curve conducted in the basic sortingbased method.By comparison,we choose the best "neighborhood-first-traverse" curve for O2LSH.Based on that,O2LSH further improves both the I/O efficiency and accuracy of ANN search and exhibits better practicality according to the experimental results.(3)Secondly,we propose a new LSH scheme SortingCodes-LSH(SC-LSH)to further reduce the I/O cost of ANN search by enlarging the capacity of NN candidates of each disk page.Most existing ANN solutions use original data points as NN candidates,the relatively large physical size of original data would draw stict limitations on the improvement of ANN search performance.In SC-LSH,we use discriminative short codes instead of original data as NN candidates.Benefited from the high space efficiency and good similarity-preserving ability of discriminative short codes,SC-LSH can greatly increase the number of NN candidates held in a single disk page.Then we introduce the above sorting method into SC-LSH to rearrange the storage of the codes.In this way,we can access sufficient candidate codes to guarantee the ANN search accuracy via much less sequential I/Os.Experimental evaluations show that SC-LSH can achieve the best search accuracy compared with the state-of-the-arts and also significantly reduce the I/O consumption and space consumption.(4)For high-dimensional AFN search,we conduct a series of observations on the distribution of furthest neighbor points,and find that there exist a "distance aggregation phenomenon"in the distribution of furthest neighbors,which we claim is the main cause of the limitation of existing hashing methods.Also,we find that the numbers of furthest neighbor points vary a lot in different data sets,which has a strong correlation with the performance differences of existing heuristic methods on different data sets.Based on these findings,we propose a data set adaptive strategy for AFN search.Firstly,by improving an AFN search difficulty evaluating approach,we divide the data sets into three difficulty levels.Then we improve an existing AFN search algorithm and design two new algorithms to do AFN search on different difficulty levels of data sets.Combining the improved difficulty evaluating approach with the three algorithms,we propose a data set adaptive AFN search solution.Experimental results show that it can achieve the best AFN search performance on different data sets and exhibits good practicability.
Keywords/Search Tags:large-scale high-dimensional data, curse of dimensionality, high-dimensional index, similarity search, nearest neighbor search, furthest neighbor search, distance lower bound, I/O performance bottleneck
PDF Full Text Request
Related items