Similarity Search On Large-scale High-dimensional Data

Posted on:2021-04-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X K Feng

Full Text:PDF

GTID:1488306050463734

Subject:Computer system architecture

Abstract/Summary:

PDF Full Text Request

High-dimensional similarity search is a key fundamental component for effective processing and analysis of large-scale unstructured data,and has wide applications in database,information retrieval,machine learning,recommendation system and many other related areas.Given a query,similarity search refers to the process of searching the most similar data from a given data set to the query.In practical applications,this problem can be divided into various forms according to the similarity function and the different requirements on accuracy.This thesis mainly focuses on three important similarity search problems in high-dimensional Euclidean space:exact nearest neighbor(NN)search,approximate nearest neighbor(ANN)search and approximate furthest neighbor(AFN)search.Effective similarity search needs to meet the requirements of both instantaneity and accuracyLimited from the large scale and high dimensionality of massive unstructured data,current research in all the above three areas face great challenges.(1)The key of promoting exact NN search is to construct tighter distance lower bounds to lift the filtration rate of irrelevant data.Currently,the tightest one is a kind of data adaptive lower bound.However,it suffers from a high lower bound computation complexity and inefficient pruning inside clusters during the NN search,which draw strict limits on lifting the search speed.(2)For ANN search,a lot of advanced external solutions have been proposed based on locality sensitive hashing(LSH),however,they still suffer from unsatisfactory I/O efficiency due to large number of random I/O consumption or inadequate improvement on the local distribution of NN candidates.(3)Hashing methods and heuristic methods are the most advanced solutions for AFN search.However,neither of them can well adapt to the distribution of furthest neighbors to effectively improve the accuracy and efficiency of AFN search.In this thesis,we first conduct in-depth explorations on the causes of the above issues,and then design more efficient index structures and search algorithms to further improve the performance of large-scale high-dimensional similarity search.The main contributions of this thesis are summarized as follows:(1)For high-dimensional exact NN search,we design two new methods to address the limitations of the current method.Firstly,we design a fast estimation method to quickly determine the hyperplane bound providing the best lower bound of each cluster.In this way,we can ensure the compactness of the lower bounds as well as accelerating the lower bound computing.Secondly,we construct lower bounds for the member points of each cluster,which can help to filter irrelevant points inside clusters.Experimental results show that our new methods can effectively speedup the exact NN search.Compared with existing solutions,the I/O consumption and the CPU response time can be reduced by 20%and 30%,respectively.(2)In order to improve the I/O efficiency of high-dimensional ANN search,we firstly propose a novel LSH index called Optimal Order LSH(02LSH)to improve the local distribution of NN candidates.By introducing space-filling curves to sort the compound LSH keys and rearrange the original data accordingly,NN candidates can be stored on the same or adjacent disk pages so that we can load them via sequential I/O operations.To enquire the best local distribution,a thorough quantitative analysis is conducted on several common space-filling curves.The results show that the "neighborhood-first-traverse" curves can generate better local distribution then the row-wise curve conducted in the basic sortingbased method.By comparison,we choose the best "neighborhood-first-traverse" curve for O2LSH.Based on that,O2LSH further improves both the I/O efficiency and accuracy of ANN search and exhibits better practicality according to the experimental results.(3)Secondly,we propose a new LSH scheme SortingCodes-LSH(SC-LSH)to further reduce the I/O cost of ANN search by enlarging the capacity of NN candidates of each disk page.Most existing ANN solutions use original data points as NN candidates,the relatively large physical size of original data would draw stict limitations on the improvement of ANN search performance.In SC-LSH,we use discriminative short codes instead of original data as NN candidates.Benefited from the high space efficiency and good similarity-preserving ability of discriminative short codes,SC-LSH can greatly increase the number of NN candidates held in a single disk page.Then we introduce the above sorting method into SC-LSH to rearrange the storage of the codes.In this way,we can access sufficient candidate codes to guarantee the ANN search accuracy via much less sequential I/Os.Experimental evaluations show that SC-LSH can achieve the best search accuracy compared with the state-of-the-arts and also significantly reduce the I/O consumption and space consumption.(4)For high-dimensional AFN search,we conduct a series of observations on the distribution of furthest neighbor points,and find that there exist a "distance aggregation phenomenon"in the distribution of furthest neighbors,which we claim is the main cause of the limitation of existing hashing methods.Also,we find that the numbers of furthest neighbor points vary a lot in different data sets,which has a strong correlation with the performance differences of existing heuristic methods on different data sets.Based on these findings,we propose a data set adaptive strategy for AFN search.Firstly,by improving an AFN search difficulty evaluating approach,we divide the data sets into three difficulty levels.Then we improve an existing AFN search algorithm and design two new algorithms to do AFN search on different difficulty levels of data sets.Combining the improved difficulty evaluating approach with the three algorithms,we propose a data set adaptive AFN search solution.Experimental results show that it can achieve the best AFN search performance on different data sets and exhibits good practicability.

Keywords/Search Tags:

large-scale high-dimensional data, curse of dimensionality, high-dimensional index, similarity search, nearest neighbor search, furthest neighbor search, distance lower bound, I/O performance bottleneck

PDF Full Text Request

Related items

1	Research On K-nearest Neighbor Search Algorithm In High Dimensional Space
2	Learning And Indexing Structural Representations Of Large Scale High Dimensional Data
3	Multiple Hash Tables Indexing And Optimization For Approximate Nearest Neighbor Search
4	Hash-based Approximate Nearest Neighbor Search For High-dimensional Data
5	NAQ-tree: Effective index structure for similarity search in high dimensional space
6	Research On Approximate Nearest Neighbor Search And Maximum Inner Product Search For High-dimensional Dataset
7	Research Of Nearest Neighbor Search Based On VQ
8	Research On The Index Of Massive High Dimensional Data Via Multiple Hash Tables
9	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
10	A Study Of Nearest Neighbor Search In High-dimentional Space