Font Size: a A A

Approximate Similarity Search In High-Dimensional Euclidean Space

Posted on:2018-09-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:Q HuangFull Text:PDF
GTID:1318330536476259Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The problem of high-dimensional similarity search has wide applications in Database systems,data mining,and computational geometry.Given a similarity function,the problem of similarity search is to find the object in a database which is most similar to a given query.Based on different similarity functions,there are different problem formulations of similarity search.In the thesis,we focus on three fundamental problems of similarity search in Euclidean space: nearest neighbor search,furthest neighbor search and maximum inner product search.Due to the curse of dimensionality,finding exact answers in high dimensional space is extremely costly.Therefore,the approximate similarity search problem has attracted extensive studies for the nearly two decades.However,there are still a lot of challenges in high-dimensional similarity search:(1)For the problem of high-dimensional c-Approximate Nearest Neighbor(c-ANN)search,LocalitySensitive Hashing(LSH)and its variants are the most influential methods.However,due to the query-oblivious bucket partition,traditional LSH functions often lead to the incorrect partition problem.And due to the same reason,state-of-the-art LSH schemes for external memory,namely LSB-Forest and C2 LSH,only work with approximation ratio c ? 2.(2)For the problem of high-dimensional c-Approximate Furthest Neighbor(c-AFN)search,there is no efficient external method for high-dimensional approximate FN search.(3)For the problem of high-dimensional c-Approximate Maximum Inner Product(c-AMIP)search,the state-of-the-art methods,such as SignALSH and Simple-LSH,do not fully utilize the information of data distribution and the query,which lead to low search accuracy and efficiency.Motivated by the above problems,starting from the LSH function family and the LSH schemes,we have made deep studies of the problem of c-ANN search,c-AFN search,and c-AMIP search in high-dimensional Euclidean space.In summary,the main contributions of the thesis are described as follows.For the problem of high-dimensional c-ANN search,we first propose a novel query-aware LSH function for lpnorms,where p ?(0,2].Compared with traditional LSH function,the novel LSH function uses the query as anchor and partition the data objects when the query arrives,then the incorrect partition problem can be avoided.Based on the query-aware LSH functions,we propose a novel LSH scheme named Query-Aware LSH(QALSH)for ANN search over external memory.Our theoretical studies show that QALSH enjoys a guarantee on query quality,and works with any approximation ratio c > 1.Extensive experiments show that QALSH outperforms the state-of-the-art schemes LSB-Forest,C2 LSH,and SRS,especially in high-dimensional space.For the problem of high-dimensional c-AFN search,we first propose a novel concept of reverse LSH function family,and design a practical reverse query-aware LSH function family.Based on the new reverse query-aware LSH family,we propose two new hashing schemes RQALSH and RQALSH*for high-dimensional c-AFN search over external memory.Our theoretical analysis show that RQALSH enjoys a guarantee on query quality.The experimental results demonstrate that our proposed RQALSH and RQALSH*schemes significantly outperform state-of-the-art methods QDAFN and DrusillaSelect.For the problem of high-dimensional c-AMIP search,by leveraging the information of data distribution and the location of query,we propose a novel hashing scheme named Asymmetric LSH scheme based on Homocentric Hypersphere,or simply H2-ALSH.We show that H2-ALSH enjoys a guarantee on query quality,and works with any approximation ratio 0 < c < 1.Extensive experiments show that H2-ALSH can effectively improve the search accuracy,and H2-ALSH significantly outperforms state-of-the-art methods Sign-ALSH and Simple-LSH.
Keywords/Search Tags:Similarity Search, Locality-Sensitive Hashing, Query-Aware, HighDimensional Space
PDF Full Text Request
Related items