Font Size: a A A

Research On Locality Sensitive Hashing-Based Similarity Search

Posted on:2013-06-29Degree:MasterType:Thesis
Country:ChinaCandidate:K LingFull Text:PDF
GTID:2248330371988442Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Similarity Search is a fundamental problem in Computer Science, and it has been widely used in many fields, including data mining, machine learning, signal processing, information retrieval and so on. Originally, people tried to solve Similarity Search problem by finding the Nearest Neighbor (NN) of the query object and proposed a lot of indexing methods based on Branch and Bound techniques, which have good performance in low-dimensional space. However, when it comes to high-dimensional space, these algorithms for NN search are proved to be no better than linear search, which is known as "curse of dimensionality". Some researchers proposed to improve the time efficiency by applying Approximate Nearest Neighbors (ANN) search in recent years, because ANN can act as good as the exact one in many applications.Locality Sensitive Hashing (LSH) is one of the most popular ANN algorithms, which not only has solid theoretical foundation but also can achieve good performance in high-dimensional space. However, there is a trade-off between search efficiency and memory usage. To achieve high search efficiency, LSH requires quite a lot of memory due to a large number of hash tables used. What’s more, for search time cost, LSH still has room for improvement. Although some researchers have done a lot of work to improve LSH, these improvements are all based on generic LSH scheme. However, generic LSH scheme has some limitations. Besides, these methods can not provide satisfying search efficiency in large data sets.To propose new algorithms which can solve Similarity Search problem better, the following work has been done in this paper:1. We analyze the idea for LSH to solve Similarity Search problem and point out the drawbacks of generic LSH scheme:it requires a large number of hash tables to achieve high search efficiency; the ability to filter points far away from query point is bad; it does not make full use of the properties of hash functions.2. We propose a new LSH-based algorithm called Frequency Based Locality Sensitive Hashing (FBLSH). FBLSH can make full use of the properties of hash functions based on p-stable distributions, and it can filter points far away from query point better with a small number of hash tables by counting the collisions. We prove the correctness of this algorithm and give a simple analysis of the computational complexity of it. Through experiments, we show that FBLSH outperforms LSH based on p-stable distributions.3. To deal with large-scale data efficiently, we propose a hybrid index structure called HKF tree which can reduce the search time cost in large data sets with small amounts of memory. HKF tree uses hierarchical k-means tree as its main structure and builds FBLSH index in each leaf node. We prove the correctness of HKF tree and analyze the computational complexity of it. Through experiments, we show that the performance of it in large data sets is better than hierarchical k-means tree, FBLSH and Posteriori Multi-Probe LSH.
Keywords/Search Tags:Similarity Search, LSH, high-dimensional index, k-means clustering
PDF Full Text Request
Related items