
Research On Hashing Methods Towards Large-scale Similarity Computation And Search

Posted on: 2016-04-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Q Ji
GTID: 1108330503456152
Subject: Computer Science and Technology
Abstract/Summary:
Data volumes explode as the Internet grows. Efficiently computing pairwise similarities and conducting similarity search over large-scale data is a fundamental yet challenging problem with a wide range of applications. Locality-sensitive hashing (LSH) is an effective tool for this problem. LSH methods compress data into compact hash codes; by computing a simple distance (e.g., the Hamming distance) between hash codes, we can efficiently estimate the pairwise similarity or distance between the original data items. Such similarity-preserving hash codes can therefore serve as compressed representations of the original data and be used as inputs to machine learning or data mining algorithms that operate on pairwise similarities, yielding both acceleration and storage reduction. In addition, hash codes can be indexed in hash tables to enable fast similarity search.

Various locality-sensitive hashing methods exist for different kinds of data and similarity measures. However, existing methods suffer from large estimation variance, so many hash functions are needed to reach a desired level of precision, leading to high computation and storage costs. In this thesis, we first study locality-sensitive hashing for simply structured data, e.g., vectors and sets, and propose to sample hash functions in a structured manner to reduce the estimation variance or the hashing time. On the other hand, previous studies mainly focus on hashing methods for simply structured data, while many kinds of data in real applications, such as subspaces and graphs, cannot be represented by simple structures. We propose a hashing method for subspaces, with an extension to 3rd-order tensors. More specifically, we make the following contributions:

1. For vectors, we propose batch-orthogonal locality-sensitive hashing (BOLSH), an improvement over sign-random-projection locality-sensitive hashing (SRP-LSH). BOLSH uses hash functions sampled in a structured manner, namely hash functions constructed from batch-orthogonal random projection vectors. We provide theoretical analysis showing that BOLSH gives an unbiased estimate of the angle between vectors, with smaller variance than SRP-LSH. Our experiments verify the theoretical results.

2. For sets, building on min-hashing, we propose min-max hashing, which again samples hash functions in a structured manner, i.e., paired min and max functions. We prove that min-max hashing halves the hashing time and provides an unbiased estimate of the Jaccard similarity with smaller variance. Our experiments verify the theoretical results.

3. We propose an angular-similarity-preserving locality-sensitive hashing method for subspaces and 3rd-order tensors, with theoretical analysis and experimental verification. We also propose a fast sign-random-projection method and a bilinear random projection method to reduce the hashing time.
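The SRP-LSH estimator that contribution 1 improves upon can be sketched in a few lines: each hash bit is the sign of a random projection, and the fraction of differing bits between two codes is an unbiased estimate of the angle divided by pi. The batch-orthogonal variant below is a minimal illustration of the idea only, assuming batches of Gaussian projections orthogonalized via QR decomposition; it is not the thesis's exact construction or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 256  # data dimension, number of hash bits

# Two vectors with a known angle between them.
u = rng.standard_normal(d)
v = rng.standard_normal(d)
true_angle = np.arccos(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# SRP-LSH: k independent Gaussian projection vectors.
W_srp = rng.standard_normal((k, d))

# Batch-orthogonal sketch (assumption, not the thesis's exact method):
# draw Gaussian matrices in batches of d and orthogonalize each batch.
batches = []
for _ in range(k // d):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    batches.append(Q)
W_bolsh = np.vstack(batches)

def estimate_angle(W, x, y):
    # Hash bit = sign of the projection; the Hamming-distance fraction
    # between the two codes estimates angle / pi.
    hx = (W @ x) >= 0
    hy = (W @ y) >= 0
    return np.mean(hx != hy) * np.pi

print(true_angle, estimate_angle(W_srp, u, v), estimate_angle(W_bolsh, u, v))
```

With k = 256 bits both estimators land close to the true angle; the point of the batch-orthogonal construction is that, for the same number of bits, the structured projections yield a lower-variance estimate than fully independent ones.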
Keywords/Search Tags: locality-sensitive hashing, Hamming distance, similarity search