With the ever-growing volume of data to process, querying large datasets efficiently has become increasingly important. In this setting, nearest neighbor search is a fundamental and widely applied operation. For low-dimensional data, index structures such as the B-tree and R-tree solve the near-neighbor search problem well. As the dimension grows, however, query efficiency degrades dramatically because of the curse of dimensionality. Two main approaches address this problem. The first is dimensionality reduction; however, it requires training on the dataset and is therefore unsuitable for dynamic datasets. The second is high-dimensional indexing, which trades storage for query speed. Among hashing-based high-dimensional indexes, locality sensitive hashing (LSH) is one of the most widely adopted methods. A key advantage is that LSH guarantees query accuracy with a constant probability, a guarantee that many other data-independent algorithms cannot offer. The nature of this probability p is as follows: for a fixed query q, generate N random LSH data structures (with N large); the fraction of those structures in which q collides with a given r-near neighbor is very close to p. In practice, however, the LSH data structure is fixed, and users judge performance by the recall rate: for a random query q, the fraction of its r-near neighbors that collide with q on the fixed data structure. Clearly, p and the recall rate are different concepts, yet in practice they are very close, as many experiments have verified. This is a major reason why LSH is so widely used.

This paper studies theoretically why p is close to the recall rate, in two respects. (1) Assuming a random dataset and a fixed single hash function, we compute the recall rate in three common measure spaces (the L2 norm, the arccos measure, and the Hamming space) and find that the recall rate and p are asymptotically equal (they
are exactly equal in the Hamming space). Based on these theoretical results, the length of the hash function can be controlled in the L2 space, and a new hash family can be constructed, so that the variance of the recall rate tends to 0, bringing the recall rate even closer to the theoretical p. (2) For a general dataset, the expectation of an LSH data structure's recall rate is p. We analyze the variance of the recall rate, its relationship to the variance of a single hash function and to the parameters k and L, and derive an upper bound on the variance. Choosing practical parameter values, we compute the variance of the recall rate exactly; the result shows that it is 2-3 orders of magnitude smaller than the variance of a single hash function, which explains why practical data structures provide good performance guarantees. According to our results, if the parameters k and L are chosen reasonably and the dataset is large, the variance of the recall rate is very small, which guarantees, by Chebyshev's inequality, that when a dataset and a data structure are randomly chosen and fixed, the recall rate is highly likely to be close to the probability p.
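As an illustration of the quantities involved (our own sketch on synthetic data, not the paper's construction), the following Python snippet uses random-hyperplane (SimHash) hashing under the arccos measure, where a single hash bit collides for vectors at angle θ with probability p = 1 − θ/π. It compares the theoretical collision probability with the empirical recall of many independently drawn structures, first for a single hash function and then for an amplified (k, L) structure:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Synthetic data: a query q and its near neighbors, all on the unit
# sphere, so the arccos (angular) measure applies.
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
neighbors = q + 0.3 * rng.standard_normal((200, d))
neighbors /= np.linalg.norm(neighbors, axis=1, keepdims=True)

# Theoretical collision probability of one SimHash bit for each
# neighbor: p = 1 - theta/pi, where theta is the angle to q.
theta = np.arccos(np.clip(neighbors @ q, -1.0, 1.0))
p_bit = 1.0 - theta / np.pi

def expected_recall(k, L):
    """Expected recall of a (k, L) structure: mean of 1 - (1 - p^k)^L."""
    return np.mean(1.0 - (1.0 - p_bit ** k) ** L)

def empirical_recall(k, L, trials=300):
    """Recall of `trials` independently drawn (k, L) SimHash structures."""
    out = np.empty(trials)
    for t in range(trials):
        planes = rng.standard_normal((L, k, d))   # L tables of k hyperplanes
        q_codes = np.sign(planes @ q)             # (L, k)
        n_codes = np.sign(planes @ neighbors.T)   # (L, k, n)
        # A neighbor collides with q if all k bits match in some table.
        found = (n_codes == q_codes[:, :, None]).all(axis=1).any(axis=0)
        out[t] = found.mean()
    return out

single = empirical_recall(k=1, L=1)   # one hash function
struct = empirical_recall(k=8, L=16)  # amplified structure

print(expected_recall(1, 1), single.mean(), single.var())
print(expected_recall(8, 16), struct.mean(), struct.var())
```

In both cases the mean empirical recall matches the theoretical expectation 1 − (1 − p^k)^L, while the variance of recall across random (k, L) structures is far smaller than across single hash functions. By Chebyshev's inequality, Pr(|recall − p| ≥ ε) ≤ Var(recall)/ε², so one randomly drawn and fixed structure is very likely to achieve recall close to p, which is the phenomenon the paper quantifies.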