Font Size: a A A

On Hashing Algorithms For Pairwise Join

Posted on:2011-12-17Degree:DoctorType:Dissertation
Country:ChinaCandidate:G ZhaoFull Text:PDF
GTID:1118360305497280Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the applications of information retrieval and database, one of most basic and common search is to identify the pair of related objects from a group of data objects (e.g., documents or images). A typical application is the well-studied problem of similarity join in recent researches. In this thesis, the problem is extended and named of the pairwise join search:given a set of data objects and a relational measure operating on the objects, return all the pairwise of objects that have the measures over a pre-specified minimum threshold. The simplicity of the definition and its wide applicability extend its applications to various applications suh as duplicate detection, association rule mining, statistical analysis, collaborative filtering, and clustering. The research challenges it presents have received a lot of attentions in the past studies. A large body of heuristic methods have been proposed, e.g., Inverted-list indexing algorithms, prefix-filtering algorithms and quasi-monotonic pruning methods.The heuristic-type methods suffer from some bottlenecks in implementations, e.g., the efficiency of pruning can not be analytically ensured, the optimization in performance is hard to achieve for the various distributions of datasets, and a general model does not exist for the given problem. Further improvements in performance are always intractable for the exact methods. Of late, a body of works showed that an approximation result may be satisfied in a large body of applications, which can make a significant improvement in query efficiency. Such a principle also applies to our problem of pairwise join. In this thesis, we decide to solve the approximation version of pairwise join problem by using a group of randomized models. In such cases, some of our concerns are as follows:(1) For a dataset that can not fit into main memory, is there some mapping scheme that transfer the original data into a compressed "sketch"? (2) Is it possible to estimate the relation measure with an acceptable cost(say, on basis of the generated "sketch")? (3) How to avoid comparing all possible pairs, i.e., how to perform a cost-effective filtering priori to comparisons?Enlightened by the Locality-sensitive Hashing (LSH) technique used in (approximate) Nearest Neighbor Search, We find that the similar hashing scheme provides the pairwise join with a good candidate for extracting a compressed sketch from the original data and thus develop a family of randomized models for pairwise join problem, which are called hashing algorithms in this thesis, due to the fact that all the proposed models operate on the sketch generated by hashing scheme. In summary, the theoretical contributions in this thesis are as follows: (1) We study the relationship between the existence of expected hashing scheme and the given relational measures in pairwise join. This part of work essentially provides the applicable range of the proposed hashing techniques. Specifically, we construct a family of hashing schemes based on sampling and rounding schemes for some typical relational measures, from which we prove some necessary conditions on relational measures for the existence of hashing scheme satisfying our specific definition.(2) We propose an interval estimation model. On the basis of the sketch generated from original data by using hashing scheme, we design a simply implementable estimation scheme. Compared with the traditional estimation scheme used in past research, it admits a provable lower bound (logarithmic scale) for the number of hashing evaluations and can be optimizated by adjusting its parameters when constructing a pairwise join search algorithm. The substantial experiments showcase that it outperform the expectation estimation model abstracted from the past studies.(3) We design a cost-effective randomized filter model. This type of randomized model has lower time and space requirements than estimation-type models. Specifically, we first formalize the scheme used in Nearest Neighbor search as Basic LSH (B-LSH) scheme and analyze its properties and main bottleneck in performance. Then we propose an enhanced randomized filter model, which is called Approximate LSH (A-LSH) filter. It proved that the number of hashing evaluations in A-LSH scheme is logarithmic in scalability, which states a significant reduction of sublinear dependency in B-LSH scheme. Furthermore, we proved that the A-LSH filter does not suffer from the quality-efficiency tradeoff that B-LSH filter is subjected to.In application aspects, we put the general theoretical models into various application settings and thus develop a group of randomized algorithms. The achievements in this part of works are as follows:(1) Mining confidence association rules. Fast identifying the associations from a set of infrequent items is essential in various applications. By extending the general estimation model, we obtain an interval estimation scheme for confidence measure, and develop an efficient algorithm for fast mining pairwise associations. The experiments conducted on both real and synthetic data show that such a randomized method achieved remarkable achievements in performance when compared with some exact algorithm based on tree-enumeration. Solving such a problem showcased how the original estimation model was extended and re-designed to solve the problem with an untypical relational measure.(2) Identifying the statistical correlated items. Correlation computing plays an important role in many applications, such as machine learning, climate studies. The heuristic method based on quasi-monotonic property in past work is unsatisfied in many cases. We found that the given correlation coefficient (Pearson coefficient) can be solved by hashing scheme. Specifically, a pruning algorithm based on A-LSH model remarkably outperforms the exact method and the algorithm based on B-LSH. Solving such a problem states the advantage of A-LSH scheme and showcases how the applicable range of the randomized filter model is extended in practice.(3) Similarity join on weighted sets. Considering the weight information for set objects when performing a similarity join operation indicates the deeper understanding of objects and thus more accurate results. For given weighted set similarity, both the estimation model and the filter model have been produced (on the basis of different hashing scheme) for solving the problem. Substantial experiments obtain a comparative performance analysis. Solving such a problem states the effects of choosing various models, at the premise that both randomized are applicable for the given problem.In summary, we focus on a basic search of pairwise join in this thesis, and obtain a family of random results for the given problem. We discuss the applicable range and necessary conditions of proposed randomized technique, and develop a group of general random models. Furthermore, we provide several chosen applications with a family of solutions based on theoretical models. All the experimental results evidence that compared with both according exact methods and randomized schemes in past studies, performing the proposed models achieves the significant performance gain in running time, scalability and cost-effectiveness in almost all test cases.
Keywords/Search Tags:Pairwise Join, Hashing Scheme, Sketch, Estimation Model, Filter Model
PDF Full Text Request
Related items