
Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data

Posted on: 2022-10-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Hu
Full Text: PDF
GTID: 1488306569484224
Subject: Computer software and theory
Abstract/Summary:
With the wide application of computer technology in all walks of life, many applications are able to collect large volumes of data, and processing such big data efficiently is a major challenge. A common way to improve efficiency is to adopt approximate approaches, motivated by the fact that approximate results are acceptable in many application scenarios. This dissertation focuses on two important kinds of queries over multi-dimensional data: aggregation queries and nearest neighbor search (a.k.a. nearest neighbor queries). Briefly, an aggregation query is a SQL query that returns one or more aggregate values such as AVG or SUM. As a basic component of online analytical processing, aggregation queries are usually processed in decision-support systems. Given a query point, nearest neighbor search finds the data point closest to it. Nearest neighbor search is fundamental to a wide range of applications, such as image retrieval, product recommendation and pattern recognition. When data cardinality is high, random sampling is often applied to reduce the amount of data to be processed so that aggregation queries can be answered efficiently; when data dimensionality is high, symmetric Locality-Sensitive Hashing (LSH for short) and Asymmetric Locality-Sensitive Hashing (ALSH for short) are often adopted to overcome the "curse of dimensionality" so that nearest neighbor search can be performed efficiently. This dissertation investigates four problems related to sampling-based approximate aggregation query processing and LSH- or ALSH-based approximate nearest neighbor search. The research issues and contributions of this dissertation are summarized as follows.

Firstly, this dissertation studies the problem of online sampling-based error-bounded approximate aggregation query processing under the assumption that the sample variance equals the dataset variance; in fact, most related works rely on this assumption. To address the problem, the bit-oriented sampling paradigm is intensively studied for AVG and SUM aggregations for the first time. It selects samples at the granularity of bits instead of data items, which makes it different from and more flexible than the traditional data-item-oriented sampling paradigm. A bit-oriented uniform-sampling-based approximate aggregation method called DVBM is proposed as a solution to the problem; it minimizes the sampling size for a given error bound on the aggregation result. Theoretically, we prove that the sampling size required by DVBM is smaller than that required by the data-item-oriented uniform-sampling-based approximate aggregation method. Furthermore, the experimental results show that, compared with the baseline, DVBM is several times more efficient and also produces more accurate aggregation results. However, when the data is highly skewed, DVBM and the baseline can hardly guarantee a given error bound, since the sample variance can deviate seriously from the dataset variance and thus the above assumption no longer holds.

Secondly, this dissertation studies the problem of online sampling-based approximate aggregation query processing with a strong error-bound guarantee. To address the problem, we propose a bit-oriented uniform-sampling-based approximate aggregation method called DVFM. To provide a strong error-bound guarantee, the method does not rely on the above assumption, although this may result in a larger sampling size than DVBM. In DVFM, two bit-oriented uniform-sampling-based approximate aggregation algorithms are introduced, called the one-estimation algorithm and the multi-estimation algorithm. Their efficiencies are verified with extensive experiments. Specifically, compared with several baselines that can also provide a strong error-bound guarantee, the one-estimation and multi-estimation algorithms usually perform much better in terms of efficiency. Moreover, the experimental results show that they can indeed guarantee a given error bound in all cases. The two algorithms require different sets of parameters to be set a priori, and the parameter values directly affect their efficiencies. In practice, it is easier to set the parameters of the multi-estimation algorithm so as to achieve good performance.

Thirdly, this dissertation studies the problem of supporting nearest neighbor search over a set of weighted distance functions with respect to the lp distance, where the set is given beforehand and p ∈ (0,2] is an input parameter. The problem arises in applications such as personalized recommendation. Fundamentally, it is to support nearest neighbor search for each weighted lp distance function in a given set. In the problem setting, the elements of the weight vector associated with a weighted lp distance function are assumed to be positive. To solve the problem, we propose the first LSH-based method, called WLSH. It is the only existing method that is applicable for any p ∈ (0,2] and can provide an approximation-ratio guarantee. It is based on two kinds of LSH families we propose for weighted distance functions: the weighted LSH family and the derived weighted LSH family. The optimization goal of WLSH is to process each query efficiently with accuracy guarantees while minimizing the total number of hash tables required. In addition, two optimization techniques are proposed to balance the query efficiency, query accuracy and space consumption of WLSH. Extensive experiments are conducted on both synthetic and real datasets, and the results verify the performance of WLSH in terms of query efficiency, query accuracy and space consumption.

Finally, this dissertation studies the problem of supporting nearest neighbor search over the generalized weighted Manhattan distance. The problem arises in applications such as personalized recommendation and the training of KNN classifiers. Briefly, the generalized weighted Manhattan distance is a distance measure whose distance function is a weighted l1 distance function in which the elements of the associated weight vector are treated as variables rather than constants, and their values, which can be any real numbers, are specified along with each query. It is easy to see that the generalized weighted Manhattan distance does not obey the triangle inequality and thus is not a metric. We first prove that there is no LSH or ALSH scheme for solving the problem when data points and query points are located in the entire space R^d. Then, we prove that there is still no LSH scheme for solving the problem when data points and query points are both located in a bounded space. After that, two ALSH schemes, (dwl1,l2)-ALSH and (dwl1,θ)-ALSH, are proposed based on two ALSH families we introduce for the generalized weighted Manhattan distance over a bounded space. They are the only two existing methods that can solve the problem in sublinear time. Fundamentally, the two ALSH schemes reduce the problem to a decision version of it, namely the (R1,R2)-NN search problem over the generalized weighted Manhattan distance; a sufficiently good result can be obtained by varying the parameters R1 and R2. We demonstrate the performance of (dwl1,l2)-ALSH and (dwl1,θ)-ALSH with extensive experiments.
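To make the sampling-based setting concrete, here is a minimal sketch of classic data-item-oriented uniform sampling for AVG with a Hoeffding-style error bound. This is only an illustration of the baseline paradigm the dissertation improves upon, not the DVBM/DVFM algorithms themselves (which sample at the bit level); the function name and the assumption of a known value range are ours.

```python
import math
import random

def approx_avg(data, eps, delta, value_range):
    """Estimate AVG(data) to within +/- eps with probability >= 1 - delta,
    assuming every value lies in the known interval value_range = (lo, hi).
    Hoeffding's inequality gives the sufficient uniform sample size:
        n >= (hi - lo)^2 / (2 * eps^2) * ln(2 / delta)
    """
    lo, hi = value_range
    width = hi - lo
    n = math.ceil(width ** 2 / (2 * eps ** 2) * math.log(2 / delta))
    n = min(n, len(data))          # never sample more items than exist
    sample = random.sample(data, n)  # uniform sampling without replacement
    return sum(sample) / n
```

Note how the bound depends on the data range rather than the (unknown) dataset variance; variance-based bounds, as discussed above, require the sample variance to approximate the dataset variance, which fails on highly skewed data.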
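For the weighted nearest neighbor setting with positive weights, one simple construction (a sketch for intuition only, not the dissertation's WLSH families) handles the weighted l2 case, p = 2: scaling each coordinate by the square root of its weight turns the weighted l2 distance into a plain l2 distance, after which a standard p-stable (Gaussian) LSH function applies. All names and parameters below are ours.

```python
import math
import random

def make_weighted_l2_hash(weights, bucket_width, seed=0):
    """Build one E2LSH-style hash function for the weighted l2 distance
    d_w(x, q) = sqrt(sum_i w_i * (x_i - q_i)^2), with all w_i > 0.
    Pre-scaling coordinates by sqrt(w_i) reduces d_w to plain l2,
    so h(v) = floor((a . v' + b) / bucket_width) with Gaussian a is LSH."""
    rng = random.Random(seed)
    d = len(weights)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # 2-stable projection vector
    b = rng.uniform(0.0, bucket_width)           # random offset in [0, w)
    scale = [math.sqrt(w) for w in weights]      # coordinate pre-scaling

    def h(v):
        proj = sum(ai * si * vi for ai, si, vi in zip(a, scale, v))
        return math.floor((proj + b) / bucket_width)

    return h
```

Close points (under the weighted distance) collide in the same bucket more often than far points, which is exactly the LSH property; note, however, that this trick bakes one fixed weight vector into the index, which is why supporting a whole set of weighted distance functions with few hash tables, as WLSH aims to do, is the harder problem.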
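The generalized weighted Manhattan distance itself is easy to state in code, and a tiny example shows why it is not a metric once the query-supplied weights may be negative. This snippet is purely illustrative; the function name is ours.

```python
def gwm_distance(weights, x, q):
    """Generalized weighted Manhattan distance:
    d(x, q) = sum_i w_i * |x_i - q_i|,
    where the weight vector arrives with the query and its entries may be
    any real numbers. With negative weights the 'distance' can be zero or
    negative for distinct points, so it violates the triangle inequality
    and is not a metric."""
    return sum(w * abs(xi - qi) for w, xi, qi in zip(weights, x, q))

# With weights (1, -1), the distinct points (0,0) and (1,1) are at
# "distance" 1*|0-1| + (-1)*|0-1| = 0, and other weight choices even
# yield negative values.
```

This degenerate behavior is what rules out symmetric LSH schemes and motivates the asymmetric (ALSH) constructions over a bounded space described above.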
Keywords/Search Tags: aggregation query, nearest neighbor search, approximate query processing, locality-sensitive hashing