
Efficient KNN Processing On High-Dimensional Data Streams

Posted on: 2017-11-10    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Yang
GTID: 1318330512452730    Subject: Computer software and theory

Abstract/Summary:
With the development of computer science, data collection is becoming easier and easier, and both the volume and the dimensionality (number of features) of data are growing rapidly. Many kinds of objects, including web documents, gene expression data, and multimedia data, can be represented as high-dimensional vectors (points in a high-dimensional space) through feature extraction. However, because of the "curse of dimensionality", query processing on high-dimensional datasets is extremely expensive, and the efficiency of many multidimensional indexes degrades rapidly as dimensionality increases. A common operation in databases, information retrieval, and data mining is to find the objects in a dataset that are most similar to a query (or to each query in a query set). Given a distance metric (e.g., Euclidean distance) to measure similarity, this operation can be formulated as the high-dimensional k nearest neighbor (kNN) problem, or the kNN join problem for a query set. To date, high-dimensional kNN and kNN join problems have mostly been studied on static datasets. However, in more and more applications we need to monitor kNN or kNN join results continuously over data streams, which is a new problem to be studied.

Firstly, we study continuous kNN join processing on high-dimensional data streams, which requires the results for a given query set to be refreshed in real time on every update. Our basic idea is to first find the queries that can be affected by an updated point, and then update the kNN results of only those queries. Searching for the affected queries can be viewed as a reverse kNN problem. We introduce a new index structure, the High-Dimensional R-tree (HDR-tree), to efficiently search for the queries affected by updated points. The HDR-tree is built on the query set and stores information about the kNN join results. We use clustering and principal component analysis (PCA) for dimensionality reduction to lower the search cost while guaranteeing the accuracy of the results.

Furthermore, we study approximate processing of continuous kNN joins on high-dimensional data streams to further increase efficiency, since in many cases approximate kNN join results are acceptable and a small loss in precision is a reasonable trade for faster response. To this end, we propose two index structures that support fast approximate retrieval of affected queries: the HDR*-tree and LSH-M. The HDR*-tree uses random projection for dimensionality reduction instead of PCA; because random projection approximately preserves the distance relationships between points, the HDR*-tree can be more efficient than the HDR-tree while providing a reasonable lower bound on accuracy. LSH-M first applies a set of carefully designed locality-sensitive hashing (LSH) functions to the query set to filter out a candidate set of affected queries, and then identifies the truly affected queries by direct distance computation against the candidates. LSH-M is also shown to be more efficient than the baseline methods and the HDR-tree-based method, again with a reasonable lower bound on accuracy. We analyze the different properties of the HDR*-tree and LSH-M and discuss the cases to which each is suited. Experiments on both real-world and synthetic datasets demonstrate the efficiency of both methods.
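To make the maintenance idea above concrete, the following is a minimal Python sketch of updating only the affected queries on each stream insertion. It is an illustration under simplifying assumptions, not the thesis implementation: Euclidean distance, an in-memory query set, insertions only (no expirations), and a brute-force scan over the queries standing in for the HDR-tree search; the class name `AffectedQueryMonitor` is hypothetical.

```python
import heapq
import numpy as np


class AffectedQueryMonitor:
    """Each query caches its current kNN as a max-heap keyed by distance.
    On a stream insertion, only the queries whose current k-th distance
    exceeds the distance to the new point are "affected" and updated."""

    def __init__(self, queries: np.ndarray, k: int):
        self.queries = queries                        # shape: (num_queries, dim)
        self.k = k
        self.knn = [[] for _ in range(len(queries))]  # heaps of (-dist, point_id)

    def kth_distance(self, qid: int) -> float:
        """Distance from query qid to its current k-th nearest neighbour."""
        if len(self.knn[qid]) < self.k:
            return float("inf")                       # fewer than k points seen
        return -self.knn[qid][0][0]

    def insert(self, point: np.ndarray, point_id: int) -> list[int]:
        """Process one insertion; return the ids of the affected queries."""
        # Brute-force distances from the new point to all queries; the
        # HDR-tree replaces this scan with a filtered tree traversal.
        dists = np.linalg.norm(self.queries - point, axis=1)
        affected = [qid for qid, d in enumerate(dists)
                    if d < self.kth_distance(qid)]
        for qid in affected:
            heapq.heappush(self.knn[qid], (-dists[qid], point_id))
            if len(self.knn[qid]) > self.k:
                heapq.heappop(self.knn[qid])          # evict the old k-th NN
        return affected


# Usage: maintain a 3-NN join of 100 random queries over a stream of points.
rng = np.random.default_rng(0)
monitor = AffectedQueryMonitor(rng.random((100, 16)), k=3)
for pid in range(1000):
    monitor.insert(rng.random(16), pid)
```

The point of the sketch is the asymmetry the thesis exploits: the per-query k-th distances act as filtering radii, so most insertions affect only a small fraction of the query set, and the index structures (HDR-tree, HDR*-tree, LSH-M) exist to find that fraction without scanning every query.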
Finally, to further improve efficiency and scale to large volumes of data, we address distributed kNN processing on high-dimensional data streams. Existing methods for the high-dimensional kNN problem are mostly designed for static datasets or for a single centralized server, and are therefore hard to adapt to our problem. We propose a new distributed index, the Dynamic Bounded Rings Index (DBRI). DBRI partitions the data points into bounded rings surrounding a set of selected pivots, according to each point's distance to its nearest pivot, so that every point belongs to exactly one bounded ring. DBRI is easy to maintain and adapts dynamically to the data stream. It also suits distributed environments well, since individual rings can be allocated to different nodes. On top of DBRI we propose a distributed high-dimensional kNN query algorithm (DHDKNN). DHDKNN processes a kNN query in two iterations, avoiding distance computations against all points, which leads to lower communication cost and more predictable performance than existing methods. We implement DBRI and DHDKNN on top of Apache Storm, an open-source distributed stream processing platform, and conduct extensive experiments on both real-world and synthetic datasets to evaluate the performance of our methods.
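As a rough illustration of the ideas behind DBRI and DHDKNN, here is a hypothetical single-process Python sketch; the function names, the fixed ring width, and the triangle-inequality pruning bound are assumptions for illustration, not the authors' implementation, and the actual system distributes the rings across Apache Storm nodes and answers queries in two iterations.

```python
import numpy as np


def build_dbri(points: np.ndarray, pivots: np.ndarray, ring_width: float):
    """Assign each point to its nearest pivot and to a bounded ring around
    that pivot; the (pivot_id, ring_id) partitions are the units that a
    distributed deployment would spread across worker nodes."""
    # Pairwise distances, shape (n_points, n_pivots).
    d = np.linalg.norm(points[:, None, :] - pivots[None, :, :], axis=2)
    pivot_id = d.argmin(axis=1)
    radius = d[np.arange(len(points)), pivot_id]
    ring_id = (radius // ring_width).astype(int)

    partitions: dict[tuple[int, int], list[int]] = {}
    for idx, key in enumerate(zip(pivot_id.tolist(), ring_id.tolist())):
        partitions.setdefault(key, []).append(idx)
    return partitions


def ring_lower_bound(query: np.ndarray, pivot: np.ndarray,
                     ring_id: int, ring_width: float) -> float:
    """Triangle-inequality lower bound on the distance from `query` to any
    point in the given ring. Rings whose bound exceeds the distance of the
    current k-th candidate can be skipped in the second query iteration."""
    dq = float(np.linalg.norm(query - pivot))
    lo, hi = ring_id * ring_width, (ring_id + 1) * ring_width
    return max(0.0, dq - hi, lo - dq)


# Usage: partition 10,000 random 32-d points around 8 random pivots.
rng = np.random.default_rng(0)
pts, pvs = rng.random((10_000, 32)), rng.random((8, 32))
parts = build_dbri(pts, pvs, ring_width=0.1)
```

Because every point lies in exactly one ring and each ring carries a distance interval relative to its pivot, a coordinator can first collect k candidates from the rings nearest the query and then, in a second round, contact only the rings whose lower bound beats the current k-th candidate distance, which is what keeps the communication cost bounded.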
Keywords/Search Tags: High-Dimensional Data, Data Streams, KNN, Index