
Research On K Nearest Neighbor Queries And KNN-joins In High-dimensional Space

Posted on: 2016-09-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q S Du
Full Text: PDF
GTID: 1228330467995433
Subject: Computer software and theory
Abstract/Summary:
Over the years, a considerable amount of research has focused on multidimensional data indexing. Compared to the one-dimensional case, designing multidimensional access methods is difficult because there is no total order that preserves spatial locality. For a given multidimensional database, once a total order is found, a well-known one-dimensional access method such as the B+-tree can be used, and good query performance can be achieved. One such mapping scheme is the space-filling curve. Space-filling curves have been used in a variety of areas, such as scientific computing, database systems, geographic information systems, and digital image processing. A space-filling curve defines a correspondence between a point's coordinates and its sequence number along the curve. The Z curve has been used in most cases; however, most previous work shows that the Hilbert curve has better clustering properties than other curves. To reduce additional seek time, it is more effective to read a set of contiguous disk blocks than a randomly scattered set. It is therefore desirable that objects close in the multidimensional attribute space also be close in the one-dimensional disk space; good clustering reduces disk access time for range queries. In this thesis, a random-shift method based on the Hilbert curve is adopted to solve the K nearest neighbor (kNN) query and the K nearest neighbor join. Both operations are widely applied in data mining and data analysis. The kNN join is especially complex because it combines the kNN query with the join operation. MapReduce is a programming model for processing large-scale data sets on a cluster of computing nodes in parallel; it is popular and widely used in commercial and scientific computing due to its simplicity, flexibility, and fault tolerance.
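As an illustration of the coordinate-to-sequence-number mapping described above, the following sketch implements the standard two-dimensional Hilbert curve encoding (the per-quadrant rotate-and-reflect algorithm). The function name `xy2d` and the restriction to two dimensions are simplifications for exposition; the thesis works with higher-dimensional data.

```python
def xy2d(n, x, y):
    """Map point (x, y) on an n x n grid (n a power of two) to its
    sequence number along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/reflect the quadrant so each level sees a canonical cell
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
print([xy2d(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])  # [0, 1, 2, 3]
```

Because the mapping is a bijection onto 0..n*n-1, sorting records by their curve value yields the one-dimensional total order that a B+-tree can index.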
We also discuss a solution for kNN joins in the MapReduce programming model. This work should be helpful for many applications in data mining. Our contributions are introduced below.

With the Hilbert curve, the points of a multidimensional space can be mapped into one dimension. For a query point, the kNN search can then be translated into a range search around the point's Hilbert value. However, the Hilbert curve cannot always preserve spatial locality, which sometimes leads to errors in the final answer. We therefore adopt the randomly shifted copy method: several randomly shifted versions of the input data set are generated independently, and the kNN search is repeated on each copy. In this thesis we use only O(1) shifts, which we find experimentally to give excellent approximation quality for the kNN query. This remedy makes errors infrequent. In addition, the Hilbert curve's superior clustering properties reduce additional seek time. Finally, the randomly shifted tables are created only once and can be reused for many different queries. The resulting HKNN algorithm is thus a good choice.

In many data mining applications, the kNN join is widely utilized as a primitive operation. It combines the kNN query with the join operation, and in recent years the kNN join over high-dimensional data has been a topic of interest for researchers. Via the Hilbert curve, the points of a multidimensional space can again be mapped into one dimension, but since the curve cannot always preserve spatial locality, errors may occur. As before, randomly shifted versions of the input data set are produced independently: the data is shifted several times, and the kNN join is performed on each randomly shifted copy.
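The randomly-shifted-copy scheme can be sketched in a few lines. To keep the example short, a Morton (Z-order) key stands in for the Hilbert key, and the number of copies, the window width, and all function names are illustrative choices, not the thesis's actual parameters.

```python
import bisect
import random

def morton(x, y, bits=10):
    # Z-order key as a stand-in for the Hilbert key used in the thesis
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def build_shifted_lists(points, copies=3, domain=1024, seed=0):
    """Create `copies` independently shifted lists, each sorted by curve key.
    The shifted tables are built once and reused for every query."""
    rng = random.Random(seed)
    lists = []
    for _ in range(copies):
        sx, sy = rng.randrange(domain), rng.randrange(domain)
        keyed = sorted((morton((x + sx) % domain, (y + sy) % domain), (x, y))
                       for x, y in points)
        lists.append(((sx, sy), keyed))
    return lists

def approx_knn(lists, q, k, window=4, domain=1024):
    """Collect curve predecessors/successors of q in every shifted list,
    then rank the candidates by their true squared distance."""
    cands = set()
    for (sx, sy), keyed in lists:
        key = morton((q[0] + sx) % domain, (q[1] + sy) % domain)
        i = bisect.bisect_left(keyed, (key,))
        for j in range(max(0, i - window), min(len(keyed), i + window)):
            cands.add(keyed[j][1])
    return sorted(cands, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)[:k]
```

Widening the window of predecessors and successors examined per list can only improve the approximation quality, at the cost of more distance computations.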
We are then more likely to find the true nearest neighbors by looking at the Hilbert predecessors and successors of each query point in every shifted list. Examining more than one predecessor and successor only improves the quality of the approximation, yielding better results. Although the HKNNJ algorithm is not faster than the zχ-kNNJ algorithm, it has superior clustering properties and practical value, so it is also a good choice.

The kNN join is a primitive operation widely used in many data mining applications: for every record of one data set R, it finds the k nearest neighbors in another data set S. Because it combines the join operation with the kNN query, it is an expensive and difficult operation. In recent years, the kNN join over high-dimensional data has received a lot of attention from researchers. As the dimensionality increases, the gap between the distances to the closest and the farthest neighbor shrinks rapidly for most data sets, a phenomenon known as the curse of dimensionality. In this thesis we investigate how to perform the kNN join with the Hilbert curve using MapReduce, a framework for efficient, fault-tolerant workload distribution in large clusters. We put forward a three-stage approach and explore each stage. We implemented the proposed algorithms in Hadoop and analyzed their performance characteristics on real data sets.

A straightforward way to compute kNN joins in MapReduce is the block nested loop methodology. The main idea is to partition R and S into equal-sized blocks; every possible pair of blocks (one from R and one from S) is assigned to a bucket. For every record in the local R block of a bucket, its kNNs in the local S block are found using a nested loop. Bulk-loading an R-tree is very efficient, and kNN search in an R-tree is also efficient.
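The block nested loop scheme just described can be sketched in pure Python, with a single process simulating the map and reduce phases; the round-robin partitioning and all helper names are illustrative, not the thesis's Hadoop implementation.

```python
from collections import defaultdict

def dist2(r, s):
    # squared Euclidean distance between two points of equal dimension
    return sum((a - b) ** 2 for a, b in zip(r, s))

def knn_join_bnl(R, S, k, num_blocks=2):
    # partition R and S into equal-sized blocks (round-robin here)
    R_blocks = [R[i::num_blocks] for i in range(num_blocks)]
    S_blocks = [S[i::num_blocks] for i in range(num_blocks)]
    # "map": every (R block, S block) pair forms a bucket; a nested loop
    # finds each local R record's kNNs within the bucket's S block
    partial = defaultdict(list)
    for rb in R_blocks:
        for sb in S_blocks:
            for r in rb:
                partial[r].extend(sorted(sb, key=lambda s: dist2(r, s))[:k])
    # "reduce": merge the per-bucket candidate lists into the global kNNs
    return {r: sorted(c, key=lambda s: dist2(r, s))[:k]
            for r, c in partial.items()}
```

Because every R block is paired with every S block, the merged candidates for each record always contain its true k nearest neighbors, so the result matches a brute-force join.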
On this basis, we build a Hilbert R-tree index over the local S block in each bucket to help find kNNs within that bucket. Extensive experiments demonstrate that our proposed methods are efficient and scalable.
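A rough sketch of such a bucket-local index: sort the S block on its curve key, pack consecutive runs into fixed-size leaf pages with bounding boxes (the essence of Hilbert-packed bulk loading), and answer kNN with a best-first scan over pages. This is a simplified single-level structure with a Morton key standing in for the Hilbert key; the page size and function names are illustrative assumptions, not the thesis's implementation.

```python
import heapq

def morton(x, y, bits=10):
    # Z-order key as a stand-in for the Hilbert key
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def bulk_load(points, page_size=4):
    """Curve-order packing: sort by key, cut into pages, record each MBR."""
    ordered = sorted(points, key=lambda p: morton(p[0], p[1]))
    pages = []
    for i in range(0, len(ordered), page_size):
        pg = ordered[i:i + page_size]
        xs, ys = [p[0] for p in pg], [p[1] for p in pg]
        pages.append(((min(xs), min(ys), max(xs), max(ys)), pg))
    return pages

def mindist2(q, mbr):
    # squared distance from q to the nearest point of the bounding box
    x0, y0, x1, y1 = mbr
    dx = max(x0 - q[0], 0, q[0] - x1)
    dy = max(y0 - q[1], 0, q[1] - y1)
    return dx * dx + dy * dy

def knn(pages, q, k):
    """Best-first search: visit pages in MINDIST order and stop once the
    next page cannot beat the current k-th best distance."""
    heap = [(mindist2(q, mbr), pg) for mbr, pg in pages]
    heapq.heapify(heap)
    best = []  # max-heap of (-dist2, point)
    while heap:
        d, pg = heapq.heappop(heap)
        if len(best) == k and d > -best[0][0]:
            break
        for p in pg:
            d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if len(best) < k:
                heapq.heappush(best, (-d2, p))
            elif d2 < -best[0][0]:
                heapq.heapreplace(best, (-d2, p))
    return sorted((p for _, p in best),
                  key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
```

Inside a bucket, this replaces the inner nested loop: only pages whose bounding boxes can still contribute to the k nearest neighbors are opened.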
Keywords/Search Tags: K Nearest Neighbor, Hilbert Curve, Space-Filling Curve, K Nearest Neighbor Join, MapReduce, Hilbert R-tree