Font Size: a A A

Research Of DBSCAN Algorithm Based On Locality Sensitive Hashing Method

Posted on:2016-07-17Degree:MasterType:Thesis
Country:ChinaCandidate:Y H LuFull Text:PDF
GTID:2308330470969717Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Clustering is the most common used and more powerful unsupervised learning technique in data mining. It is a useful process which aims to organize the input dataset into a set of finite number of semantically consistent groups with respect to some similarity measure. As a member of clustering algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) proposed is the first algorithm that implements the density-based strategy and it is one of the commonly used algorithms. It is popular because of the capability of discovering clusters with arbitrary shapes without any preliminary information about the groups present in a data set.Although DBSCAN algorithm has a lot of obvious advantages, it also has a lot of shortcomings which can’t be ignored. On the one hand, it takes a lot of time for the traditional DBSCAN algorithm to query the nearest neighbors of each object. This situation will deteriorate when this algorithm is used to deal with high-dimensional and huge datasets. On the other hand, the traditional DBSCAN algorithm also has some disadvantages in the accuracy of clustering results like the inability in identify boundary points, the inability in identify adjacent clusters of different densities, the sensitivity in starting points and the selection of two specified parameters. To solve these two problems, we design two strategies in this paper. The specific research works are summarized as follows:(1) Improved locality-sensitive hashing method based on p-stable distributionThe traditional locality-sensitive hashing methods limit the overall performance of the algorithms in time consumption and space consumption, this paper improve the performance of this algorithm by reducing count of mapping the query point. The traditional p-stable locality-sensitive hashing method firstly compute points which is in the same hashing bucket as the query point, then hashing points of these points are considered as candidate points of the nearest point of the query point, finally compute the distance between the query point and candidate points to obtain the result. In our paper, we select points which are in the same bucket with mapping points of the query point as candidate points. This strategy expands the scope of candidate points to increase the probability of finding the nearest neighbor object. Even we reduce the number of hash tables, the performance of our algorithm will not be affected.(2) Improved DBSCAN algorithmAlthough the traditional DBSCAN algorithm can detect different shapes of clusters, it also has inevitable defects as mentioned above. On the one hand, we take advantage of influence space to replace the ε-neighborhood. The size of ε-neighborhood of each object depends whether the current point is core point or not. Another advantage of introducing influence space is reducing the number of parameters and successfully identifying adjacent clusters of various densities. On the other hand, a new concept---core density reachable is carried out. The optimized algorithm completes the detective of core points in the expansion of the cluster stage, while the detective of border points and noisy points is finished after all core points have been identified. Based on the above mechanism, the probability of border points mistaken as noisy points will be greatly reduce.
Keywords/Search Tags:locality-sensitive hashing method, DBSCAN algorithm, influence space, core density, reachable
PDF Full Text Request
Related items