The main task of clustering is to partition the dataset into several groups without prior knowledge of the dataset,aiming to minimize the inter-cluster similarity and maximize the intra-cluster similarity.For instance,we may not know the number of cluster or the distribution of the data beforehand.As an important branch of data mining,cluster analysis is widely used in many research fields,such as multimedia classification,image segmentation,bioinformatics,etc.Clustering algorithms can be categorized into different paradigms according to different schema: partitional clustering,hierarchical clustering,density-based clustering,grid-based clustering,model-based clustering,graph theory-based clustering.Density-based methods have been a hot research direction because of their good interpretability and ability to discover clusters of arbitrary shapes.DBSCAN is a classical density-based clustering algorithm,which has attracted many scholars' attention.However,the time complexity problem,especially the time cost of searching the nearest neighbors,which leads to the inefficiency of the algorithm when the data set is large or the dimension is high,has always been one of the main problems of DBSCAN.Locality sensitive hashing algorithm is an effective means to solve the problem of neighborhood retrieval.We can quickly find the nearest neighbors of a data point from a large number of high-dimensional data sets with the help of LSH index.Aiming at the efficiency problem of DBSCAN algorithm,this paper proposes two improved algorithms,LSH-DBSCAN and LSHSNN-DBSCAN which combine with locality-sensitive hashing,to improve the efficiency of DBSCAN algorithm from two directions of "quantity" and "dimension".The specific research results are as follows:(1)This paper elaborates the idea and mathematical principle of LSH in detail.Then it describes the process of constructing LSH index.At last,based on the analysis of the experimental results,it is concluded that the local sensitive hash index can quickly find the candidate set of the nearest neighbor points of data points(2)The LSH-DBSCAN algorithm based on LSH index is proposed to improve the efficiency of DBSCAN algorithm from the perspective of quantity reduction.In order to solve the time-consuming problem of traditional DBSCAN algorithm in searching epsilon neighbourhood because it needs to build distance matrix for all data,inspired by the structure of LSH index,LSH-DBSCAN divides the data objects which appear in the same index many times in the hash table into the same sub-cluster.Then LSH-DBSCAN selects the representative points in different sub-clusters.After that the traditional DBSCAN algorithm is run only on the representative points.Finally,the points in the same sub-cluster as the representative point are marked as the same cluster as the representative point.Therefore,the DBSCAN algorithm used for clustering only needs to run on the set of representative points instead of the whole data set,which avoids spending a lot of time traversing the whole data set when searching for neighborhoods which improves the efficiency.The experimental results show that the proposed LSH-DBSCAN algorithm can improve the efficiency while ensuring the accuracy of clustering results.Compared with other experimental algorithms,the LSH-DBSCAN algorithm has obvious advantages when the amount of data is large(3)The LSHSNN-DBSCAN algorithm based on LSH index is proposed to improve the efficiency of DBSCAN algorithm from the perspective of dimensionality.In order to solve the problem that high-dimensional data spend a lot of time in calculating the similarity of data objects,and the traditional Euclidean distance is meaningless in high-dimensional space,the two main improvements of LSHSNN-DBSCAN algorithm proposed in this paper are as follows: Firstly,related research has proved that using shared neighbors can effectively weaken the impact of "dimension disaster" in high-dimensional data.In this paper,we use the intersection of neighbor candidate sets of data objects to express the similarity between them,which reduces the impact of caused by that Euclidean distance become meaningless in high-dimensional space and reflects the distance between objects better.Secondly,the algorithm for searching neighborhood is improved.The traditional DBSCAN algorithm compares the entire data set when searching for neighborhoods.In LSHSNN-DBSCAN,only the near neighbor candidate set of this point need to be searched when querying the epsilon neighbourhood(that is,the near neighbor set searched by LSH index),instead of the whole data set.Thereby improving the search speed of the epsilon neighborhood.The experimental results show that the proposed algorithm improves the clustering accuracy and efficiency in most datasets compared with other experimental algorithms and has certain advantages. |