In data mining, clustering algorithms can mine valuable information from large amounts of data without supervision. The spectral clustering algorithm is a classical clustering algorithm; in essence, it transforms a clustering problem into an undirected graph cutting problem. Spectral clustering can handle complex non-convex datasets and is not prone to falling into local optima. However, the conventional spectral clustering algorithm uses a Gaussian kernel function based on Euclidean distance to calculate the similarity between samples, which is not only sensitive to the kernel parameter but also fails to correctly reflect the relationships between samples. Aiming at the defects of the similarity measure and the low computational efficiency of the conventional spectral clustering algorithm, this thesis optimizes the conventional algorithm to improve the clustering effect, and parallelizes the optimized algorithm to improve its efficiency in processing massive data. The main work of this thesis is as follows.

(1) Aiming at the problem that the calculation of similarity between samples in the conventional spectral clustering algorithm not only depends on the setting of the kernel parameter but also cannot correctly reflect the relationships between samples, an adaptive density-sensitive similarity measure based spectral clustering (DSSC) algorithm is proposed to improve the clustering effect. Firstly, the Euclidean distances between samples are calculated to obtain the nearest neighbors of each sample. Secondly, the standard deviation of the distances between each sample and its nearest neighbors is calculated as the density parameter. Thirdly, the density-sensitive distances between each sample and its nearest neighbors are calculated. Finally, the similarities between each sample and its nearest neighbors are calculated to construct the similarity matrix. A series of experiments on several synthetic datasets and UCI datasets verifies the effectiveness of the proposed DSSC algorithm.

(2) To improve the efficiency of the proposed DSSC algorithm in processing large-scale datasets, the DSSC algorithm is parallelized by making full use of the CPU and GPU resources of a Dask+CPU/GPU distributed parallel computing platform. Firstly, the similarity matrix is constructed in parallel on the CPU of each worker node, and the resulting matrix is copied from CPU to GPU. Secondly, the degree matrix and the normalized Laplacian matrix are constructed in parallel on the GPU of each worker node. Thirdly, the eigen-decomposition of the normalized Laplacian matrix is performed and the appropriate eigenvectors are selected to construct a new matrix, in parallel on the GPU of each worker node. Fourthly, the K-Means clustering algorithm is executed in parallel on the GPU of each worker node, and the clustering results are copied from GPU to CPU. Finally, the clustering results are gathered from each worker node to the master node. The experimental results indicate that the parallel DSSC algorithm can make full use of the CPU and GPU resources in the Dask cluster to improve the efficiency of processing large-scale datasets.

(3) When the parallel DSSC algorithm processes large-scale datasets in a Dask+CPU/GPU cluster, the datasets must be chunked, and the block size has a large impact on the algorithm's efficiency. Therefore, a dynamic data blocking strategy based on locally weighted linear regression is proposed. Firstly, a large-scale dataset to be processed is divided into a set of sub-datasets used to search for the optimal block size and a set of remaining sub-datasets to be processed. Secondly, the block size of each search sub-dataset is set reasonably, and each search sub-dataset is chunked accordingly. Thirdly, the search sub-datasets are processed sequentially in the Dask+CPU/GPU cluster. Finally, according to the block size and time consumption of each processed sub-dataset, the locally weighted linear regression algorithm is used to accurately and dynamically estimate, online, the block size for each remaining sub-dataset. The experimental results show that this strategy further improves, to a certain extent, the efficiency of the parallel DSSC algorithm in processing large-scale datasets in the Dask+CPU/GPU cluster.
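The four similarity-construction steps of part (1) can be sketched as follows. The abstract does not give the exact density-sensitive distance, so this is a sketch under an assumption: an adaptive Gaussian kernel whose per-sample bandwidth is the standard deviation of the distances from a sample to its k nearest neighbors (the density parameter of step two); the function name and the kernel form are illustrative, not taken from the thesis.

```python
import numpy as np

def dssc_similarity(X, k=5):
    """Sketch of an adaptive density-sensitive similarity matrix.

    Assumed form: W[i, j] = exp(-d(i, j)^2 / (sigma_i * sigma_j)),
    where sigma_i is the std of the distances from sample i to its
    k nearest neighbors.
    """
    n = X.shape[0]
    # Step 1: pairwise Euclidean distances and each sample's k nearest
    # neighbors (column 0 of argsort is the sample itself, so skip it).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    # Step 2: density parameter = std of distances to the k neighbors.
    sigma = np.array([D[i, nn[i]].std() for i in range(n)])
    sigma = np.maximum(sigma, 1e-12)  # guard against duplicate points
    # Steps 3-4: similarities only between each sample and its
    # neighbors, scaled by the two local density parameters.
    W = np.zeros((n, n))
    for i in range(n):
        for j in nn[i]:
            W[i, j] = np.exp(-D[i, j] ** 2 / (sigma[i] * sigma[j]))
    return np.maximum(W, W.T)  # symmetrize: the graph is undirected
```

Restricting similarities to each sample's neighborhood keeps the matrix sparse in spirit and lets the local bandwidths adapt to density, which is what removes the single global kernel parameter.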
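The per-worker pipeline of part (2), that is, degree matrix, normalized Laplacian, eigen-decomposition, and K-Means on the selected eigenvectors, can be sketched in single-machine NumPy form. In the thesis these steps run on the GPU of each Dask worker; NumPy stands in here for the GPU arrays, and the row normalization of the embedding and the deterministic farthest-point K-Means initialization are illustrative choices, not details from the thesis.

```python
import numpy as np

def spectral_cluster(W, n_clusters, n_iter=50):
    """Single-machine sketch of the per-worker spectral clustering steps."""
    n = W.shape[0]
    # Degree matrix and symmetric normalized Laplacian:
    # L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - d_is[:, None] * W * d_is[None, :]
    # Eigen-decomposition; eigh returns eigenvalues in ascending order,
    # so the first n_clusters eigenvectors form the spectral embedding.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Plain Lloyd's K-Means on the embedding, seeded deterministically
    # with greedy farthest-point initialization.
    centers = [U[0]]
    for _ in range(1, n_clusters):
        gap = np.min(((U[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(U[np.argmax(gap)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```

In the parallel version, each array step maps directly onto a GPU array library on the worker, and only the final label vector is copied back to the CPU and gathered at the master node.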
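Part (3) fits a locally weighted linear regression over the (block size, time consumption) pairs collected from already-processed sub-datasets. A minimal sketch, under a loud assumption: here the estimated block size is the candidate that minimizes the predicted run time, and the function names, candidate-set interface, and Gaussian weighting bandwidth `tau` are all illustrative rather than the thesis's actual selection rule.

```python
import numpy as np

def lwlr_predict(x_query, x_train, y_train, tau=0.5):
    """Locally weighted linear regression at one query point.

    Weights each training sample with a Gaussian kernel centered on the
    query and solves the weighted least-squares problem in closed form.
    """
    X = np.column_stack([np.ones_like(x_train), x_train])
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
    theta = np.linalg.pinv(X.T @ np.diag(w) @ X) @ X.T @ np.diag(w) @ y_train
    return theta[0] + theta[1] * x_query

def estimate_block_size(sizes, times, candidates, tau=0.5):
    """Pick the candidate block size with the lowest predicted run time."""
    s = np.asarray(sizes, dtype=float)
    t = np.asarray(times, dtype=float)
    preds = [lwlr_predict(c, s, t, tau) for c in candidates]
    return candidates[int(np.argmin(preds))]
```

Because the local fit is recomputed from whatever (size, time) pairs have been observed so far, the estimate can be updated online as each sub-dataset finishes, which is what makes the blocking strategy dynamic.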