
Research On Spectral Clustering Algorithm Based On Nearest Neighbor Analysis

Posted on: 2021-05-13  Degree: Master  Type: Thesis
Country: China  Candidate: W M Tian  Full Text: PDF
GTID: 2518306038986899  Subject: Signal and Information Processing
Abstract/Summary:
In the information age, we face the problem of summarizing and organizing the massive data generated by the flow of information. Large amounts of data are produced during high-speed generation, transmission, and storage, and as demands on information accuracy keep growing, both the volume and the dimensionality of the data increase sharply. These changes not only make it harder to extract valid information from massive databases, but also make it impractical to label data sets in advance. Effective information extraction has therefore become an inevitable trend in the development of the information age.

Clustering is a popular unsupervised learning method, and in recent years spectral clustering has become a hot topic in clustering research, usually showing good clustering performance. Spectral clustering generally consists of the following core steps. First, a graph and a similarity matrix are constructed to represent the data set. Second, the corresponding normalized or unnormalized Laplacian matrix is computed from the similarity matrix, and its eigenvalues and corresponding eigenvectors are obtained via eigendecomposition. Finally, selected eigenvectors form a feature space, in which an algorithm such as K-means produces the clustering result. Among these steps, the construction of the similarity matrix is the key step that largely determines clustering performance.

However, traditional models do not account well for the distribution structure of the data set, so they cannot truly reflect the similarity between data points. In a fully connected graph, the relationship between samples is obtained directly from each independent sample, and the sample metric, most often the Euclidean distance, is global. The correlation between samples is therefore relatively simple; for multi-scale data sets and data sets with manifold structure in particular, it cannot fully describe the local features between samples. The resulting similarity description is incomplete and cannot fully reflect the true relationships between samples or the underlying data structure.

To address these problems, this thesis studies spectral clustering algorithms based on neighbor-graph analysis. The main contributions are summarized as follows. First, inspired by density-based clustering, we find that optimizing the similarity matrix through the nearest-neighbor relationships between data points can reflect their true neighborhood relations very well, and we develop a new method for constructing the similarity matrix. Instead of the traditional k-nearest-neighbor graph or fully connected graph, it considers the k-neighborhood relation and the mutual-neighbor relation to mine both the local and global structure of the data. A weight-relationship matrix is obtained and used to update the similarity matrix, which improves the spectral clustering result. Furthermore, regarding the clustering results, we propose a new test of whether the eigengap becomes larger: according to matrix perturbation theory, the larger the eigengap, the more stable the matrix partitioning. A model that enlarges the eigengap is therefore constructed to improve the ability of the proposed method to extract the density and connectivity information contained in the data set.

Second, combined with the weight-relationship matrix, a new self-adjusting Gaussian kernel function is proposed to further improve performance: its parameters are determined adaptively and the weight-relationship measure is taken into account. This avoids both the difficulty of manual parameter selection for the Gaussian kernel and the limitation of computing the kernel only from the Euclidean distance, so the method can accurately extract the manifold structure of multi-scale data.

The performance of the proposed method is verified on artificial data sets, UCI data sets, and the MNIST handwritten digit data set, with several existing clustering algorithms used for comparison. The experimental results show that the proposed algorithm performs better on both the artificial and real data sets. Moreover, the proposed method extracts the manifold structure of multi-scale data more accurately and is less sensitive to the choice of the Gaussian kernel parameters.
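The thesis does not include code, but the generic pipeline it describes can be sketched as follows. The mutual k-nearest-neighbor mask and the self-tuning bandwidth (each point's scale set to the distance to its k-th neighbor, in the style of Zelnik-Manor and Perona's self-tuning spectral clustering) are stand-in assumptions for the thesis's weight-relationship construction, not its exact method.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, n_clusters, k=10):
    """Spectral clustering with a mutual-kNN, self-tuning Gaussian affinity."""
    n = len(X)
    D = cdist(X, X)  # pairwise Euclidean distances

    # Self-tuning bandwidth: sigma_i = distance to the k-th nearest neighbor.
    sigma = np.sort(D, axis=1)[:, k]
    W = np.exp(-D**2 / (sigma[:, None] * sigma[None, :]))

    # Mutual kNN mask: keep an edge only if each point is among the other's
    # k nearest neighbors (an assumption standing in for the thesis's
    # weight-relationship matrix).
    idx = np.argsort(D, axis=1)[:, 1:k + 1]      # skip self at position 0
    knn = np.zeros((n, n), dtype=bool)
    knn[np.arange(n)[:, None], idx] = True
    W = W * (knn & knn.T)
    np.fill_diagonal(W, 0.0)

    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    deg[deg == 0] = 1e-12                        # guard isolated vertices
    dinv = 1.0 / np.sqrt(deg)
    L = np.eye(n) - dinv[:, None] * W * dinv[None, :]

    # Eigenvectors of the n_clusters smallest eigenvalues span the embedding.
    _, vecs = eigh(L)
    U = vecs[:, :n_clusters]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)

    # K-means in the spectral embedding yields the final labels.
    _, labels = kmeans2(U, n_clusters, minit='++', seed=0)
    return labels
```

On well-separated groups, the mutual-kNN mask removes all cross-group edges, the Laplacian becomes block diagonal, and the embedding collapses each group to a single direction that K-means separates trivially.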
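The eigengap argument above can also be made concrete. By matrix perturbation theory, a K-way partition is stable when the gap between the K-th and (K+1)-th smallest Laplacian eigenvalues is large; the helper below computes that gap for a given affinity matrix. This is a generic diagnostic under that standard result, not the thesis's exact test statistic.

```python
import numpy as np

def eigengap_score(W, n_clusters):
    """Eigengap lambda_{K+1} - lambda_K of the normalized Laplacian of W.

    A larger gap indicates (by matrix perturbation arguments) a more
    stable K-way partition.
    """
    deg = W.sum(axis=1)
    deg[deg == 0] = 1e-12
    dinv = 1.0 / np.sqrt(deg)
    L = np.eye(len(W)) - dinv[:, None] * W * dinv[None, :]
    vals = np.sort(np.linalg.eigvalsh(L))
    # 0-based indexing: vals[K] - vals[K - 1].
    return vals[n_clusters] - vals[n_clusters - 1]
```

In practice one would compare candidate similarity matrices (or candidate K values) and prefer the one with the larger gap, which is how the thesis motivates its model for enlarging the eigengap.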
Keywords/Search Tags: Euclidean distance, similarity matrix, K-neighborhood, K-means