Font Size: a A A

Research And Application Of Spectral Clustering Algorithm

Posted on:2020-06-11Degree:MasterType:Thesis
Country:ChinaCandidate:L J DingFull Text:PDF
GTID:2438330602462394Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Clustering is an unsupervised learning method.It can discover the true distribution of data without any empirical information.This feature makes clustering methods receive significant attention.Spectral clustering is a graph-based algorithm,it has excellent clustering performance and is widely used Compared with traditional clustering,spectral clustering not only can discover arbitrary shape clusters,but also can converge to the global optimum.However,the parameter selection of the spectral clustering algorithm depends on the empirical value,and the clustering result is unstable.How to make the spectral clustering algorithm adaptively select parameters based on the original distribution of samples and obtain stable clustering results becomes a key problem in spectral clustering algorithm research.Feature selection is an important method for data preprocessing.It aims to eliminate redundancy and reduce features.It has been widely used in medical data analysis,text data processing and other fields.According to the labels of samples are to be used or not,the feature selection algorithms can be classified into two categories,such as supervised feature selection algorithms and unsupervised feature selection algorithms.In the data explosion era today,the unlabeled data with high dimensions and high noises accumulated very quickly.Therefore,the unsupervised feature selection algorithms have become the ad hoc focus area of feature selection algorithms.This thesis devotes to solve the following issues:the first is the parameters in self-tuning spectral clustering algorithm,which are affected by the outliers and depend on the empirical value,and the unstable clustering results of spectral clustering algorithms’ affected by K-means.The second is that unsupervised feature selection algorithms always select the feature subset with low classification accuracy;Finally is to find the characteristic features to diagnose and recognize the breast cancer patients.Therefore,a fully adaptive spectral clustering algorithm is present to avoid the influence on the parameters of the self-tuning spectral clustering algorithm by outliers and empirical values and the clustering results of it affected by K-means.A new feature selection algorithm is present based on spectral clustering to select those features with high recognition properties to construct the feature subset of unsupervised feature selection algorithm.Finally,a new algorithm for diagnosis and recognition of breast cancer patients based on spectral clustering is proposed.The main innovations and works are as follows.1.Two spectral clustering algorithms are proposed and named as SC_SD(Spectral Clustering based on Standard Deviation)and SC_MD(Spectral Clustering based on Mean Distance)respectively.The standard deviation of point i,and the mean distance from point i to other points are defined as the radius of the neighborhood of point i by SC_SD and SC_MD,respectively.The number of points in the neighborhood is counted,and the standard deviation of point i in the neighborhood is adopted as its local scaling parameter,so as to avoid the influence from outliers to the local scaling parameter σi of point i,and further to the distortion in clustering results of self-tuning.SD_K-medoids are adopted to instead K-means in self-tuning to avoid the unstable clustering results of K-means,so as to get the true clustering of a dataset by SC_SD and SC_MD.The experimental results on UCI datasets and on synthetic datasets demonstrate that SC_SD and SC_MD can obtain better clustering results than that of traditional spectral clustering algorithm NJW and of self-tuning spectral clustering algorithm,and are robust to noises,and can get good scalability.The proposed SC_SD and SC_MD can detect the clustering of a dataset without any given information.The SC_MD can detect the clustering of a comparable big dataset.2.The unsupervised feature selection algorithms based on spectral clustering are proposed,and named as FSSC for shorted.FSSC is to solve the feature selection problem for gene expression datasets with tens to thousands genses or features and small samples.All features are grouped into clusters by a spectral clustering algorithm,so that similar features are grouped into same clusters.Then the concepts of the feature discernibility and feature independence are proposed and defined,and the feature importance is defined as the product of the feature’s discemibility and independence.After that the representative features from each cluster are selected to construct the optimal feature subset.According to the spectral clustering algorithms used in FSSC,three kinds of unsupervised feature selection algorithms named as FSSC-SD,FSSC-MD and FSSC-ST are developed.The classifiers used in the experiements are SVM and KNN.Experimental results on 10 gene expression datasets show that FSSC-SD,FSSC-MD and FSSC-ST algorithms can select features that are strongly correlated to classification tasks.3.A new algorithm for diagnosing and recognizing the breast cancer patients based on spectral clustering is proposed.In order to detect the key feature subset of breast cancer and improve the recognition accuracy of breast cancer,a new algorithm is proposed to detect the features recognizing the breast cancer patient,named as SDAP(feature selection based on Standard Deviation And Pearson correlation coefficient).This algorithm defines feature discernibility and feature independence based on feature standard deviation and Pearson correlation coefficient,and uses the product of the feature discernibility and independence to measure the importance of the feature.Those features with much higher importance than others are selected to construct the feature subset for recognizing the breast cancer patients.The true self-adaptive spectral clustering algorithms SC_SD and SC_MD algorithm are used to do spectral clustering analysis for breast cancer data,where each sample is with only the selected features.The experimental results on three breast cancer datasets show that SDAP+SC_SD and SDAP+SC_MD methods can get good results for breast cancer patients.
Keywords/Search Tags:spectral clustering, neighborhood standard deviation, unsupervised feature selection, feature independence, feature discernibility
PDF Full Text Request
Related items