Research And Application Of Density Peak Clustering Algorithm

Posted on: 2019-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: W L Jiang
GTID: 2438330548465032
Subject: Computer software and theory
Abstract/Summary:
Clustering is an unsupervised learning method. It groups data points into clusters based on the intrinsic structure of the data, without any prior knowledge, and thereby reveals the underlying distribution of the data. Clustering has received much attention from researchers, especially in the era of data explosion, and has been widely applied to data mining, machine learning, pattern recognition, image processing, biomedical data analysis and other fields. Density-based clustering algorithms can recognize clusters of arbitrary shape from the distribution of the data and have been applied in many fields.

DPC (clustering by fast search and find of density peaks) is a density-based clustering algorithm proposed in 2014. DPC finds density peaks and clusters of arbitrary shape in a space of any dimension by assigning each point to the same cluster as its nearest neighbor of higher density, and it can find the underlying distribution of the points in a data set efficiently. However, its cutoff distance must be chosen empirically, it uses different density measures for data sets of different sizes, and the density peaks of a data set must be selected manually. Furthermore, it can cause a fatal problem similar to the domino effect: once a point is assigned to the wrong cluster, many more points may be assigned erroneously after it. As a result, DPC performs very poorly on some data sets, especially those in which dense and sparse clusters coexist or in which clusters overlap heavily. How to improve DPC so that it finds the clustering of any kind of data set adaptively and with fewer parameters has become the key issue to be solved.

Feature selection methods are divided into supervised and unsupervised methods according to whether the labels of the exemplars are used in the selection process: supervised methods use label information, while unsupervised methods do not. Feature selection is a very important data preprocessing method. It has been widely used in medical data analysis, image processing, text processing and many other fields, because the data in these fields usually have few samples but high dimensionality, with many redundant features and features that contribute little. Eliminating redundant and less important features from high-dimensional data can not only reduce classification time but also improve classification accuracy. Feature selection has therefore become the first and key step in processing and analyzing high-dimensional data.

This thesis is devoted to solving the aforementioned issues of the DPC algorithm and to applying the proposed algorithms to gene expression data analysis in order to detect key genes related to some cancers. The main work and innovations are as follows:

1. A density peak clustering algorithm based on local standard deviation is proposed. The density measure is improved by using the local standard deviation of a point, so that it reflects how dense the local area around the point is. The proposed algorithm was tested on real data sets from the UCI Machine Learning Repository, on synthetic data sets and on some gene expression data sets. The experimental results demonstrate that the algorithm can not only effectively find the number of clusters in a data set but also recognize clusters of arbitrary shape and density distribution.

2. A new density peak clustering algorithm is proposed that automatically finds both the number of clusters and the clustering of a data set. The algorithm locates the obviously discontinuous position in the γi sequence of the decision graph and selects the first i points as cluster centers. A new assignment strategy and a merging strategy are proposed to solve the fatal problem caused by the efficient one-step assignment strategy of DPC. The algorithm was tested on several uneven and challenging synthetic data sets. Its performance was evaluated in terms of clustering accuracy (Acc), adjusted mutual information (AMI) and adjusted Rand index (ARI), and compared with that of the original DPC and its variants. Friedman's test was performed to verify the power of the proposed algorithm. The results show that its clustering performance is superior to that of the existing density peak clustering algorithms and differs significantly from that of several available clustering algorithms.

3. Feature selection algorithms based on density peak clustering are proposed. The genes are clustered, and the cluster centers are selected to form a gene subset. The exemplars, represented only by the genes in the selected subset, are then clustered. The clustering results are evaluated in terms of clustering accuracy (Acc), the Rand index and the Jaccard coefficient, which at the same time assess the quality of the gene subset. The experimental results show that the proposed algorithms are effective at detecting powerful gene subsets. Although the clustering results cannot recognize all patients, the approach is still an unsupervised feature selection algorithm with comparably good performance.
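For orientation, the baseline DPC procedure that the abstract refers to (cutoff-distance density, assignment to the nearest neighbor of higher density, and the γ values of the decision graph) can be sketched roughly as follows. This is a minimal illustration of the original 2014 algorithm only, not of the improved algorithms proposed in the thesis. The function name `dpc`, its parameters, the tie-breaking by index and the choice of the top-γ points as centers are all assumptions made for the example; here the number of clusters is passed in by hand, which is exactly the manual step the thesis aims to remove.

```python
import numpy as np

def dpc(points, cutoff, n_clusters):
    """Baseline density peak clustering sketch (hypothetical helper).

    points: (n, d) array; cutoff: positive cutoff distance;
    n_clusters: number of centers, chosen by hand in this sketch.
    """
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density rho: neighbors closer than the cutoff (minus the point itself).
    rho = (dist < cutoff).sum(axis=1) - 1
    # Visit points from densest to sparsest; density ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor: use its largest distance.
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]                   # points ranked as denser than i
        j = higher[np.argmin(dist[i, higher])]  # nearest such point
        delta[i] = dist[i, j]
        parent[i] = j
    # Decision value gamma = rho * delta; large values indicate density peaks.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # One-step assignment: each point inherits the label of its denser parent.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta, gamma
```

Because every non-center point simply copies the label of its nearest denser neighbor, a single wrong link near a cluster boundary is propagated to every point that follows it in the chain, which is the domino effect described above.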
Keywords/Search Tags: clustering, density peak, standard deviation, DPC, significance tests, unsupervised feature selection