Research On Multiple Density Clustering Algorithm

Posted on: 2021-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Wang
GTID: 1368330623977244
Subject: Computer application technology
Abstract/Summary:
Multiple (varied) density clustering is a challenging topic in machine learning. Because multi-density clustering algorithms have low complexity, strong interpretability and are easy to visualize, they are widely used in many fields, such as bioinformatics, financial data processing, image processing and video processing. However, two limitations remain: it is difficult to obtain high-quality clustering results on datasets with multi-density structure, and performance depends heavily on parameter settings. This thesis focuses on these two issues. We propose local density (the number of data points within a certain cutoff Euclidean distance) as a tool for data structure analysis, which divides all instances into different density levels to achieve multi-density data clustering. We also propose an automatic clustering algorithm based on minimizing the variance of regional density between clusters, and we apply multi-density clustering algorithms to single-cell RNA data processing, image segmentation, face recognition and other tasks. Specifically, the contributions are as follows:

(1) First, we improve the well-known benchmark density peaks clustering (DPC) algorithm and propose a novel method, Multi-center Density Peak Clustering (McDPC). McDPC generalizes better than DPC and can effectively identify clusters with multi-density structure. It addresses two issues of DPC: DPC may fail to identify clusters that have multiple density peaks (multi-centers), and it may fail to identify clusters that lie in low-density areas. Specifically, all data points are assigned to different levels according to their local density, and the data points in each level are then handled with a different strategy. This process identifies the clusters in low-density areas. At the same time, McDPC applies a similar procedure to a second parameter in order to identify clusters with multiple density peaks. To verify the effectiveness of McDPC, we use six synthetic datasets and six real-world UCI datasets, and we also apply McDPC to two clustering tasks: image segmentation and face recognition. The experimental results show that McDPC performs better across these clustering tasks and can effectively identify clusters that have multiple density peaks or are located in low-density areas.

(2) Secondly, we propose a systematic density-based clustering method using anchor points (APC). APC combines DPC's ability to identify outliers with DBSCAN's ability to identify clusters of uniform density in a unified framework, which overcomes the difficulty both methods have in identifying multi-density clusters. APC analyzes the influence of outliers and connection points (data points that lie near the boundaries of natural clusters and whose cluster membership is ambiguous) on the clustering results of multi-density data. It first extracts the outliers of the dataset, then divides the remaining samples into different density levels, and autonomously chooses a clustering strategy for each level according to its distribution. To verify the effectiveness of APC, we select 12 synthetic datasets, 8 real-world UCI datasets and a face recognition dataset. The experimental results show that APC outperforms the other algorithms compared, especially DPC, and can effectively deal with multi-density clusters. Compared with McDPC, APC has better generalization performance and can handle more multi-density datasets.

(3) Although McDPC and APC achieve good results on multi-density clustering tasks, both algorithms have a relatively large number of parameters, which complicates parameter tuning and means that it takes time to obtain good clustering results. To solve this problem, this thesis proposes an adaptive density clustering algorithm (DPADC), an exploratory study of nonparametric density clustering. DPADC uses an objective function based on regional density to merge small clusters, so that it generates better clustering results without parameters. The algorithm has two stages: the first stage generates micro-clusters, and the second stage merges them through local merging and global merging. Local merging is determined by the difference between the between-class distance and the within-class distance, and global merging is determined by whether the regional density variance of the merged classes is reduced. Four synthetic datasets and four real-world UCI datasets are used to test the effectiveness of DPADC. The results show that DPADC can achieve better results without parameter adjustment.

(4) As a practical application of multi-density clustering, this thesis proposes a matching cluster structures-based clustering algorithm (MCSC) and applies it to single-cell RNA sequence data processing. First, two groups of clustering results are generated using k-means, each consisting of different intermediate clusters. Second, these intermediate clusters are divided into micro-clusters and core clusters. Finally, the relationship between micro-clusters and core clusters in high-dimensional space is described using shared nearest neighbors, and a consistency objective function based on normalized mutual information is proposed; solving it yields the best assignment of the micro-clusters. This algorithm can improve the accuracy of density clustering algorithms. Five real single-cell RNA datasets are used to test the effectiveness of MCSC. The experiments show that MCSC can effectively deal with RNA-sequencing data that has higher dimensionality and fewer samples.

In a nutshell, this thesis systematically studies the problems related to multi-density clustering and proposes four novel multi-density clustering algorithms. In particular, APC can identify the cluster structure of 12 synthetic datasets that are often used to verify clustering performance. The main theoretical contributions of this thesis are as follows. First, we propose local density as a tool for data structure analysis: a dataset can be divided into multiple density levels through local density, which makes the subsequent clustering simpler and more efficient. Based on this idea, we propose two multi-density clustering algorithms, McDPC and APC. Secondly, we propose an adaptive clustering method based on regional density.
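
The local density used throughout the abstract (the number of data points within a cutoff Euclidean distance) and the idea of splitting points into density levels can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `local_density` and `density_levels`, the choice to exclude a point from its own count, and the equal-width binning into levels are all assumptions made for illustration, since the abstract does not state how the levels are formed.

```python
import numpy as np


def local_density(points, cutoff):
    """Count, for each point, the other points within `cutoff` (Euclidean)."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Assumption: a point does not count itself, so subtract its zero self-distance.
    return (dists <= cutoff).sum(axis=1) - 1


def density_levels(rho, n_levels):
    """Assign each point to one of `n_levels` density levels (0 = sparsest).

    Assumption: equal-width bins between the minimum and maximum density.
    """
    rho = np.asarray(rho, dtype=float)
    edges = np.linspace(rho.min(), rho.max(), n_levels + 1)
    # Use interior edges only so that labels fall in 0 .. n_levels - 1.
    return np.digitize(rho, edges[1:-1], right=False)


# A dense group of four points, a sparse pair, and one isolated point.
demo_points = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [5, 5], [5.1, 5], [10, 10]]
demo_rho = local_density(demo_points, cutoff=0.5)
demo_levels = density_levels(demo_rho, n_levels=3)
print(demo_rho)     # [3 3 3 3 1 1 0]
print(demo_levels)  # [2 2 2 2 1 1 0]
```

In this toy example the dense group, the sparse pair and the isolated point land in three separate levels, which is the kind of separation that would let each level be handled with its own clustering strategy.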
Keywords/Search Tags: Multiple density clustering, density peak, density levels, parameter-free clustering