Research on the Application and Optimization of Density Peak Clustering

Posted on: 2021-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Zhang
GTID: 2428330623969895
Subject: Statistics
Abstract/Summary:
In recent years, with the rapid development of big data and artificial intelligence technology, people's work and lifestyles have changed dramatically. Mobile payments, webcasts, video broadcasts, bicycle sharing, online shopping and artificial intelligence have reached every aspect of daily life and become some of today's most discussed topics, and the amount of data generated has grown explosively. Cluster analysis is an important branch of data mining, an important technique in statistical analysis, and an unsupervised machine learning method. Without any prior information, it explores the inherent structure and similarity relationships of the data, so that samples in the same cluster are highly similar and samples in different clusters differ substantially. Cluster analysis can therefore be used to mine the unorganized but valuable information contained in massive data sets and to support practical research that promotes social development.

Alex Rodriguez et al. proposed a clustering algorithm based on fast search and find of density peaks (DPC) in Science in 2014. The algorithm redefines the concept of a cluster center and maps the data to a two-dimensional space (local density and the minimum distance to any point of higher density). Cluster centers are then identified and the remaining points are grouped in this new space. The DPC algorithm can quickly find the density peaks of data sets of arbitrary shape, and can efficiently assign sample points and remove outliers. Since it was proposed, the algorithm has been applied in many fields, such as community detection, image processing, computer vision and text processing, and it has been widely recognized.

However, further study of the DPC algorithm has also exposed several deficiencies in application. The algorithm has no uniform density metric, and the parameter dc is difficult to determine directly. In addition, the cluster centers must be selected manually, and a misassigned sample tends to propagate its error to the points assigned after it. Another flaw is that the algorithm cannot effectively handle complex manifolds or data sets whose clusters differ in density. To obtain better results in application, this paper proposes two optimized clustering algorithms and applies them to text mining of electronic medical records.

First, when processing data with density differences, the density peak clustering algorithm cannot effectively measure the density peaks of points located in low-density regions, and it incorrectly merges sparse, low-density clusters into dense, high-density clusters. This paper proposes a density peak clustering algorithm based on relative density optimization, which redefines both the local density of sample points and the method for assigning the remaining points. It solves the problem of identifying low-density regions and extends the research methods for density peak clustering.

Second, the density peak clustering algorithm cannot effectively identify cluster centers when processing multi-density data or data on complex manifolds, and it incorrectly splits one cluster or merges two clusters. Inspired by the density peak clustering algorithm and the DBSCAN algorithm, this paper redefines the local density of sample points using shared neighbors and performs cluster analysis using the ideas of DBSCAN (identifying core points and connecting neighbors). In addition, a nonparametric statistical test is used to merge subclusters. On this basis, a clustering algorithm based on shared neighbors and statistical tests is proposed, which compensates for the inability of density peak clustering to handle complex manifold data effectively.

Finally, with the rapid rise of the Internet medical industry and the digitalization of hospitals, paper medical records are gradually being abandoned, leading to the accumulation of massive numbers of electronic medical records. The classic DPC algorithm and the improved algorithm in this paper were applied to text mining of electronic medical records to verify the effectiveness of the optimized algorithm in cluster analysis for text mining. The goal is to analyze the electronic medical record text accumulated by a hospital and to find the underlying disease characteristics and the corresponding diagnosis and treatment modes.

This paper proposes two optimized algorithms based on density peaks, one for data with variable density and one for data with irregular structure or complex manifolds. The experimental results show that the algorithms retain the advantages of the DPC algorithm while drawing on other algorithms such as DBSCAN and on the idea of statistical testing. The optimized algorithms greatly improve clustering accuracy and robustness to parameter choice. In text mining of electronic medical records, the optimized algorithm completes the clustering task well, which is of great significance for improving the efficiency and quality of clinical diagnosis and treatment.

The innovation of this paper is manifested in three aspects. First, in view of the poor performance of the density peak clustering algorithm on data with variable density, multiple densities and complex manifolds, two optimized density peak algorithms are proposed. They better handle low-density regions and complex manifold data, and extend the research methods for density peak clustering. Second, because of the particularity and complexity of cluster analysis, it lacks significance tests and its analysis process is incomplete. By drawing on the advantages of other algorithms and on the idea of statistical testing, this paper uses nonparametric tests to merge subclusters in cluster analysis, which achieves good results and provides a new perspective and method for cluster analysis. Third, the improved algorithm is applied to text mining of electronic medical records in order to analyze the electronic medical record text accumulated in a hospital and to find the underlying disease characteristics and the corresponding diagnosis and treatment modes. This is of great significance for improving the efficiency and quality of clinical diagnosis and treatment.
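
For readers unfamiliar with the classic DPC procedure summarized in the abstract, the following is a minimal sketch of the baseline algorithm only, not of the thesis's optimized variants. It assumes a cutoff-kernel density (the number of points closer than dc) and picks centers automatically as the points with the largest product of density and distance, which stands in for the manual selection on a decision graph. The function name `density_peaks` and the parameter `n_clusters` are hypothetical and do not come from the thesis.

```python
import numpy as np

def density_peaks(X, dc, n_clusters):
    """Baseline density peak clustering (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density: number of other points closer than the cutoff dc.
    rho = (dist < dc).sum(axis=1) - 1
    # Visit points from densest to sparsest; the stable sort breaks ties by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)
    # The densest point has no denser neighbor, so it gets the largest distance.
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]  # points ranked denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]  # distance to the nearest denser point
        parent[i] = j
    # Assumption: choose centers by the largest rho * delta instead of by hand.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Each remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta
```

Because every non-center point copies the label of its nearest denser point, a single wrong assignment is inherited by all points that depend on it, which is the error propagation the abstract describes. The sketch also makes the sensitivity to dc visible: the densities, and therefore the chosen centers, change with the cutoff.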
Keywords/Search Tags: peak density, relative density, manifold clustering, statistical test, electronic medical record