
Research on Kernel-Based Hierarchical Clustering Algorithm

Posted on: 2022-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Han
GTID: 2518306323455174
Subject: Computer Science and Technology
Abstract/Summary:
Clustering is an important and useful tool in data mining and knowledge discovery. Because hierarchical clustering is fast to compute and its output is easy to interpret, it has been widely used in many fields. This thesis proposes improved hierarchical clustering methods for two usage scenarios: static data and streaming data. Among hierarchical clustering methods for static data, the most widely used is agglomerative hierarchical clustering (AHC). Existing distance-based AHC methods have a key problem: no matter which method is used to extract clusters from the dendrogram, it is difficult to separate adjacent clusters with different densities. For static hierarchical clustering, this thesis identifies the root cause of this problem in existing AHC methods. The concept of entanglement is introduced to explain the merging process that leads to poor-quality dendrograms. Two indicators are put forward, the number of entanglements and the average entanglement level, and they are shown to track an objective measure of dendrogram quality called dendrogram purity. It is shown that using a data-dependent kernel function (rather than a distance metric) is an effective way to solve this problem. The thesis proposes to kernelize existing algorithms (for example, traditional AHC, HDBSCAN, GDL and PHA) with the Isolation Kernel, which is a data-dependent kernel.
In each algorithm, experimental evaluation shows that, compared with distance, the Gaussian kernel and the adaptive Gaussian kernel, the Isolation Kernel produces higher-quality (purer) dendrograms.

When clustering streaming data, existing hierarchical clustering algorithms usually encounter problems such as low scalability and an inability to overcome rigidity, which makes it difficult to process large-scale data sets effectively in real time. For hierarchical clustering of streaming data, this thesis introduces kernel-based set similarity into hierarchical clustering for the first time and, with a suitable data-dependent kernel, adapts it so that the proposed algorithm can capture the dynamic similarity between new samples and detect clusters with different densities. An incremental update strategy for the hierarchical clustering tree enables the algorithm to continuously maintain high-quality dendrograms in real time. The KERCH algorithm is proposed, with an efficient hierarchy update mechanism (efficient algorithms for inserting new data and deleting old data), which can continuously maintain high-quality hierarchical clustering trees in real time in streaming data scenarios. Experimental results on multiple benchmark data sets show that KERCH is more accurate and faster than other scalable hierarchical clustering algorithms.
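The idea described above for static data, replacing the distance metric in AHC with a data-dependent kernel, can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation. It assumes a nearest-neighbour partitioning variant of the Isolation Kernel, in which the similarity of two points is the fraction of random partitions where they fall into the same cell, and it uses plain average linkage on that similarity. The function names `isolation_kernel_matrix` and `kernel_ahc` and the parameters `psi`, `t` and `seed` are hypothetical names chosen for illustration.

```python
import math
import random


def isolation_kernel_matrix(points, psi=4, t=100, seed=0):
    """Estimate a pairwise Isolation Kernel similarity matrix.

    Assumption (illustrative variant): in each of t rounds, psi points are
    sampled as cell centres and every point joins the cell of its nearest
    centre. Similarity = fraction of rounds in which two points share a cell.
    """
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(t):
        centres = rng.sample(points, psi)
        # Index of the nearest sampled centre for every point.
        cells = [
            min(range(psi), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        for i in range(n):
            for j in range(n):
                if cells[i] == cells[j]:
                    counts[i][j] += 1
    return [[counts[i][j] / t for j in range(n)] for i in range(n)]


def kernel_ahc(sim, n_clusters):
    """Average-linkage AHC that repeatedly merges the MOST similar pair."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean kernel similarity between the two clusters.
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # A dense group next to a sparse group (toy data, made up for the demo).
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
    K = isolation_kernel_matrix(data, psi=4, t=200, seed=1)
    print(kernel_ahc(K, n_clusters=2))
```

Because the similarity comes from sample-based partitions, cells are smaller where data is dense and larger where it is sparse, which is what makes the kernel data-dependent. The quadratic loops are kept for readability only; this sketch makes no claim about the efficiency or accuracy results reported in the abstract.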
Keywords/Search Tags: Cluster analysis, Hierarchical clustering, Kernel function, Isolation kernel, Stream data clustering