
Research On Bidirectional Hierarchical Clustering Algorithm Based On Grid

Posted on: 2019-03-01
Degree: Master
Type: Thesis
Country: China
Candidate: R T Dou
Full Text: PDF
GTID: 2428330548959196
Subject: Engineering
Abstract/Summary:
Clustering algorithms are an important data analysis technology, and many applications use them to help researchers build an abstract, similarity-based model of the objects under study. In general, cluster analysis is an unsupervised process that divides a set of objects into homogeneous subsets based on the similarity between the objects. The clustering process considers only the inherent similarity between objects rather than additional information, so it is often used for the initial processing of data in areas without sufficient prior knowledge, such as gene expression, psychology, market research, and image segmentation. With the spread of clustering algorithms across many fields over the past two decades, a large number of algorithms of various types have been developed to deal with data of different types and sizes.

According to how they define a cluster, clustering algorithms can be roughly classified into the following types: grid-based, density-based, and distance-based. Grid-based clustering algorithms handle big data well because they map data points into a grid, so that subsequent steps operate on grid cells rather than on individual points. One advantage of density-based clustering algorithms is their ability to recognize clusters of arbitrary shape, which makes them very accurate. Distance-based clustering has many branches: K-MEANS, EM and SOM are some of the classic algorithms of this category, and there are many newer algorithms such as FSFDP and entropy clustering. According to the form of the clustering result, clustering algorithms can also be divided into two categories: partitioning-based clustering and hierarchical clustering. The former sets clear boundaries that divide points into different clusters. The latter clusters data points layer by layer until a preset number of clusters is reached or some other condition is met. For a variety of reasons, partitioning-based clustering is often used as a tool to reduce the size of the data, while hierarchical clustering is used to reveal the structure of the data.

Although clustering algorithms can handle conventional problems, the proliferation of information technology and the popularity of big data mean that the new data types and clustering requirements generated by modern applications still challenge the existing algorithms. Most clustering algorithms do not work effectively or efficiently in high-dimensional space, owing to the so-called "curse of dimensionality". They also require large amounts of storage and incur heavy I/O costs. In addition, high-dimensional data often contains a significant amount of noise, which further reduces effectiveness.

This thesis studies a grid-based bidirectional hierarchical clustering algorithm. The basic idea is to use kernel density estimation with a large window width to build a multilevel, coarse-grained grid, and then to merge clusters according to a distance defined on statistical information. The algorithm establishes a hierarchical structure in two stages, namely a top-down phase and a bottom-up phase. In the first phase, building on the OptiGrid algorithm, the algorithm partitions the data space into a coarse-grained, multilevel grid without requiring the user to fix the kernel density estimation parameters. In addition, to improve performance on big and high-dimensional data, an SSC (soft subspace clustering) strategy is applied: a separate feature-ordering strategy is used for subspace selection. In the second phase, the grid cells obtained in the first phase are sorted by the number of points they contain, and clustering is carried out according to a robust distance based on statistical information. The algorithm handles big and high-dimensional data well, and it is stable in the presence of noise and outliers. Experiments show that the algorithm performs well on all kinds of data sets and that its performance exceeds that of many existing algorithms.
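The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's algorithm. The top-down phase is replaced by a plain uniform grid, with no kernel density estimation, OptiGrid partitioning, or subspace selection. The abstract does not define the statistical distance, so Euclidean distance between cluster means is used as a stand-in. All names (`build_grid`, `summarize`, `mean_distance`, `bottom_up`) and the `cell_width` and `n_clusters` parameters are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_grid(points, cell_width):
    """Stand-in for the top-down phase: map each point to a uniform grid cell.

    Assumption: a fixed cell width replaces the KDE-driven multilevel
    partitioning described in the abstract.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(p)
    return cells


def summarize(members):
    """Keep simple per-cluster statistics: point count and mean."""
    n = len(members)
    dim = len(members[0])
    mean = tuple(sum(p[d] for p in members) / n for d in range(dim))
    return {"n": n, "mean": mean, "members": list(members)}


def mean_distance(a, b):
    """Hypothetical distance: Euclidean distance between cluster means."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a["mean"], b["mean"])))


def bottom_up(cells, n_clusters):
    """Bottom-up phase: repeatedly merge the closest pair of clusters.

    Cells are first sorted by point count (densest first), as the abstract
    describes; in this sketch the order only breaks ties between pairs that
    are equally far apart.
    """
    clusters = sorted(
        (summarize(m) for m in cells.values()),
        key=lambda c: c["n"],
        reverse=True,
    )
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda ij: mean_distance(clusters[ij[0]], clusters[ij[1]])
        )
        merged = summarize(clusters[i]["members"] + clusters[j]["members"])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling `bottom_up(build_grid(points, 1.0), 2)` on two well-separated groups of 2-D points returns two clusters, each carrying its point count, mean, and members. Working on cell summaries rather than raw points is what keeps the merging step cheap when the number of occupied cells is much smaller than the number of points.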
Keywords/Search Tags: Clustering algorithm, Hierarchical clustering, Grid clustering, Kernel density estimation