Font Size: a A A

Research On Reciprocal Nearest Neighbors Based Hierarchical Clustering Algorithm And Its Application

Posted on:2022-05-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:W B XieFull Text:PDF
GTID:1488306524471064Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The spread of smart devices and the development of storage technology have made it possible to collect and store vast amounts of data.By analyzing and discovering these data,we could obtain good economical and social benefits.With the development of the digital revolution in various industries,it is becoming more and more difficult to get insight into large-scale data in different fields.The hierarchical clustering algorithm,as a powerful tool of big data analysis and mining,plays an important role in organizing data into a multi-level and multi-resolution structure,which helps people to discover the useful knowledge hidden in the data.Traditional hierarchical clustering algorithms,however,can only be applied in some limited tasks because of their poor scalability.Therefore,for large-scale data sets,the trade-off between the time cost and the empirical performance is a long-term challenge.In this thesis,we propose a hierarchical clustering framework based on the reciprocal nearest neighbors,and perform some in-depth study in this framework,including the following aspects:(1)A novel hierarchical clustering algorithm is proposed,named the Reciprocalnearest-neighbors Supported Clustering(short for RSC).RSC algorithm is a novel clustering framework that is different from the classical aggregation clustering(HAC)and the CF-Tree-based incremental clustering.The fundamental assumption of the RSC algorithm is na(?)ve and elegant: two data points that are a pair of reciprocal nearest neighbors(short for RNNs)should be assigned into the same cluster.Based on this assumption,the RSC algorithm aggregates data points and detects RNNs pairs by searching data points' nearest neighbors,and thus,treats the centroid of an RNNs pair as an artificial root for further clustering,which finally arranges all data points into a multi-level and multi-resolution cluster tree.RSC algorithm also contains a cluster-size-based pruning strategy to solve the chain effect suffered by common hierarchical clustering algorithms.Extensive experiments on real-world data show that the RSC algorithm has excellent empirical performance since the RSC algorithm gets the best Rand Index in most cases.Moreover,time-complexity analysis shows that the time-complexity of the RSC algorithm is O(n log n)that is superior to most classical hierarchical clustering algorithms.(2)The RSC algorithm is optimized to deal with the potential complex data,and extended to solve the community detection problem.Community detection,as a special kind of clustering on complex network data,has been widely applied in many domains.However,due to the sparsity and asymmetry of complex networks,the traditional clustering algorithms are unable to be directly used for community detection.In this thesis,the random walk method is introduced to estimate the distances between nodes(including unweighted networks).The approximated distances between artificial roots are formulated by the geometric relation in Euclidean space.By calculating the betweenness of potential edges among communities,communities are fulfilled into a complete cluster tree.Comparison experiment on real-world networks shows that the RSC algorithm is significantly superior to the classical Girvan-Newman method,indicating that the RSC algorithm could be well extended to community detection applications.(3)The space-complexity of the RSC algorithm is further reduced for practical application.We propose a modified RSC algorithm based on scoring the importance of RNNs nodes,called the SRSC algorithm,which utilizes representatives in the sub-minimumspanning trees for aggregation.Considering that the extra memory cost is caused by the artificial roots,the SRSC algorithm formalizes the concepts in RSC by applying the graph model,and then,estimates the importance of RNNs nodes from the structure-based importance as well as the relative-position based importance which is realized by topological structure scoring and data boundary sampling,respectively.The importance assessment ensures that the cluster tree would get a more balanced structure,thus avoiding the cost of extra pruning,and makes SRSC a parameter-free algorithm.Extensive comparative tests on real-world data sets show that the accuracy of SRSC is superior to the state-ofthe-art hierarchical clustering,PERCH.Meanwhile,the complexity analysis shows that SRSC reduces the space-complexity O(log n)while maintaining the time-complexity to O(n log n).(4)A data paralleled hierarchical clustering algorithm is proposed,named Election Tree.Election Tree is designed to solve the problem of how to balance the time cost and the empirical performance in large-scale data clustering.The main idea of Election Tree is to perform aggregation on data segments,which regard the candidates(the root nodes of clusters)as new nodes to merge clusters and aggregate iteratively.Based on “Relative Position Factor”,the Election Tree algorithm performs Cross-Sampling to detect nodes on the data boundary,which makes the distribution of boundary points more uniform.Besides,the Election Tree algorithm also performs Swap to exchange leaves in the overlapping regions,which corrects the attribution error of leaves and improves the accuracy of the clustering result.Comparative experiments on the real-world data show that the performance of Election Tree is significantly better than the classical algorithms and the state-of-the-art algorithm PERCH,and the results are not sensitive to the parameters.The scalability tests on various types of artificial data show that the CPU Time growth rate of Election Tree is about 1.18,which reaches a quasi-linear level and is insensitive to data distribution.The Election Tree algorithm performs clustering in data segments that do not need shared memory,and there is no significant correlation between the size of segments and the clustering results,indicating that Election Tree will have a good application prospect in large-scale data clustering applications.In conclusion,this thesis puts forward a hierarchical clustering framework based on the reciprocal nearest neighbors.And many pieces of research based on this framework are also performed,including the application extension on community detection,the solution of high space-complexity problem,and the paralleled clustering for large-scale data.Extensive experiments verify the empirical performance and scalability of the proposed algorithms,indicating that the researches on this new branch of the hierarchical clustering provides a meaningful exploration for how to balance the empirical performance and scalability in the hierarchical clustering,and provides a new tool for structural analysis and data mining.
Keywords/Search Tags:machine learning, hierarchical clustering, reciprocal nearest neighbors, data mining, big data
PDF Full Text Request
Related items