Research on Clustering Methods for Data with a Large Number of Clusters

Posted on: 2022-10-05  Degree: Master  Type: Thesis
Country: China  Candidate: X R Shi
GTID: 2518306509470114  Subject: Computer application technology
Abstract/Summary:
With the explosion of unlabeled data in all walks of life, unsupervised learning has become an important task in artificial intelligence and data mining. One of its most important tasks is cluster analysis, which uncovers the underlying distribution of clusters in massive data and thereby provides valuable information. Researchers have proposed many effective clustering algorithms from different perspectives. However, as data volumes grow and application scenarios become more complex, recorded data often contains a very large number of clusters. One example is the Avibase database, which holds more than one million records on about 10,000 species and 22,000 subspecies of birds, covering distribution information, taxonomic information, names in different languages, and so on. As such data proliferates, super multi-cluster clustering challenges traditional clustering methods in both performance and efficiency. The performance problems are that the distribution of clusters is complicated, the amount of data in each cluster is unbalanced, and the data contains noise features that are useless or even misleading for cluster analysis. The efficiency problems are that the large data volume and high dimensionality make the data space extremely complex, so algorithms consume substantial computing resources and may even fail to complete the cluster analysis within an acceptable time. For super multi-cluster clustering tasks, the main research contents of this thesis are as follows:

(1) For cluster analysis of super multi-cluster data, a general clustering method based on multi-angle space structure representation is proposed. Each dimension of the input space describes only an original characteristic of each record and cannot directly reflect the overall distribution of the data, while the core challenge of the super multi-cluster clustering task is the complex structure of the clusters. It is therefore necessary to map the data to a new feature space that gives clearer and more accurate information about the cluster structure and facilitates the subsequent clustering process. In this thesis, the space structures of the data seen from different perspectives are unified and fused to provide that information for the subsequent cluster analysis. On this basis, a multi-angle space structure clustering framework is proposed. Experimental analysis on real data shows that this method can improve clustering performance on super multi-cluster data to a certain extent.

(2) Super multi-cluster data is often accompanied by an unbalanced distribution of clusters and a complex distribution structure, which challenges the clustering performance of the multi-angle space structure clustering algorithm. To address this challenge, this thesis proposes a performance improvement method for the multi-angle space structure representation clustering method. The complete sampling method integrates the cluster distributions of multiple sampling points to raise the probability that small clusters are captured by random sampling, so that the sampling results include all the original clusters as far as possible even when the clusters are highly imbalanced. The proposed feature-weighted sampling method alleviates the problem of features in the original data that are uninformative for cluster analysis or are noise. By examining the contribution of each data dimension to the clustering results, it assigns different sampling weights to the features, which effectively reduces the impact of low-quality features on cluster analysis and further improves the performance of the multi-angle space structure clustering method.

(3) Super multi-cluster data is often large in scale, which challenges the execution efficiency of the multi-angle space structure clustering algorithm. To address this challenge, this thesis proposes an acceleration method for the multi-angle space structure representation clustering algorithm. The Nyström method is introduced to approximate the space structure representation matrix of the data, which accelerates the computation of the space structure representation. The eigendecomposition of the space structure representation matrix is then accelerated on the basis of this approximate representation. The two most expensive steps of the multi-angle space structure clustering algorithm are thus both accelerated, and the running efficiency of the algorithm on super multi-cluster data is effectively improved. Experimental results show that the proposed algorithm greatly reduces the running time of the multi-angle space structure clustering algorithm on large-scale data clustering tasks.

The above research provides ideas for solving the problems that super multi-cluster data causes in current clustering tasks, a new algorithmic framework for cluster analysis, and a new strategy for cluster analysis of super multi-cluster data.
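
As a rough illustration of the Nyström idea mentioned in (3), the sketch below approximates an affinity matrix from a random subset of landmark points and derives a low-dimensional embedding from the eigendecomposition of the small landmark block only. It is a generic textbook-style sketch, not the thesis's algorithm: the RBF affinity, uniform landmark sampling, and the names `rbf_kernel`, `nystrom_embedding`, `n_landmarks`, `n_components`, and `gamma` are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """RBF affinity between every row of A and every row of B (assumed affinity)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_embedding(X, n_landmarks, n_components, gamma=1.0, seed=0):
    """Return Z such that Z @ Z.T approximates the full n x n affinity matrix.

    Only an n x m block C and an m x m block W are ever formed, where m is
    the number of landmarks, so the full matrix is never built or decomposed.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical choice: landmarks are drawn uniformly without replacement.
    idx = rng.choice(X.shape[0], size=n_landmarks, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # affinities: all points vs. landmarks
    W = C[idx]                         # affinities: landmarks vs. landmarks

    # Eigendecomposition of the small symmetric block only.
    evals, evecs = np.linalg.eigh(W)
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]

    # Drop numerically zero directions before inverting the eigenvalues.
    keep = evals > 1e-10
    evals, evecs = evals[keep], evecs[:, keep]

    # Z Z^T = C V diag(1/evals) V^T C^T, the rank-truncated Nystrom approximation.
    return C @ evecs / np.sqrt(evals)
```

The rows of the returned matrix could then be handed to an ordinary clustering routine such as k-means. How the thesis actually builds and fuses its space structure representations is not specified in this abstract, so that step is left out here.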
Keywords/Search Tags: Machine Learning, Cluster Analysis, Super Multi-Cluster, Representation of Space Structure