
Research On Clustering Algorithms For Large-scale Complex Data

Posted on: 2020-12-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Zhao
Full Text: PDF
GTID: 1368330578472960
Subject: Computer application technology
Abstract/Summary:
As an important unsupervised machine learning method and a typical data mining technology, clustering analysis has attracted wide attention in both academia and industry. In recent years, researchers have developed a series of clustering models and algorithms to meet the needs of different application fields, and these methods have played an important role in data analysis for image processing, information retrieval, social networks and bioinformatics. However, with the rapid development and wide application of emerging technologies such as big data and the Internet of Things, a large amount of large-scale complex data has accumulated in many fields, such as social activities, scientific research and the mobile Internet. These data sets are characterized by large sample sizes, high-dimensional features, mixed feature representations and complex internal structure, which pose serious challenges for clustering analysis in terms of models, algorithms and applications. Therefore, how to mine the implicit cluster structures from large-scale complex data has become a challenging research topic. Targeting these characteristics of large-scale complex data, i.e., large scale, high dimensionality, mixed types and complex structure, this dissertation systematically studies clustering analysis models and methods using sampling, subspace clustering, clustering ensemble, graph compression and other techniques. Specifically, the main research contents and achievements are as follows:

(1) To address the low computational efficiency of clustering algorithms on large-scale data, a stratified sampling-based clustering algorithm framework is proposed. Unlike most other sampling-based clustering algorithms, the proposed framework takes the distribution information of the data set into account during sampling: a stratum that contains a large number of data objects or has a large variance should contribute more sampled objects to represent the original data. This difference is conducive to producing more representative sample subsets and better partial clustering results. Extensive experiments verify the effectiveness and efficiency of the proposed framework.

(2) To improve the effectiveness of clustering high-dimensional mixed data, a soft subspace clustering algorithm for high-dimensional mixed data is proposed. First, to measure the difference between objects and clusters more accurately and objectively, an extended Euclidean distance for mixed data is designed. Second, by fusing different types of information entropy, uncertainty is measured both between clusters and within clusters. Based on this, a feature weighting method for each cluster is given. The effectiveness of the proposed method is verified on real data.

(3) To address the quality of base clusterings and the diversity among them in clustering ensembles, a sequential base clustering generation algorithm for mixed data based on information entropy is proposed. The algorithm establishes a unified clustering validity criterion, using differential entropy for numerical data and complementary entropy for categorical data. Based on this criterion and normalized mutual information, high-quality and diverse base clusterings can be generated effectively. A series of experiments verify the effectiveness of the proposed algorithm.

(4) To account for the differing contributions of base clusterings in the clustering ensemble process, a clustering ensemble selection algorithm for categorical data is proposed. The algorithm measures the quality of base clustering members using five internal validity indices and the difference between them using normalized mutual information. More accurate clustering results are obtained by iteratively selecting high-quality, high-diversity base clusterings for the ensemble. The effectiveness and robustness of the proposed algorithm are verified on several real data sets.

(5) To improve the computational efficiency of clustering complex network data, a large-scale social network clustering algorithm based on graph compression is proposed. Exploiting the nature of social networks, a compressed graph is first obtained by iteratively merging vertices of degree 1 and 2 into their higher-degree neighbours. Then two indices, the density and the quality of a vertex, are defined to evaluate how likely the vertex is to be a clustering center. By considering these two measures together, the initial clustering centers and the number of clusters are determined simultaneously on the compressed social network. After the clustering structure is obtained on the compressed social network by center expansion, the clustering results are propagated to the original social network. Extensive experiments on various social networks demonstrate the superiority of the proposed algorithm over several existing state-of-the-art clustering algorithms. The algorithm is also applied to a social recommendation algorithm, where it effectively improves computational efficiency.

The research results of this dissertation not only enrich the study of clustering analysis but also provide technical support for data analysis in fields such as social networks and bioinformatics.
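
The graph compression step in contribution (5) can be illustrated with a minimal Python sketch. The abstract only states that vertices of degree 1 and 2 are iteratively merged into higher-degree neighbours, so everything else here is an assumption made for illustration: the function name `compress_graph`, the requirement that the neighbour's degree be strictly higher, the choice of the highest-degree neighbour as the merge target, the tie-breaking by vertex label, and the processing order. It is not the dissertation's implementation.

```python
def compress_graph(edges):
    """Sketch: merge degree-1 and degree-2 vertices into a higher-degree neighbour.

    Returns (compressed adjacency map, map from each original vertex to the
    surviving vertex that represents it). Vertex labels are assumed sortable.
    """
    # Build an undirected adjacency map, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    parent = {v: v for v in adj}  # merge history: vertex -> vertex it was merged into

    def find(v):
        # Follow the merge history to the surviving representative.
        while parent[v] != v:
            v = parent[v]
        return v

    changed = True
    while changed:  # repeat until no vertex can be merged any more
        changed = False
        for v in sorted(adj):  # fixed order (assumption) for reproducibility
            if v not in adj:
                continue  # already merged earlier in this pass
            deg = len(adj[v])
            if deg not in (1, 2):
                continue
            # Assumption: merge into the highest-degree neighbour, and only
            # if that neighbour's degree is strictly higher than v's.
            target = max(adj[v], key=lambda n: (len(adj[n]), n))
            if len(adj[target]) <= deg:
                continue
            # Rewire v's other neighbours to the target, then drop v.
            for n in adj[v]:
                adj[n].discard(v)
                if n != target:
                    adj[n].add(target)
                    adj[target].add(n)
            del adj[v]
            parent[v] = target
            changed = True

    representative = {v: find(v) for v in parent}
    return adj, representative
```

Clustering would then run on the returned compressed graph, and the `representative` map gives one simple way to propagate cluster labels back to the original vertices: each merged vertex takes the label of the vertex that represents it.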
Keywords/Search Tags:Large-scale Data, Clustering Validity, Information Entropy, Clustering Ensemble, Subspace Clustering, Recommendation Algorithm
Related items