
Research On Heterogeneous Data Clustering Algorithm

Posted on: 2016-08-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X X Yang
Full Text: PDF
GTID: 1318330542975984
Subject: Computer application technology
Abstract/Summary:
Clustering is one of the most basic problems in pattern analysis. Its task is to group data objects that have similar properties or are strongly related into the same cluster. Cluster analysis helps us understand the knowledge hidden in a data set. Traditional clustering algorithms focus on homogeneous data sets, whose properties are described by a single feature space or a single relation. However, the rapid progress of information technology, especially Internet technology, has produced many heterogeneous data sets, whose properties are described by multiple feature spaces or multiple relations. To exploit this heterogeneous information efficiently when detecting the cluster structure hidden in such data, heterogeneous data co-clustering methods have been proposed and have attracted much attention from the academic community. The main research content and contributions of this dissertation are as follows:

1. To detect the hierarchical cluster patterns hidden in star-structured high-order heterogeneous data sets, we develop a high-order hierarchical co-clustering algorithm (HHCC). Goodman-Kruskal τ, an index of association between categorical variables, is used to measure the association between objects and features in each space. Strongly related objects are partitioned into the same object clusters, and simultaneously strongly related features are partitioned into the same feature clusters. Goodman-Kruskal τ is also used to evaluate the quality of the clustering results: the larger its value, the better the quality. A local search approach is used to optimize Goodman-Kruskal τ, and the number of clusters is determined automatically during the optimization process. A top-down splitting strategy is adopted, in which each cluster is split into the sub-clusters that maximize Goodman-Kruskal τ. A tree-like hierarchical cluster structure of the high-order heterogeneous data is finally obtained.

2. Existing algorithms focus on unsupervised learning. In real-world applications, however, some background prior knowledge can easily be obtained, and it has been demonstrated that such knowledge can effectively improve clustering performance. Furthermore, to mine overlapping cluster structures efficiently, a semi-supervised fuzzy co-clustering algorithm for high-order heterogeneous data (SS-HHFC) is proposed. To describe the clustering results of data objects in overlapping clusters, SS-HHFC introduces the fuzzy concept, using degrees of membership to describe the relation between data objects and clusters. Competitive agglomeration is used to measure the strength of the relationship between heterogeneous data clusters. Because the task of high-order co-clustering is to assign strongly correlated objects to the same cluster, competitive agglomeration can also be used to evaluate the quality of co-clustering. The heterogeneous data clustering problem is then formulated as maximizing a competitive agglomeration cost function that takes the background prior knowledge into account. To solve this optimization problem, the update rules for the fuzzy memberships are derived and the computational procedure of SS-HHFC is designed. The convergence of SS-HHFC is proved theoretically and verified experimentally.

3. Heterogeneous data sets always contain noise and outliers. To counteract the adverse effect of noise and to discover outliers, we develop a heterogeneous data co-clustering algorithm based on weighted nonnegative matrix factorization (WNMF-HCC). WNMF-HCC benefits from the interactions among different types of data objects and iteratively embeds each type of data object into a low-dimensional space. Based on the contribution of each data object to the objective function, WNMF-HCC evaluates the weights of all data objects, assigning smaller weights to noise and outliers. Using these weights, the adverse effect of noise is counteracted and outliers are discovered. The convergence of WNMF-HCC is proved theoretically and verified experimentally.

4. The noise contained in a multi-view data set influences the clustering results. To improve the robustness of clustering to noise, a robust multi-view clustering algorithm based on possibilistic C-means (PCM-RMVC) is proposed. PCM-RMVC abandons the constraint that the membership degrees of an object sum to one, so the membership degrees of a noise object to all clusters are small. The adverse influence of noise is thereby weakened and the robustness to noise is improved. To incorporate the information in multiple feature spaces simultaneously, PCM-RMVC minimizes the weighted combination of the distances between objects and cluster prototypes in each view space. The update rules for the fuzzy memberships and for the weight of each view are derived, and the computational procedure of PCM-RMVC is designed. The convergence of PCM-RMVC is proved theoretically and verified experimentally.

Finally, the work of the dissertation is summarized and directions for further research are put forward.
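As an illustration of the Goodman-Kruskal association index that HHCC uses, the following minimal Python sketch computes the textbook Goodman-Kruskal τ of a contingency table. It is not taken from the dissertation. The function name `goodman_kruskal_tau` and the reading of the table (rows as object clusters, columns as feature clusters, cells as co-occurrence counts) are assumptions made for this example, and the abstract does not say how HHCC builds its tables or which direction of τ it optimizes.

```python
def goodman_kruskal_tau(table):
    """Textbook Goodman-Kruskal tau for predicting the column category
    from the row category of a contingency table (list of count rows).

    Hypothetical helper for illustration only; not the dissertation's code.
    """
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("contingency table is empty")

    # Error of guessing the column category from its marginal distribution.
    col_totals = [sum(col) for col in zip(*table)]
    e_marginal = 1.0 - sum((c / n) ** 2 for c in col_totals)
    if e_marginal == 0.0:
        raise ValueError("column variable is constant; tau is undefined")

    # Error of guessing the column category once the row category is known.
    explained = 0.0
    for row in table:
        row_total = sum(row)
        if row_total == 0:
            continue  # an empty row contributes nothing
        for cell in row:
            explained += cell * cell / (row_total * n)
    e_conditional = 1.0 - explained

    # Proportional reduction in prediction error.
    return (e_marginal - e_conditional) / e_marginal


if __name__ == "__main__":
    # Rows: two assumed object clusters; columns: two assumed feature clusters.
    print(goodman_kruskal_tau([[5, 0], [0, 5]]))  # clusters perfectly aligned
    print(goodman_kruskal_tau([[2, 2], [2, 2]]))  # clusters unrelated
```

Under this definition τ ranges from 0 (knowing the row cluster does not help predict the column cluster) to 1 (the row cluster determines the column cluster), which matches the statement above that a larger value indicates a better clustering result.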
Keywords/Search Tags:Heterogeneous data, Co-clustering, Multi-view clustering, Hierarchical clustering, Nonnegative matrix factorization, Fuzzy clustering, Robustness