
Research on Multiple Clustering Algorithms Based on Matrix Factorization

Posted on: 2020-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: X Wang
GTID: 2428330599956774
Subject: Computer application technology

Abstract/Summary:
Clustering is a classic unsupervised learning paradigm. Based on the similarity between samples, it groups similar samples into the same cluster and dissimilar samples into different clusters, which offers a simple way to analyze the intrinsic distribution of the data. It is therefore often applied in the data preprocessing stage. However, most traditional clustering algorithms produce only a single clustering. In many real-world applications, the same dataset can be partitioned in many ways, and each partition offers a different angle from which to understand the data. For example, fruits can be clustered by color or by species; in bioinformatics, proteins can be grouped by amino acid sequence or by 3D structure. Clustering from different angles yields different clusterings, each of which can reflect hidden structure in the data, so finding a variety of different clusterings of the same dataset is meaningful. Multiple clustering has therefore become a popular and challenging topic in cluster analysis in recent years.

Existing multiple clustering algorithms are broadly divided into unsupervised and semi-supervised approaches. Unsupervised multiple clustering algorithms mine multiple clusterings simultaneously by integrating redundancy control into a unified objective function. Their shortcomings are: (1) they cannot control the redundancy between clusterings very well; (2) as the number of clusterings increases, the final objective function becomes harder to optimize and the redundancy control becomes worse. Semi-supervised multiple clustering algorithms use known clusterings to constrain the generation of subsequent clusterings. Their disadvantages are: (1) they are easily affected by the known clustering results, so if the previous clusterings are poor, the subsequent ones suffer as well; (2) the independence of the feature space is not well controlled, which makes the resulting multiple clusterings less interpretable. In addition, existing multiple clustering algorithms cluster only the samples and ignore clusterings of the features, although clustering the sample and feature dimensions simultaneously (co-clustering) is also important and widely used.

To address these problems, and aiming to improve the accuracy and interpretability of multiple clustering and to extend its application scenarios, this thesis proposes two multiple clustering algorithms based on independent subspace analysis and matrix factorization. The main work of this thesis is as follows:

1. We propose a multiple clustering algorithm based on independent subspace analysis and nonnegative matrix factorization (MISC, Multiple Independent Subspace Clusterings). MISC first uses independent component analysis to partition the features into independent subspaces. To determine the number of subspaces, MISC applies the Minimum Description Length technique to encode the candidate subspace partitions and selects the partition with the minimum coding length. MISC then applies a simple clustering algorithm based on nonnegative matrix factorization to each independent subspace. To account for the manifold structure and nonlinear structure of the dataset, MISC integrates kernel techniques and manifold regularization terms into the nonnegative matrix factorization, and finally obtains multiple optimized subspace clusterings. Experimental results on both simulated and real data show that MISC not only partitions subspaces better, but also obtains more accurate multiple clusterings than other multiple clustering algorithms.

2. To integrate co-clustering structure into multiple clustering, this thesis proposes a multiple co-clusterings algorithm (MultiCC, Multiple Co-Clusterings) based on nonnegative matrix tri-factorization. A single nonnegative matrix tri-factorization yields a row-cluster indicator matrix and a column-cluster indicator matrix, i.e. one co-clustering. To obtain multiple co-clusterings and reduce their redundancy, MultiCC factorizes the original matrix multiple times and constructs two non-redundancy terms that enforce diversity among the row-clusters and among the column-clusters. These non-redundancy terms are integrated into the matrix factorization objective to guide the decomposition, so that MultiCC obtains multiple co-clusterings. The visualization results and the evaluation metrics on the real dataset and the gene expression datasets show that MultiCC not only obtains multiple single clusterings with less redundancy, but also mines multiple high-quality co-clusterings compared with the other algorithms in the comparison.

3. A multiple co-clusterings algorithm based on nonnegative matrix tri-factorization (MCC-SS, Multiple Co-Clusterings in Subspaces) is proposed, which mines multiple co-clusterings from subspaces of the data and improves on the MultiCC algorithm. It first applies a projection matrix to map the original data into a new subspace and then, borrowing from the MultiCC method, uses the column-cluster indicator matrix and the projection matrix to construct non-redundancy terms, which are integrated to guide the matrix factorization. Finally, MCC-SS obtains multiple co-clusterings in subspaces by simultaneously optimizing the projection matrix and the row-cluster and column-cluster indicator matrices. The results on multiple real datasets show that MCC-SS not only obtains multiple single clusterings with less redundancy, but also mines multiple co-clusterings in subspaces, compared with the other algorithms in the comparison.
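
As a rough illustration of the single co-clustering step that the abstract describes for MultiCC, the sketch below factorizes a nonnegative data matrix X into X ≈ F S Gᵀ and reads row-clusters from F and column-clusters from G. It is not the thesis's implementation: the non-redundancy terms, the repeated factorizations, and the MCC-SS projection matrix are all omitted. The function name `nmtf_cocluster`, its parameters, and the standard multiplicative update rules for the unconstrained squared-error objective are assumptions made for this example.

```python
import numpy as np


def nmtf_cocluster(X, n_row_clusters, n_col_clusters, n_iter=200, eps=1e-9, seed=0):
    """Plain nonnegative matrix tri-factorization, X ~= F @ S @ G.T.

    Hypothetical helper for illustration only. It minimizes
    ||X - F S G^T||_F^2 with multiplicative updates and has no
    non-redundancy or orthogonality terms.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random strictly positive initialization keeps the updates well defined.
    F = rng.random((n, n_row_clusters)) + eps
    S = rng.random((n_row_clusters, n_col_clusters)) + eps
    G = rng.random((m, n_col_clusters)) + eps

    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled by (negative gradient part) / (positive part).
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T) ** 2)

    # Hard assignments: each row/column goes to its largest indicator entry.
    row_labels = F.argmax(axis=1)
    col_labels = G.argmax(axis=1)
    return F, S, G, row_labels, col_labels, errors


if __name__ == "__main__":
    # Toy matrix with a 2 x 2 block structure (8 rows, 6 columns).
    blocks = np.array([[5.0, 0.1], [0.1, 5.0]])
    X = np.kron(blocks, np.ones((4, 3)))
    F, S, G, rows, cols, errors = nmtf_cocluster(X, 2, 2)
    print("row clusters:", rows)
    print("column clusters:", cols)
    print("final squared error:", errors[-1])
```

In a multiple co-clustering setting, one would run several such factorizations jointly and add penalty terms that push their indicator matrices apart; how those terms are defined is specific to the thesis and is not reproduced here.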
Keywords/Search Tags: Clustering, Co-clustering, Multiple clustering, Independent subspace, Nonnegative matrix factorization