Font Size: a A A

Research On Spectral Clustering Methods For Large Scale Datasets

Posted on:2016-02-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:L M ChenFull Text:PDF
GTID:1318330518472883Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of science and technology,the human ability for collecting data improves greatly.Since every industry has accumulated or is accumulating numerous data,people have to analyse these data to get meaningful information.Good achievements on clustering analysis are important to research on data analysis and inference.Spectral clustering methods are perfect in all clustering analysis methods and are the research focus in the related clustering analysis fields.Spectral clustering methods can cluster data in arbitrary object spaces and convergence to the global optimal solution.Spectral clustering methods can discover the nonlinear characters of data in the low dimensional space,and spectral clustering methods can be used for dimensionality reduction of any object.Spectral clustering methods also perform the dual task of embedding these objects into Euclidean space,while performing the dimensionality reduction.Therefore,spectral clustering methods are extremely popular for clustering arbitrary objects.But the computational complexity of spectral clustering methods is so high that they can only be used for small datasets.The research on spectral clustering large scale datasets is rare.In recent years,the amount of data accumulated by every industry is very large.In order to get better clustering results for large scale datasets and make the data analysis more accurate and more effective,it is necessary to study spectral clustering methods for large scale datasets.Firstly,for the time and space complexity of Laplacian matrix eigen-decomposition being too high to spectral clustering large scale numerical datasets,and enlightened by the idea of candid covariance-free incremental principal component method,a fast method for computing the first k small eigenvectors of Laplacian matrix is proposed.Two symmetric positive semi-definite matrices are constructed,and the first k large eigenvectors of the two constructed matrices are equal to the first k small eigenvectors of Laplacian matrices.The first k small incremental eigenvectors can be computed by loop inputting each column vector of the one constructed matrix.The proposed method for eigen-decomposing Laplacian matrix only needs linear time and n order of magnitude space.The experiments also showed that the proposed method was effective and efficient.Secondly,for the sample amount being very large but the attribute space tending to be stationary for some large mixed type datasets in the course of time,a novel general clustering framework based on quasi-hyper-image mapping is proposed for clustering these large mixed type datasets efficiently and effectively.Each attribute of the dataset is quantized and the quantization number of each attribute is corresponding to each dimension size of the quasi-hyper-image.Then the large dataset is mapped on the quasi-hyper-image to get a small quasi-hyper-image pixel dataset by mapping equations and the similarity function between pixels is defined.Spectral clustering the small quasi-hyper-image pixel dataset can achieve a global distribution of the dataset.Mapping a dataset on a quasi-hyper-image is in linear time,spectral clustering speed is rapid because the scale of quasi-hyper-image is small,and the storage space is also small because of mapping,so the proposed method meets the demands for speed and storage.Experiments demonstrated that this proposed method was efficient and quite appropriate for clustering large mixed type dataset.Thirdly,for clustering heterogeneous information networks being important,a fast clustering algorithm based on an approximate commute time embedding for heterogeneous information networks with a star network schema is proposed by utilizing the sparsity of heterogeneous information networks.A heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view.Then the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver.All of the indicator subsets in each embedding simultaneously determine the target dataset.A general model is formulated by these indicator subsets,and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances for all indicators for an identical target object.The proposed fast algorithm,FctClus,is shown to be efficient and exhibits high clustering accuracy and fast computation speed based on a theoretic analysis and experimental verification.Finally,for partitioning dynamic heterogeneous information networks,a fast evolutionary clustering algorithm for heterogeneous information networks with a star schema is proposed by taking advantageous of the sparsity of heterogeneous information networks.The heterogeneous information network is transformed into multiple compatible bipartite graphs from the point of compatible view.Then each bipartite graph is temporal smoothed and these temporal smoothing bipartite graphs can contain enough relation between nodes at snapshot and relation between nodes in prior time.And then by sparsifying the temporal smoothing bipartite graph,the approximate commute time embedding of each temporal smoothing bipartite graph can be computed via random mapping and a linear time solver.The target dataset is simultaneously indicated by multiple embedding subsets.Last,the sum of weighted distances for all the indicators of each embedding for indicating an identical object is computed,and the cluster that the target objects belong to is computed by the sum of weighted distances.This proposed method is effective,and the clustering accuracy is higher than that of other methods by experimental verification.
Keywords/Search Tags:Spectral clustering, Large scale dataset, Eigen-decomposition, Mapping, Heterogeneous information network, Embedding, Temporal smoothing
PDF Full Text Request
Related items