Font Size: a A A

Research Of Key Technologies On Document Cluster Ensemble

Posted on:2011-02-03Degree:DoctorType:Dissertation
Country:ChinaCandidate:S XuFull Text:PDF
GTID:1118330332460507Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
As one of the most important research topics in data mining and pattern recognition, clustering analysis has been widely used in areas such as data compressing, information retrieval, speech recognition, characters recognition, image segmentation, document clustering, etc. Besides, it has been attracted more and more attention in biology, geology, geography, marketing and abnormal data detection. Thousands of clustering algorithms exist, yet no one can successfully recognize clusters with different sizes, different shapes, and different densities or even with noises. Many of them are not applicable for document clustering due to the high dimensionality and sparseness of documents; furthermore, the magnanimity of document sets imposes high efficiency on clustering algorithms. As an important extension to convenient clustering algorithms, cluster ensemble technique has become a hotspot in machine learning area because it has many advantages that convenient clustering algorithms do not have. Taking document clustering as application background, this dissertation studies the key problems in document cluster ensemble, and obtains the following innovative contributions:(1) Spectral clustering algorithms which are based on matrix perturbation theory and spectral graph theory, respectively, are brought into document cluster ensemble problem. To make the algorithms extensible to large scale applications, the eigenvalue decomposition problem of large scaled matrixes is avoided by solving the eigenvalue decomposition of small matrixes, and thus computational complexity of the algorithms is effectively reduced. Experiments on real-world document sets show that (a) the algebraic transformation is feasible; (b) both of the proposed ensemble algorithms based on spectral cluatering algorithms outperform other common cluster ensemble techniques based on graph partitioning, and they can effectively solve document cluster ensemble problem.(2) The key idea of spectral clustering algorithms is introduced into solving document cluster ensemble problem. Beginning with seeking the"best"subspace, the low dimensional embeddings of documents and hyperedges are attained simultaneously, whereupon two simple and fast algorithms, SSICA and SSDCA, are proposed. Experiments on real-world document sets show that the proposed algorithms outperform other common methods and SSICA is a little better than SSDCA. SSICA is further extended and a low dimensional embedding method is proposed. Firstly, spectral clustering algorithms are performed to obtain the low dimensional embeddings of hyperedges. Then the low dimensional embeddings of documents are attained indirectly by composition of mappings and finally K-means algorithm is performed to cluster the documents according to their coordinates in the low dimensional space. Experiments on real-world document sets show that the proposed method outperforms other common cluster ensemble technologies based on graph partitioning algorithms and can also effectively solve document cluster ensemble problem.(3) Non-negative Matrix Factorization (NMF) is brought forth into document cluster ensemble problem and BNMF is proposed. Since NMF converges slowly and tends to converge to poor solution, NMFK is proposed which use K-Means to initialize NMF. Furthermore, in order to solve the unstable problem due to the random initialization of K-Means, NMFKMMP is proposed which uses minimum and maximum principle to attain the initial centroids. Experimental results show that: (a) using K-Means to initialize NMF is effective because NMFK can get better and more stable results than BNMF, and it is more efficient than BNMF; (b) NMFKMMP can effectively solve document cluster ensemble problem because it outperforms other cluster ensemble technologies and is very efficient.(4) To further boost the performance of cluster ensemble algorithm, the key idea of CHAMELEON, Divide and Merge (DM) strategy, is introduced to reinforce the ability of spherical K-means algorithm to discover clusters with irregular shapes. In ensemble member generation phase, SKM was first performed to obtain multiple document sub-clusters for r times, each time using Ward, an agglomerative hierarchical method, to merge these sub-clusters according to their similarity, and attained an ensemble of r clusterings. Then the cluster ensemble algorithms are performed to ensemble the r clusterings. Experiments on real-world document sets show that DM strategy increased with different degrees the normalized mutual information (NMI) scores of the cluster ensemble algorithms based on hierarchical clustering method, spectral method, low dimensional embedding method and non-negative matrix factorization method except for the graph partitioning-based method. These results prove that DM strategy can effectively improve the performance of document cluster ensemble algorithms.
Keywords/Search Tags:clustering analysis, document cluster ensemble, algebraic transformation, low dimensional embedding, non-negative matrix factorization, divide and merge
PDF Full Text Request
Related items