Font Size: a A A

Spectral Clustering Research With LSA On Text Clustering

Posted on:2011-12-13Degree:MasterType:Thesis
Country:ChinaCandidate:Z Q BaoFull Text:PDF
GTID:2178360308458722Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
The traditional clustering algorithms such as k-means and EM algorithm are based on a spherical sample space, when the sample space is not convex, the algorithm will stop at a local optimum. Derived from the spectral graph theory, the spectral clustering algorithm, can work on any spatial dataset without considering its type, and can convergence to the global optimal value. It can transform a complex clustering problem into a simple algebraic problem, which greatly simplifies the clustering model. Recent years, because of its good natures, spectral clustering has been paid more and more attentions.However, the spectral clustering algorithm itself also has many deficiencies. Among the deficiencies, the problem that similarity matrix, the basis for spectral clustering, involves many parameters is the most representative one. Current researches show that spectral clustering is very sensitive to similarity matrix. Therefore, how to construct a good similarity matrix becomes very important. But there are no current academic guiding principles to indirect how to construct a similarity matrix.Additionally, text preprocessing, as the most critical step for text clustering, Vector Space Model (VSM) is often adopted. Vector Space Model has many advantages, such as simple, easy to be followed by the mathematical treatment. However, the model has a fatal flaw that the dimension the text vector representing is high; synonyms and polysemy between the features caused a lot of redundancy. In response to this problem, Latent Semantic Analysis (LSA) using Singular Value Decomposition (SVD) to reduce the dimensionality of the space and to improve the relationship between vectors in the text vector space. Accompanied with that, the storage space and the processing time are also reduced.This paper improves spectral clustering with Latent Semantic Analysis (LSA).Using the advantages of Latent Semantic Analysis, a similarity matrix spectral clustering needs is constructed. This paper made the following researches:①Analyze two shortages of current text vector space model: Firstly, the high-dimensional vector space costs the computer a lot of time for data processing; Secondly, the assumption that text features are independent is difficult to meet; there are also redundancies in a large number of features. To solute these problems, we adapt the combination of Latent Semantic Analysis and Spectral Clustering. ②As for similarity matrix for spectral clustering, this paper does not specifically focus on many parameters which affect the construction of similarity matrix, but apply Latent Semantic Analysis to reconstruct the text in the semantic similarity space, which enhances the ability that the text vector expresses, and avoids similarity matrix, when the parameters are different, influencing the clustering algorithm effects, to the maximum extent.③This paper finds a new simplified technique of Latent Semantic Analysis, which could run faster and the effect is also very good. This result further indicates that the approach in this paper is correct and feasible.
Keywords/Search Tags:Clustering analysis, latent semantic analysis, spectral clustering, similarity matrix
PDF Full Text Request
Related items