Spectral Clustering Research With LSA On Text Clustering

Posted on:2011-12-13

Degree:Master

Type:Thesis

Country:China

Candidate:Z Q Bao

Full Text:PDF

GTID:2178360308458722

Subject:Computer system architecture

Abstract/Summary:

PDF Full Text Request

The traditional clustering algorithms such as k-means and EM algorithm are based on a spherical sample space, when the sample space is not convex, the algorithm will stop at a local optimum. Derived from the spectral graph theory, the spectral clustering algorithm, can work on any spatial dataset without considering its type, and can convergence to the global optimal value. It can transform a complex clustering problem into a simple algebraic problem, which greatly simplifies the clustering model. Recent years, because of its good natures, spectral clustering has been paid more and more attentions.However, the spectral clustering algorithm itself also has many deficiencies. Among the deficiencies, the problem that similarity matrix, the basis for spectral clustering, involves many parameters is the most representative one. Current researches show that spectral clustering is very sensitive to similarity matrix. Therefore, how to construct a good similarity matrix becomes very important. But there are no current academic guiding principles to indirect how to construct a similarity matrix.Additionally, text preprocessing, as the most critical step for text clustering, Vector Space Model (VSM) is often adopted. Vector Space Model has many advantages, such as simple, easy to be followed by the mathematical treatment. However, the model has a fatal flaw that the dimension the text vector representing is high; synonyms and polysemy between the features caused a lot of redundancy. In response to this problem, Latent Semantic Analysis (LSA) using Singular Value Decomposition (SVD) to reduce the dimensionality of the space and to improve the relationship between vectors in the text vector space. Accompanied with that, the storage space and the processing time are also reduced.This paper improves spectral clustering with Latent Semantic Analysis (LSA).Using the advantages of Latent Semantic Analysis, a similarity matrix spectral clustering needs is constructed. This paper made the following researches:①Analyze two shortages of current text vector space model: Firstly, the high-dimensional vector space costs the computer a lot of time for data processing; Secondly, the assumption that text features are independent is difficult to meet; there are also redundancies in a large number of features. To solute these problems, we adapt the combination of Latent Semantic Analysis and Spectral Clustering. ②As for similarity matrix for spectral clustering, this paper does not specifically focus on many parameters which affect the construction of similarity matrix, but apply Latent Semantic Analysis to reconstruct the text in the semantic similarity space, which enhances the ability that the text vector expresses, and avoids similarity matrix, when the parameters are different, influencing the clustering algorithm effects, to the maximum extent.③This paper finds a new simplified technique of Latent Semantic Analysis, which could run faster and the effect is also very good. This result further indicates that the approach in this paper is correct and feasible.

Keywords/Search Tags:

Clustering analysis, latent semantic analysis, spectral clustering, similarity matrix

PDF Full Text Request

Related items

1	Text Spectral Clustering Research Via Probabilistic Latent Semantic Analysis
2	Clustering Algorithm Based On Semantic Similarity Analysis Of The Soft Component
3	Research And Application Of Spectral Clustering
4	Research And Application Of Spectral Clustering Based On Density Adaptive Neighborhood
5	Finding Web Services Based On Clustering Probabilistic Semantic Approach
6	Latent Semantic Retrieval Based On Document Clustering Analysis
7	Chinese Text Clustering Based On Latent Semantic And Its Applications
8	Microblog Hotspot Detection Based On Semantic Analysis And Two-step Clustering
9	Study And Implementation On Latent Semantic Space Analysis And Web Document Clustering Based On LDA
10	Research On Text Clustering Based On Latent Semantic Analysis And Self-organizing Maps