Text Spectral Clustering Research Via Probabilistic Latent Semantic Analysis

Posted on: 2013-12-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
Full Text: PDF
GTID: 2248330362974191
Subject: Computer system architecture
Abstract/Summary:
At present, research on clustering algorithms is a hotspot in data mining, and clustering is widely used in search engines, scientific data detection, information filtering, web analysis, and image processing. As a relatively new clustering method, spectral clustering is easy to implement. It can process complex data types, converts the clustering problem into an algebraic one, discovers clusters of arbitrary shape, and can obtain a globally optimal solution. However, spectral clustering also has limitations. In general, the similarity matrix in spectral clustering is built on the vector space model (VSM), but the VSM neglects synonymy and polysemy, which leads to great redundancy. In addition, spectral clustering is very sensitive to the scale parameter of the Gaussian function, which makes its performance unstable.

To address these problems, this thesis first uses probabilistic latent semantic analysis (PLSA) to extract latent semantic information, which overcomes the lack of semantic description in the vector space model. It then uses the cosine method to compute the similarity matrix, which eliminates the impact of the scale parameter. Finally, the improved algorithm is applied to text clustering. The main work of this thesis is as follows:

① Analyzing the limitations of the current vector space model. On one hand, the traditional model neglects synonymy and polysemy, which leads to feature redundancy. On the other hand, because text features are high-dimensional, the vector space model consumes too much time in text preprocessing. To solve these problems, this thesis proposes a spectral clustering algorithm combined with probabilistic latent semantic analysis.

② Studying the background theory and methodology of spectral clustering algorithms, and summarizing in detail the general procedure of spectral clustering together with the construction of the similarity matrix.

③ Using the cosine method to compute the similarity matrix, which eliminates the impact of the scale parameter. In traditional spectral clustering, computing the similarity matrix requires an empirically initialized scale parameter in the Gaussian function, and this choice affects clustering performance. This thesis does not further study the optimization of the scale parameter and simply uses the cosine method instead.

Finally, this thesis performs text spectral clustering on the similarity matrix and evaluates the experimental results with the clustering accuracy index and normalized mutual information (NMI). For computing the similarity between texts in the semantic vector space, the proposed cosine method achieves better results and more stable performance than the original Gaussian function method. In summary, the experimental results demonstrate the feasibility of the method proposed in this thesis.
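The contrast between the Gaussian and cosine similarity matrices can be illustrated with a minimal sketch. It is not the thesis's implementation: the `docs` topic vectors are invented stand-ins for PLSA output, the normalized-affinity embedding with row normalization is one common spectral clustering variant assumed here, and the helper names and the small k-means routine are hypothetical.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); the result depends on sigma."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cosine_similarity(X):
    """W_ij = x_i . x_j / (||x_i|| ||x_j||); there is no scale parameter."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    return Xn @ Xn.T

def spectral_embedding(W, k):
    """Top-k eigenvectors of D^{-1/2} W D^{-1/2}, with each row scaled to unit length."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    U = vecs[:, -k:]                     # eigenvectors of the k largest eigenvalues
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

def kmeans(U, k, n_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Hypothetical document-topic mixtures (assumed, standing in for PLSA output).
docs = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.85, 0.05, 0.10],
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.85, 0.10],
])

W = cosine_similarity(docs)
labels = kmeans(spectral_embedding(W, k=2), k=2)
```

In this sketch, `gaussian_similarity` gives very different matrices for small and large `sigma`, while `cosine_similarity` is unchanged when the vectors are rescaled. This is the property the abstract relies on when it replaces the Gaussian function with the cosine method.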
Keywords/Search Tags: Clustering analysis, Spectral clustering, Probabilistic latent semantic analysis, Similarity matrix