Chinese Text Clustering Based On Latent Semantic And Its Applications

Posted on:2009-05-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y Jian

Full Text:PDF

GTID:2178360308979272

Subject:Computer application technology

Abstract/Summary:

With the development of Internet technology, the amount of text stored in digital form has grown explosively. This information needs to be organized efficiently so that users can access it conveniently, and text clustering was proposed to meet that need. Traditional text clustering uses the Vector Space Model (VSM) to derive a structured term-document matrix from semi-structured text and then clusters the documents based on that matrix. VSM matches terms and documents only by their literal form, so linguistic phenomena such as synonymy and polysemy lower the quality of the resulting clusters. To address this problem, much research has focused on latent semantic analysis.

Latent Semantic Analysis (LSA) can be regarded as an extension of the vector space model. LSA uses VSM to produce a term-document matrix that represents the texts in the data set, applies Truncated Singular Value Decomposition (TSVD) to build a low-dimensional latent semantic space, and then uses the k-means algorithm to cluster the texts in that space. This thesis mainly discusses the clustering performance of LSA on Chinese text and analyzes the factors that influence it. While filtering out noise, TSVD also drops some features of minor classes. To reduce the resulting neglect of minor-class documents, an improved latent semantic analysis model based on term replacement is proposed.

Because of its low time and space complexity, k-means is a common choice for text clustering. However, it has several limitations: the initial cluster centres are chosen at random; it is not suited to clusters that differ greatly in shape; and it is overly sensitive to noise and outliers. To overcome these deficiencies, the thesis maps texts to data points, improves the k-means algorithm with a model of the interaction forces among molecules, and uses a cloud model to identify outliers.

Finally, the thesis uses the improved LSA model to build a multilayer text clustering model based on users' potential interests. The experimental results show that the improved LSA model handles the synonym and polysemy problems better and clearly avoids neglecting minor-class documents; that the improved k-means algorithm produces much better clustering results and increases the efficiency of text processing; and that the clustering model based on users' potential interests provides a better text clustering service.
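
As a rough illustration of the baseline LSA pipeline the abstract describes (term-document matrix, truncated SVD, then k-means in the latent space), the following Python sketch uses NumPy only. It rests on several assumptions not taken from the thesis: documents are already segmented into tokens, the matrix holds raw term counts rather than any particular weighting, and k-means is the plain version with random initial centres. It does not implement the improved LSA model or the improved k-means algorithm, and the function names and the toy corpus are hypothetical.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw term-count matrix (terms x documents) from tokenized docs."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[index[t], j] += 1.0
    return A, vocab

def lsa_project(A, k):
    """Truncated SVD: keep the k largest singular triplets of A.

    Returns one row per document, with coordinates scaled by the
    singular values (that is, (S_k V_k^T)^T).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's k-means with randomly chosen initial centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, then nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new_centres = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

# Hypothetical toy corpus: pre-segmented documents on two topics.
docs = [
    ["电脑", "软件", "程序"],
    ["计算机", "软件", "程序"],
    ["电脑", "计算机", "硬件"],
    ["足球", "比赛", "球员"],
    ["篮球", "比赛", "球员"],
    ["足球", "篮球", "联赛"],
]

A, vocab = term_document_matrix(docs)   # terms x documents
Z = lsa_project(A, k=2)                 # documents x latent dimensions
labels = kmeans(Z, n_clusters=2)
print(labels)
```

The number of retained dimensions `k` and the number of clusters are set by hand here; on real data both would need tuning, and Chinese text would first need a word segmentation step that this sketch leaves out.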

Keywords/Search Tags:

text clustering, latent semantic analysis, singular value decomposition, k-means, data field, cloud model

Related items

1	Research On Some Field Text Information Processing Based On Latent Semantic Analysis
2	Research On Text Clustering Algorithm Based On Latent Semantic Indexing
3	Research On Text Clustering Based On Latent Semantic Analysis And Self-organizing Maps
4	The Application And Research Of Latent Semantic Analysis In The Field Of Internet Data Mining
5	Research On Web Text Categorization Based On Latent Semantic Analysis
6	Research On The Application Of Personalized Recommendation Based On Collaborative Filtering Algorithm
7	Research And Implementation Of K-means++ Algorithm Improvement And Search Application Based On Latent Semantics
8	Text Classification Based On Latent Semantic Indexing
9	Based On Latent Semantic Indexing, Text Classification And Research In Science And Technology Information Retrieval
10	Research On Text Summarization Based On Latent Semantic Analysis