
Chinese Text Clustering Based On Latent Semantic And Its Applications

Posted on: 2009-05-29 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Jian | Full Text: PDF
GTID: 2178360308979272 | Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the amount of text stored in digital form has grown explosively. This information needs to be organized efficiently so that users can access it conveniently, and text clustering was proposed to meet that need. Traditional text clustering uses the VSM (Vector Space Model) to derive a structured term-document matrix from semi-structured text, and then clusters the documents based on that matrix. VSM matches terms and documents only by their literal surface form, so ambiguities in natural language, such as synonymy and polysemy, lower the quality of VSM-based clustering. To address this problem, much research has focused on latent semantic analysis.

Latent Semantic Analysis (LSA) can be regarded as an extension of the vector space model. LSA uses VSM to produce a term-document matrix that represents the texts of a data set, applies TSVD (Truncated Singular Value Decomposition) to build a low-dimensional latent semantic space, and then uses the k-means algorithm to cluster the texts in that space. This thesis mainly examines how well LSA-based clustering works on Chinese text and analyzes the factors that influence it. While filtering noise, TSVD also discards some features of minor classes. To reduce the tendency to ignore minor-class documents, an improved latent semantic analysis model based on term replacement is proposed.

Because of its low time and space complexity, k-means is the most commonly used text clustering algorithm. However, it has several limitations: the initial cluster centres are chosen at random; it is unsuitable when clusters differ greatly in shape; and it is overly sensitive to noise and outliers. To address these deficiencies, the thesis maps each text to a data point, improves the k-means algorithm with a model of the interaction forces among molecules, and uses a cloud model to identify outliers.

Finally, the thesis uses the improved LSA model to propose a multilayer text clustering model based on the potential interests of users. The experimental results show that the improved LSA model handles the synonym and polysemy problems better and clearly reduces the tendency to ignore minor-class documents; that the improved k-means algorithm gives much better clustering results and increases the efficiency of text processing; and that the text clustering model based on users' potential interests provides a better text clustering service.
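The baseline LSA pipeline described above (term-document matrix, truncated SVD, then k-means in the latent space) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes the documents have already been segmented into tokens (Chinese text would first need a word segmenter), uses raw term counts rather than any particular weighting scheme, and runs plain k-means rather than the improved variant. The function names and the toy documents are hypothetical.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw term-count matrix (terms x documents) from tokenized docs."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[index[t], j] += 1.0
    return A, vocab

def lsa_embed(A, k):
    """Map each document to a k-dimensional latent vector via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values; each row of the result is a document.
    return (np.diag(s[:k]) @ Vt[:k]).T

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain k-means with randomly chosen initial centres (fixed seed, an assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy, pre-tokenized documents on two topics (illustrative data only).
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "cat", "cat"],
    ["stock", "market", "trade"],
    ["market", "trade", "stock", "stock"],
]
A, vocab = term_document_matrix(docs)
X = lsa_embed(A, k=2)
labels = kmeans(X, n_clusters=2)
```

On this toy data the two pet documents end up in one cluster and the two finance documents in the other. The random choice of initial centres in `kmeans` is exactly the limitation the abstract points out for the standard algorithm.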
Keywords/Search Tags: text clustering, latent semantic analysis, singular value decomposition, k-means, data field, cloud model