
Research on Text Clustering Based on Latent Semantic Analysis and Self-Organizing Maps

Posted on: 2011-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhang
GTID: 2178330332978452
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the volume of information keeps growing, it grows ever faster, and its types and structures become increasingly complex. The main problem people face today is how to retrieve useful information efficiently and accurately from this large volume, whereas in the past the difficulty lay in obtaining useful information from scarce and scattered sources. Because it is flexible and can be automated, text clustering has become an important branch of data mining and has been applied in many fields. It groups text efficiently, helps reorganize and navigate massive text collections, and improves the efficiency and precision of querying and retrieval. Research on text clustering is therefore of both theoretical and practical importance.

After an in-depth study of the whole text clustering process, this thesis combines Latent Semantic Analysis (LSA), which handles semantic problems and reduces dimensionality, with Self-Organizing Maps (SOM), which offer self-organization, automatic processing, visualization, and good clustering results, to study the effectiveness of text clustering. The approach is validated on a sample corpus. The main work of this thesis is as follows.

First, text pre-processing is studied. This step directly and strongly affects the clustering results. It consists mainly of abstract extraction, part-of-speech tagging, part-of-speech filtering, stop-word filtering, and construction of the term-document space. The Document Object Model (DOM) is used to parse web pages and extract the abstract content. A rule-based approach is used for part-of-speech tagging, and regular expressions are used to discard irrelevant parts of speech and retain nouns, verbs, and adjectives.
Finally, the term-document space is constructed using the vector space model with term-frequency weights.

Second, latent semantic analysis is studied. It can deal effectively with semantic issues such as synonymy and polysemy, and it reduces the dimension of the term-document space. The mathematical principles of the LSA model and the details of singular value decomposition (SVD) are examined. After comparing several methods for weighting the term-document space, an appropriate one is chosen. An approximation of the original term-document space is then created using SVD.

Third, the SOM clustering algorithm is studied. Because the traditional SOM algorithm is ineffective in some situations, an alternative SOM algorithm is used in its place. The alternative divides the traditional training process into a rough training phase and a fine training phase, which makes fuller use of the mathematical computation and yields better results.

Finally, clustering experiments are carried out on the pre-processed corpus and the results are analyzed in depth. They show that latent semantic analysis improves clustering efficiency and deals effectively with semantic problems in natural language processing, and that the alternative SOM algorithm is more effective than the k-means method.
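
The term-frequency matrix and the SVD-based approximation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy corpus, the helper names `term_document_matrix`, `lsa_reduce`, and `cosine`, and the choice of k = 2 are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a term-by-document matrix of raw term frequencies."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            matrix[index[term], j] += 1.0
    return vocab, matrix

def lsa_reduce(matrix, k):
    """Rank-k LSA: return one k-dimensional vector per document."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values; each row of the result is a document.
    return (np.diag(s[:k]) @ vt[:k, :]).T

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus, assumed to be already tokenized and filtered.
docs = [
    ["car", "engine", "road"],
    ["automobile", "engine", "road"],
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
]
vocab, tdm = term_document_matrix(docs)
doc_vectors = lsa_reduce(tdm, k=2)

print(cosine(tdm[:, 0], tdm[:, 1]))                # similarity in raw term space
print(cosine(doc_vectors[0], doc_vectors[1]))      # similarity in the rank-2 space
```

In this toy corpus the first two documents differ only in "car" versus "automobile". Their cosine similarity is 2/3 in the raw term space and approximately 1 in the rank-2 space, which is the kind of synonym effect LSA is meant to capture.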
Keywords/Search Tags: text clustering, latent semantic analysis, singular value decomposition, self-organizing maps, part of speech tagging
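
The rough and fine training phases mentioned in the abstract can be sketched as below. The abstract does not specify the alternative SOM algorithm, so this is only one common two-phase schedule (a large neighbourhood and learning rate first, then small ones), not the thesis's method. The function names, map size, step counts, learning rates, and the 2-D toy data standing in for LSA document vectors are all assumptions, and NumPy is assumed to be available.

```python
import numpy as np

def train_phase(weights, grid, data, steps, lr0, radius0, rng):
    """One SOM training phase with linearly decaying learning rate and radius."""
    for t in range(steps):
        frac = 1.0 - t / steps
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        x = data[rng.integers(len(data))]
        # Best matching unit: the node whose weight vector is closest to x.
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighbourhood measured on the map grid, not in input space.
        grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * radius ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def train_som(data, rows=3, cols=3, seed=0):
    """Train a rows x cols map in two phases (assumed schedule)."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise the weights from randomly chosen samples.
    weights = data[rng.integers(len(data), size=rows * cols)].astype(float)
    # Rough training: few steps, high learning rate, wide neighbourhood.
    train_phase(weights, grid, data, steps=200, lr0=0.5,
                radius0=max(rows, cols) / 2.0, rng=rng)
    # Fine training: more steps, low learning rate, narrow neighbourhood.
    train_phase(weights, grid, data, steps=1000, lr0=0.05, radius0=1.0, rng=rng)
    return weights

def assign_clusters(weights, data):
    """Label each sample with the index of its best matching unit."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical 2-D data: two well-separated groups of points.
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
data = np.vstack([cluster_a, cluster_b])

weights = train_som(data)
labels = assign_clusters(weights, data)
print(labels)
```

The rough phase orders the map globally, and the fine phase then adjusts individual units with small updates. Samples from the two groups should end up on different map units.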