Research On Text Clustering Based On Latent Semantic Analysis And Self-organizing Maps

Posted on:2011-04-06

Degree:Master

Type:Thesis

Country:China

Candidate:C L Zhang

Full Text:PDF

GTID:2178330332978452

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, the volume of information keeps growing, it grows ever faster, and its types and structures become ever more complicated. In the past, the difficulty was obtaining useful information from small, scattered sources; today, the main problem is how to access useful information efficiently and accurately within a very large body of information. Because it is flexible and can run automatically, text clustering has become an important branch of data mining and has been applied in many fields. Text clustering can group text information efficiently, reorganize and navigate massive text collections, and improve the efficiency and precision of querying and access. Text clustering research is therefore of great importance in both theory and application.

After a thorough study of the whole text clustering process, this thesis uses Latent Semantic Analysis (LSA), which handles semantic problems and reduces dimensionality, together with Self-Organizing Maps (SOM), which offer self-organization, automatic processing, visualization, and good clustering results, to explore the effectiveness of text clustering, and validates the approach on a sample corpus. In general, the main work of this thesis is as follows.

First, text pre-processing is studied. This process has a direct and substantial impact on the clustering results. It mainly includes extraction of the abstract content, part-of-speech tagging, part-of-speech filtering, stop-word filtering, and construction of the term-document space. The Document Object Model (DOM) is used to parse web pages and extract the abstract content. A rule-based approach is used for part-of-speech tagging, and regular expressions are used to exclude unrelated parts of speech and retain nouns, verbs, and adjectives. Finally, the term-document space is constructed using the vector space model with term-frequency weights.

Second, latent semantic analysis is studied. It can deal effectively with semantic issues (e.g., synonymy and polysemy) and reduce the dimension of the term-document space. The mathematical principles of the latent semantic analysis model and the details of singular value decomposition are studied. After comparing several methods for calculating the weights of the term-document space, an appropriate method is chosen. An approximation of the original term-document space is then created using singular value decomposition.

Third, the SOM clustering algorithm is studied. The traditional SOM clustering algorithm is ineffective in some situations, so an alternative SOM algorithm is used in its place. The alternative algorithm divides the training process of the traditional SOM into rough training and fine training, which makes full use of mathematical computation and achieves better results.

Finally, clustering tests are carried out on the pre-processed corpus and the results are analyzed in depth. The results show that latent semantic analysis improves the efficiency of clustering and deals effectively with the semantic problems of natural language processing, and that the alternative SOM algorithm is more effective than the k-means method.
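
The pipeline summarized in this abstract (term-frequency term-document space, rank reduction by singular value decomposition, then SOM training in a rough phase followed by a fine phase) can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the toy corpus, the function names, the rank k=2, the 1x2 map, the Gaussian neighborhood, and the learning rates and radii of the two phases. The abstract does not specify the alternative SOM algorithm, so the sketch uses the ordinary online SOM update and only mimics the rough/fine split.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a term-by-document matrix of raw term frequencies."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[index[w], j] += 1.0
    return A, vocab

def lsa_document_vectors(A, k):
    """Rank-k truncated SVD; returns one k-dimensional row per document."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # shape (n_docs, k)

def train_som(X, rows, cols, phases, seed=0):
    """Online SOM training; each phase is (epochs, learning_rate, radius)."""
    rng = np.random.default_rng(seed)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize unit weights from randomly chosen samples.
    W = X[rng.integers(0, len(X), size=rows * cols)].copy()
    for epochs, lr, radius in phases:
        for _ in range(epochs):
            for x in X[rng.permutation(len(X))]:
                bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best-matching unit
                d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)   # squared grid distance
                h = np.exp(-d2 / (2.0 * radius ** 2))        # Gaussian neighborhood
                W += lr * h[:, None] * (x - W)
    return W

def assign_clusters(X, W):
    """Label each sample with the index of its best-matching unit."""
    return np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in X])

# Hypothetical toy corpus: three documents about cars, three about fruit.
docs = [
    "car engine wheel road",
    "car road driver wheel",
    "automobile engine driver road",
    "fruit apple banana juice",
    "apple banana fruit salad",
    "juice fruit salad banana",
]
A, vocab = term_document_matrix(docs)
X = lsa_document_vectors(A, k=2)
# Rough phase: large learning rate and radius; fine phase: small ones.
W = train_som(X, rows=1, cols=2, phases=[(20, 0.5, 1.0), (50, 0.05, 0.3)])
labels = assign_clusters(X, W)
```

In this toy corpus the third document uses "automobile" where the first two use "car", yet it shares other terms with them, so the reduced space places all three close together. This is the kind of synonym handling the abstract attributes to LSA, shown here only on made-up data.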

Keywords/Search Tags:

text clustering, latent semantic analysis, singular value decomposition, self-organizing maps, part of speech tagging

Related items

1	Research Of Text Clustering Based On Self-Organizing Maps
2	Chinese Text Clustering Based On Latent Semantic And Its Applications
3	Research On Some Field Text Information Processing Based On Latent Semantic Analysis
4	Research On Text Clustering Algorithm Based On Latent Semantic Indexing
5	Research On Web Text Categorization Based On Latent Semantic Analysis
6	Research On Text Classification Method Based On Part Of Speech Tagging LDA Model
7	Research On Text Clustering Based On Self-Organizing Maps
8	Text Classification Based On Latent Semantic Indexing
9	Based On Latent Semantic Indexing, Text Classification And Research In Science And Technology Information Retrieval
10	Research On Text Document Information Hiding