
Research On Document Clustering Technology Based On Latent Semantic Indexing

Posted on: 2010-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Zheng
Full Text: PDF
GTID: 2178360272985262
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, large numbers of documents need to be organized effectively for tasks such as topic discovery and information retrieval. Document clustering, an important research topic in natural language processing, has emerged to meet these needs, and much progress has been made in this area. However, natural language phenomena such as synonymy and polysemy are widespread and hinder document clustering. This thesis uses Latent Semantic Indexing (LSI) to address these phenomena and improve the performance of document clustering.

The Singular Value Decomposition (SVD) used in LSI transforms the original term space into a smaller latent semantic space. During this transformation, terms with high document frequency introduce some unreasonable term transfer relations, which distort the similarity between terms and the similarity between documents in the document set. This thesis proposes a feature optimization technique for LSI that makes use of the transfer relations of terms within documents and between documents in the document set. The method can select among the transfer relations in the latent semantic space, and the experimental results show that it effectively improves the performance of LSI.

Regarding clustering algorithms, partition-based algorithms are sensitive to the initial points and prone to getting trapped in local optima. By analyzing the characteristics of the initial points, this thesis proposes a method based on the center of the minimal similarity sum function: K documents are selected as the initial points of the different categories in the document set such that the sum of the similarities among these K documents is the smallest. The method thus avoids splitting a category that contains a large number of documents into smaller categories, and avoids choosing border points as initial points. The experimental results show that it effectively shortens the iterative process and improves the performance of document clustering.

Finally, an information retrieval system based on LSI is implemented. Term transfer relations are selected for the initial retrieval results, and the results are then adjusted by the document clustering algorithm. The system is tested on part of the IR4QA corpus from the NTCIR-7 international evaluation, and the experimental results show that this method greatly improves information retrieval performance.
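
The abstract gives no formulas for these steps, so the following is only a rough sketch of the general idea, not the thesis's actual algorithm. It assumes a plain term-document count matrix, a truncated SVD for the LSI projection, cosine similarity between documents, and a greedy approximation to "K documents with the smallest similarity sum" (an exact minimization over all K-subsets would be combinatorial). The names `lsi_embed`, `pick_initial_centers`, and `spherical_kmeans` are hypothetical, and the term transfer relation selection is not modeled.

```python
import numpy as np


def lsi_embed(term_doc, rank):
    """Map documents into a rank-dimensional latent space via truncated SVD.

    term_doc has shape (n_terms, n_docs). Returns unit-length document
    vectors of shape (n_docs, rank).
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (s[:rank, None] * vt[:rank]).T
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    return docs / np.where(norms == 0, 1.0, norms)


def pick_initial_centers(docs, k):
    """Greedy heuristic (an assumption): pick k documents whose summed
    pairwise cosine similarity is small."""
    sim = docs @ docs.T
    masked = sim.copy()
    np.fill_diagonal(masked, np.inf)  # ignore self-similarity
    # Start from the least similar pair of documents.
    i, j = np.unravel_index(np.argmin(masked), masked.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < k:
        # Add the document least similar, in total, to those already chosen.
        total = sim[:, chosen].sum(axis=1)
        total[chosen] = np.inf
        chosen.append(int(np.argmin(total)))
    return chosen[:k]


def spherical_kmeans(docs, center_idx, iters=20):
    """Partition-based clustering on unit vectors, seeded with chosen documents."""
    centers = docs[center_idx].copy()
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(iters):
        # Assign each document to its most similar center.
        labels = np.argmax(docs @ centers.T, axis=1)
        for c in range(len(centers)):
            members = docs[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                norm = np.linalg.norm(mean)
                if norm > 0:
                    centers[c] = mean / norm
    return labels
```

With a term-document matrix `X`, a typical call sequence under these assumptions would be `docs = lsi_embed(X, rank)`, then `seeds = pick_initial_centers(docs, k)`, then `labels = spherical_kmeans(docs, seeds)`. Because the seeds are mutually dissimilar, they tend to fall in different categories rather than splitting one large category.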
Keywords/Search Tags: Document clustering, Latent Semantic Indexing, Term transfer relation, Selection of initial center points