Latent Semantic Retrieval Based On Document Clustering Analysis

Posted on: 2014-02-28  Degree: Master  Type: Thesis
Country: China  Candidate: C J Wu  Full Text: PDF
GTID: 2248330398476030  Subject: Applied Mathematics
Abstract/Summary:
Traditional document retrieval relies on strict keyword matching: the keywords in a user's query are matched against the keywords stored for each text in the system database, and relevant texts are returned according to the degree of match. In practice this method is inadequate. First, because words are ambiguous, it is difficult to determine whether the meaning a keyword carries in a given text is consistent with what the user is searching for. Second, the same topic can be expressed with different keywords, so strict matching may miss many relevant texts.

Latent Semantic Analysis (LSA) is an effective remedy for these problems of strict keyword matching. LSA assumes that the keywords of a text are connected through some underlying structure, and that a set of keywords together expresses the topic of the text. Combining mathematical and computational methods, LSA analyzes word frequencies over a large text corpus, maps keywords and texts into a term-document matrix A, and then applies singular value decomposition (SVD) to factor A into a term matrix, a document matrix, and a diagonal matrix that links the two. Therefore, even when a document does not contain the exact keywords of the user's query, the query keywords can be projected into the semantic space. As long as the query and the document share the same topic, the cosine similarity between the document and the keywords can be used to retrieve the relevant documents.

In this thesis, building on the background, basic principles, and applications of LSA, we discuss how to build a literature chain structure starting from the keywords of a user query. The links in the chain depend on how many keywords different articles share: the more keywords two documents have in common, the more closely related they are. By searching the literature, we take the union of the keywords of the retrieved documents to obtain more keywords, and intersect them with the high-frequency keywords to obtain high-quality keywords. Searching repeatedly with these high-quality keywords yields level 2, level 3, ..., level N keyword sets, and more documents at each level. However, because keywords alone lack semantics, the high-quality keywords of each level and their corresponding documents cannot be used directly to build the literature chain from level 1 to level N. Instead, LSA is needed to analyze and classify the level 1 to level N keywords semantically. We build a document frequency matrix, decompose it into three matrices by SVD, truncate the document matrix to reduce its dimension, and apply K clustering to the result. According to user preference, the similarity between the selected extended keywords and each cluster center is calculated; the same calculation is then performed on the documents of the clusters whose centers have high similarity, and the literature chain is output in order of similarity.
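The pipeline described above (term-document matrix, SVD, truncation, clustering, cosine ranking) can be illustrated with a minimal sketch. This is not the thesis's implementation. The toy corpus, the raw-count weighting, the choice k = 2, the reading of "K clustering" as k-means, the farthest-first initialization, and all function names (`term_vector`, `fold_in`, `kmeans`, `retrieve`) are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy corpus: documents 0-1 are about cars, 2-4 about fruit.
docs = [
    "car engine repair",
    "automobile engine repair",
    "apple banana fruit",
    "banana fruit juice",
    "apple fruit juice",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def term_vector(text):
    """Raw term counts over the vocabulary (unknown words are ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix A: rows are terms, columns are documents.
A = np.column_stack([term_vector(d) for d in docs])

# SVD: A = U diag(s) Vt. Keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed dimension of the latent semantic space
Uk, sk = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T * sk  # one row per document, equal to A.T @ Uk

def fold_in(text):
    """Project a query into the same k-dimensional space as doc_vecs."""
    return term_vector(text) @ Uk

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def kmeans(X, n_clusters, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def retrieve(query, n_clusters=2):
    """Pick the cluster whose center best matches the query, then rank
    that cluster's documents by cosine similarity to the query."""
    q = fold_in(query)
    labels, centers = kmeans(doc_vecs, n_clusters)
    best = max(range(n_clusters), key=lambda j: cosine(q, centers[j]))
    members = [i for i in range(len(docs)) if labels[i] == best]
    return sorted(members, key=lambda i: cosine(q, doc_vecs[i]), reverse=True)
```

In this sketch, calling `retrieve("car")` returns both car documents, including "automobile engine repair", which does not contain the word "car". That is the behavior the abstract attributes to the semantic space. Projecting the query as `q @ Uk` keeps it in the same coordinates as the truncated document matrix, since `A.T @ Uk` equals the rows of `doc_vecs`.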
Keywords/Search Tags: latent semantic analysis, cluster analysis, dimension reduction, singular value decomposition, literature chain