Latent Semantic Retrieval Based On Document Clustering Analysis

Posted on:2014-02-28

Degree:Master

Type:Thesis

Country:China

Candidate:C J Wu

GTID:2248330398476030

Subject:Applied Mathematics

Abstract/Summary:

Traditional document retrieval relies on strict keyword matching: the keywords in a user's query are matched against the keywords of the texts stored in the system database, and relevant texts are retrieved according to the degree of match. In practice, this method is inadequate. First, because words are ambiguous, it is difficult to determine the specific meaning a keyword expresses in a given text and whether that meaning is consistent with the content being sought. Second, the same theme can be expressed with different keywords, so strict matching may miss many relevant texts. Latent Semantic Analysis (LSA) is an effective solution to the problems that lexical ambiguity causes for strict keyword matching. LSA assumes that the keywords of a text are connected through some underlying structure and that a set of keywords expresses the theme of the text. Combining mathematical and computational methods, LSA analyzes word frequencies over a large text corpus, maps keywords and texts into a term-document matrix A, and then applies singular value decomposition (SVD) to factor A into a term matrix, a document matrix, and a diagonal matrix that links the two. Therefore, even if a document's keywords do not literally match the keywords of the user's query, the query keywords can be projected into the semantic space; as long as a document shares the query's theme, it can be retrieved by comparing the cosine similarity between documents and keywords in that space.

In this paper, building on the background, basic principles, and applications of LSA, we discuss how to establish a literature chain structure starting from the keywords of a user query. The relationships in the literature chain depend on the extent to which different articles share the same keywords.
Clearly, the more keywords two documents share, the more relevant they are to each other. By searching the literature, we can take the union of the keywords of the retrieved documents to obtain more keywords, and take the intersection of the high-frequency keywords to obtain high-quality keywords. Searching repeatedly with these high-quality keywords then yields keyword sets at level 2, level 3, ..., level N, and of course more literature as well. However, because keywords alone lack semantic information, we cannot build the literature chain from level 1 to level N using only the high-quality keywords of each level or their corresponding documents. Instead, we use LSA to analyze and classify the keywords of levels 1 to N. We build a document frequency matrix, decompose it into three matrices by SVD, and truncate the document matrix to reduce its dimension; the reduced document matrix is then grouped into K clusters. According to user preferences, extended keywords are selected and their similarity to each cluster center is computed; the same similarity computation is then applied to the documents in the cluster whose center is most similar, and the literature chain is output in order of similarity.
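
The pipeline described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy term-document matrix, the function names (`lsa_fit`, `cosine`, `kmeans`), the use of NumPy, and the choice of plain k-means for the "K clusters" step are all assumptions made for illustration. Queries are projected as U_k^T q and documents as the rows of V_k S_k, which is one common convention among several.

```python
import numpy as np

def lsa_fit(A, k):
    """Rank-k truncated SVD of a term-document matrix A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs_k = Vt[:k, :].T * s_k          # one k-dimensional row per document
    return U_k, s_k, docs_k

def cosine(x, Y):
    """Cosine similarity between vector x and every row of Y."""
    Y = np.atleast_2d(Y)
    denom = np.linalg.norm(x) * np.linalg.norm(Y, axis=1)
    return (Y @ x) / np.where(denom == 0, 1.0, denom)

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain k-means (assumed here as the clustering step)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy corpus: rows are terms, columns are documents d0..d3.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 0],   # petal
], dtype=float)

U_k, s_k, docs_k = lsa_fit(A, k=2)

# Project the query "car" into the same reduced space as the documents.
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = q @ U_k
sims = cosine(q_k, docs_k)

# Cluster the reduced documents, pick the cluster whose center is most
# similar to the query, and order its members by similarity (the "chain").
labels, centers = kmeans(docs_k, n_clusters=2)
best = int(np.argmax(cosine(q_k, centers)))
members = np.flatnonzero(labels == best)
chain = members[np.argsort(-sims[members])].tolist()

print("similarities:", np.round(sims, 3))
print("chain:", chain)
```

In this toy corpus, document d1 contains "automobile" and "engine" but not "car", yet it scores highly against the query "car" because both terms co-occur with "engine". This is the behavior that strict keyword matching cannot provide.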

Keywords/Search Tags:

latent semantic analysis, cluster analysis, dimension reduction, singular value decomposition, literature chain

Related items

1	Research On Text Summarization Based On Latent Semantic Analysis
2	Study Of Multi-WebPages Automatic Abstracting Based On Latent Semantic Analysis
3	Research On Chinese Concept Retrieval Based On Latent Semantic Analysis
4	The Application And Research Of Latent Semantic Analysis In The Field Of Internet Data Mining
5	The Intelligent Search Technology Based On Latent Semantic Analysis
6	Application Of PCA Dimensionality Reduction Method Based On Latent Variables In Text Classification Problems
7	Research On Some Field Text Information Processing Based On Latent Semantic Analysis
8	Research And Apply On Patient Record Text Mining Based On Latent Semantic Analysis
9	The Intelligent Retrieval System Based On Latent Semantic Analysis
10	Research And Implementation Of Incremental Dimensionality Reduction Methods For Big Data