
Research On Chinese Concept Retrieval Based On Latent Semantic Analysis

Posted on: 2006-12-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y F Liu    Full Text: PDF
GTID: 1118360182969935    Subject: Systems Engineering
Abstract/Summary:
Most information on the Internet is text. The explosive growth of textual information poses a great challenge to information retrieval, making it increasingly difficult to find useful information rapidly and accurately. Natural language is inherently uncertain: synonymy and polysemy mean that the same concept can be expressed in many different ways, and users often cannot state their need precisely, a situation described as the Anomalous State of Knowledge. Traditional keyword-based retrieval matches the explicit surface forms of words rather than the concepts they express, and it is not easy for users to say what they really want with a keyword or a chain of keywords. Much work has therefore been devoted to retrieval based on concepts (semantics) rather than keyword matching, so that users' requests are handled at the conceptual level.

Latent Semantic Analysis (LSA), a statistical model of natural language, is known as a method for knowledge acquisition, induction and representation. Compared with other retrieval models, such as those based on concept libraries or concept networks, the LSA-based retrieval model is easy to compute and requires little human intervention. The latent semantic space is constructed by truncated singular value decomposition: terms and documents are projected onto dimensions that represent latent concepts, and the semantic relationships among terms are abstracted to reveal the semantic structure of natural language. The theoretical basis of LSA, however, still needs to be expanded and further explained. Taking Chinese LSA as the subject and Chinese concept retrieval as the application background, this dissertation studies several difficult problems in LSA in detail, such as weight computation and the dimensional characteristics of the latent semantic space.

Weight computation is of great importance in LSA. Traditional weighting schemes are inherited from the Vector Space Model (VSM) and ignore the intrinsic differences between LSA and VSM. Once global term weights are defined, the dimensions of the resulting semantic space emphasize the semantic relationships among terms with larger weights. A document's semantics is represented by the terms it contains, while a term's semantics should be understood through the many documents that contain it. A global weight for documents is therefore proposed to correct the weight computation in LSA. Experiments show that the entropy weight outperforms other global weights, and that higher retrieval precision is obtained with fewer dimensions after the weighting model is extended.

Each dimension of the latent semantic space represents a latent concept. Without an explicit concept to compare it with, the dimensionality is hard to interpret, which limits the application and development of LSA. As more dimensions are eliminated, the correlations between terms evolve in a regular way: the broader relationships among documents (and terms) are mainly embodied in the dimensions corresponding to large singular values, while local relationships are embodied in those corresponding to small singular values. It is concluded that there is an implicit correspondence between the dimensionality of the latent semantic space and concept granularity. Hierarchical document clustering is used to verify this conclusion: different dimensionalities of the latent semantic space yield document clusterings of different concept granularity. It is also found that the proposed Document Self-Indexing Matrix effectively suppresses isolated points in clustering, greatly improving clustering accuracy.
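As an illustration of the pipeline sketched above, the following Python snippet (not taken from the dissertation) builds a toy latent semantic space: a term-document count matrix is weighted with the standard log-entropy scheme and then factored by truncated SVD. The toy matrix, the choice of k, and the function names are assumptions for illustration only.

```python
# A minimal sketch of the LSA pipeline described in the abstract:
# log-entropy weighting followed by truncated SVD. Toy data only.
import numpy as np

def log_entropy_weight(counts):
    """Apply local log weighting and global entropy weighting to a term-document count matrix."""
    counts = np.asarray(counts, dtype=float)
    local = np.log1p(counts)                                  # local weight: log(1 + tf)
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    n_docs = counts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = 1.0 + plogp.sum(axis=1) / np.log(n_docs)        # global entropy weight per term
    return local * entropy[:, None]

def latent_semantic_space(weighted, k):
    """Truncated SVD: return rank-k coordinates of terms and documents."""
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    term_coords = U[:, :k] * s[:k]      # terms in the k-dimensional latent space
    doc_coords = Vt[:k, :].T * s[:k]    # documents in the same space
    return term_coords, doc_coords, (U[:, :k], s[:k], Vt[:k, :])

if __name__ == "__main__":
    # rows = terms, columns = documents (purely illustrative counts)
    A = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 2, 0, 1],
                  [0, 0, 1, 2]])
    W = log_entropy_weight(A)
    terms, docs, _ = latent_semantic_space(W, k=2)
    print("term coordinates:\n", terms)
    print("document coordinates:\n", docs)
```

In a real system the count matrix would of course come from a segmented Chinese corpus rather than a toy array, and k would be chosen according to the dimensional characteristics discussed above.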
For practical application, two difficult problems in LSA-based retrieval systems are studied: quick retrieval and Boolean retrieval. Response time is an important criterion for evaluating an information retrieval system, and an LSA retrieval model cannot achieve quick retrieval simply by reusing traditional index data structures. Building on the study of the dimensional characteristics of the latent semantic space, a low-dimensionality quick retrieval algorithm is proposed that reduces computational complexity by quickly excluding irrelevant documents. In the compressed-encoding quick retrieval algorithm, the original LSA document vectors are represented by an approximate compressed encoding, and all possible correlation values of each dimension of the encoding are precomputed and stored in a lookup table; weighted 0-1 encoding is a typical compressed encoding of this kind. Experiments show that combining the compressed-encoding and low-dimensionality filtering algorithms makes it possible to find the target documents quickly.

Boolean expression retrieval is indispensable for users who need personalized and complex queries. Concepts from data field theory, such as potential, superposed potential and isopotential, are introduced into LSA-based Boolean semantic retrieval as an intuitive evaluation method.

In summary, the dissertation studies the theoretical basis of LSA in depth, including weight computation and dimensional characteristics, and solves two difficult problems in applying LSA to information retrieval. LSA is a field that depends heavily on experiments, so an experimental platform named the "Chinese Latent Semantic Analysis System" has been developed in this research. In this system, each key step of LSA corresponds to a dedicated experimental method, and the results are presented visually; every aspect of the dissertation is supported by experiments on this platform. LSA is a technology that may be widely applied to Chinese concept retrieval, and the conclusions drawn here provide guidelines and a basis for Chinese concept retrieval both theoretically and practically.
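As a rough illustration of the two-stage quick-retrieval idea described above, the sketch below first filters documents using only the leading latent dimensions (those of the large singular values) and then ranks the surviving candidates by full cosine similarity. It does not reproduce the dissertation's compressed-encoding lookup tables or weighted 0-1 encoding; the fold-in formula, the filter threshold and the parameter names are illustrative assumptions.

```python
# A hedged sketch of two-stage quick retrieval in a latent semantic space.
# Stage 1 uses only the leading dimensions as a cheap filter; stage 2 ranks
# the remaining candidates exactly. Parameters are illustrative.
import numpy as np

def fold_in_query(query_vec, U_k, s_k):
    """Project a raw (weighted term-count) query vector into the k-dim latent space."""
    return (query_vec @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def quick_retrieve(doc_coords, query_coords, filter_dims=2, threshold=0.0, top_n=10):
    """doc_coords: (n_docs, k) latent document vectors; query_coords: (k,) latent query."""
    # Stage 1: coarse filter on the leading dimensions, which capture the
    # broader, coarse-grained relationships; exclude clearly irrelevant docs.
    coarse = np.array([cosine(d[:filter_dims], query_coords[:filter_dims])
                       for d in doc_coords])
    candidates = np.where(coarse > threshold)[0]
    # Stage 2: exact cosine ranking in the full latent space, survivors only.
    scores = [(int(i), cosine(doc_coords[i], query_coords)) for i in candidates]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.normal(size=(100, 8))   # pretend latent document vectors (k = 8)
    query = rng.normal(size=8)         # pretend folded-in query
    print(quick_retrieve(docs, query, filter_dims=3, top_n=5))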
Keywords/Search Tags: Latent Semantic Analysis, Information Retrieval, Weight Computing, LSA Space Dimension Characteristics, Quick Query, Semantic Boolean Retrieval