
Information Retrieval Oriented Analysis Of Text Content

Posted on: 2008-09-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Hu
Full Text: PDF
GTID: 1118360242476098
Subject: Computer software and theory
Abstract/Summary:
Information retrieval is a key problem in information services and the main means by which people cope with the "information explosion". Research on how to organize and search information automatically and effectively has high theoretical and practical value for the use of large-scale information. Retrieval research covers retrieval models, information processing, and applications. This thesis presents methods for each of these problems, with text data as the object of study. First, it presents a retrieval model based on recursive conceptual graphs. Second, it explores an approach to extracting knowledge organized as conceptual (attribute, value) structures from a machine-readable dictionary, and proposes a method for automatically constructing relations, labeled by attribute names, between concepts in unstructured text. Finally, it explores text clustering and sentiment analysis as forms of text information processing.

Specifically, this thesis makes the following contributions to information retrieval:

(1) A recursive conceptual graph formalism is presented to describe the meaning of document contents and users' queries in a specified domain. The formalism is defined on the (attribute, value) structure. It uses nested conceptual graphs that correspond to combinations of syntactic constituents, with the aim of mapping syntactic structure to semantic structure. This parallelism could allow semantic analysis and syntactic analysis to be synchronized in future work. Using this recursive formalism, the thesis indexes a set of documents and queries and proposes a new graph comparison algorithm to address the relevance issue.

(2) A Chinese machine-readable dictionary is exploited to extract conceptual knowledge, i.e. the (attribute, value) structure, from the definitions of nominal entries. A comparison of previous work on acquiring word knowledge from free text and from dictionaries shows that a dictionary is an advantageous resource for extracting discriminative knowledge about concepts. The method focuses on constructing attribute-value extraction patterns and on the statistical decision of when to apply them. The work is therefore designed as a new three-step procedure, which differs from previous dictionary extraction studies that parse the definitions first.

(3) To serve conceptual graph based retrieval, the thesis explores a bootstrapping method that automatically extracts semantic patterns from a large-scale corpus in order to identify three relations between Chinese concepts in context. It differs from other bootstrapping methods in two ways: it introduces a bi-sequence alignment algorithm from bioinformatics to generate candidate patterns, and it gives a new metric for evaluating pattern confidence, which improves extraction quality in the next iteration. For automatic recognition of these three relations, the experiments show that the pattern set generated by this method achieves higher coverage and precision than DIPRE.

(4) A new text similarity for text clustering is presented, which combines the cosine measure with quantified conceptual relations by linear interpolation. These relations are derived from dictionary entries and the words in their definitions, and are quantified under the assumption that an entry and its definition are equivalent in meaning. Relations of this kind are regarded as "knowledge" for text clustering. Within the framework of the k-means algorithm, the new interpolated similarity significantly improves the performance of the clustering system in terms of optimizing hard and soft criterion functions. The results show that introducing conceptual relations from the unstructured dictionary into the similarity measure can contribute to text clustering.

(5) The thesis presents a generative model for sentiment analysis based on the language modeling approach. By characterizing the semantic orientation of documents as "favorable" or "unfavorable", this method captures subtle information needed in text retrieval. The thesis explores both global and local language modeling approaches. For global language modeling, it uses the Kullback-Leibler divergence between the language model estimated from the test document and the two trained sentiment models. For local analysis, it uses the dependency links between a domain "term" and the other ordinary words in its context by exploiting a triggered language model. The improved results motivate the search for more suitable language models for sentiment detection in future research.
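The interpolated similarity described in contribution (4) can be sketched in Python as below. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract says only that the cosine measure is combined with quantified conceptual relations by linear interpolation. The weight `lam`, the function names, the hypothetical `relations` lookup table, and the choice to average relation strengths over all cross-document term pairs are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relation_score(a: Counter, b: Counter, relations: dict) -> float:
    """Average relation strength over all cross-document term pairs.

    `relations` is a hypothetical table mapping (term, term) pairs to a
    strength in [0, 1]; how such strengths are derived from dictionary
    definitions is not specified here.
    """
    pairs = [(s, t) for s in a for t in b]
    if not pairs:
        return 0.0
    total = sum(
        relations.get((s, t), relations.get((t, s), 0.0)) for s, t in pairs
    )
    return total / len(pairs)


def interpolated_similarity(
    a: Counter, b: Counter, relations: dict, lam: float = 0.7
) -> float:
    """Linear interpolation of cosine similarity and the relation score."""
    return lam * cosine(a, b) + (1.0 - lam) * relation_score(a, b, relations)
```

With this form, two documents that share no surface terms (for example one using "car" and the other "automobile") can still receive a non-zero similarity when the relation table links their terms, which is the kind of effect the interpolation is meant to provide.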
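The global language modeling idea in contribution (5) can be illustrated with a small Python sketch. It is an assumption-laden toy, not the thesis's model: it uses unigram models with add-one smoothing (the abstract does not state the model order or smoothing), computes the divergence from the document model to each sentiment model (the direction is an assumption), and the names `unigram_model`, `kl_divergence`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def classify(doc_tokens, favorable_tokens, unfavorable_tokens):
    """Label a document by the sentiment model closest to it in KL divergence."""
    vocab = set(doc_tokens) | set(favorable_tokens) | set(unfavorable_tokens)
    doc = unigram_model(doc_tokens, vocab)
    fav = unigram_model(favorable_tokens, vocab)
    unfav = unigram_model(unfavorable_tokens, vocab)
    if kl_divergence(doc, fav) <= kl_divergence(doc, unfav):
        return "favorable"
    return "unfavorable"
```

Smoothing keeps every probability strictly positive, so the logarithm in the divergence is always defined even for words that never occur in one of the training sets.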
Keywords/Search Tags: retrieval model, dictionary extraction, conceptual relation construction, text clustering, sentiment analysis