
A Study Of Concept-based Information Retrieval Model

Posted on: 2013-10-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X H Tu
Full Text: PDF
GTID: 1228330371974879
Subject: Education Technology
Abstract/Summary:
With the rapid spread of the Internet and the continual emergence of cheap mass storage devices, human society has produced vast amounts of digital documents. This enormous body of documents can be considered a treasure trove of human knowledge, but it also makes us increasingly dependent on information retrieval systems to find the information we need. Traditional information retrieval models usually represent documents and queries with the bag-of-words model. However, human natural language is a very complex system of symbols, and there exist various complex relations between words, including synonymy, ambiguity, and semantic relatedness. The simple bag-of-words model ignores these rich semantic relationships and is therefore far from able to characterize the complex semantic information inherent in natural language.

In this thesis, a concept refers to a basic unit of meaning. Our understanding of natural language is naturally a process of semantic association and imagination, supported by the complex physiological organization of the tens of billions of neurons in the human brain. Seamlessly integrating the semantic knowledge contained in the concepts related to a text with the traditional text representation model is a potential way to build semantic-based retrieval systems, and it is the key problem addressed in this thesis. We present a comprehensive study of concept-based information retrieval systems. The main contributions are as follows:

1) We propose several approaches to generate concept annotations for different types of text. For texts in document collections from professional domains such as biomedicine, we can directly use the concepts marked by experts. More commonly, however, the text does not contain any concept annotation. We therefore construct a common concept system based on Wikipedia knowledge and propose a supervised learning based approach to automatically generate Wikipedia concept annotations for texts. In addition, we propose an automated concept extraction method to generate concept annotations for Chinese texts.

2) We propose several approaches to generate semantic representation models for concepts in different types of concept system. For concepts in professional dictionaries, we use a mutual information based approach. For concepts in Wikipedia, we propose both a mixture model based approach and a mutual information based approach. In addition, we propose a semantic relatedness based representation model for concepts automatically extracted from Chinese text.

3) We propose a novel concept-based smoothing method for the document model. The semantic document model is generated by seamlessly integrating the semantic representation models of concepts into the bag-of-words document model. Experiments are conducted on several standard retrieval test collections, including professional document collections and news collections. The results show that this approach performs significantly better than traditional retrieval models.

4) We propose a novel concept-based smoothing method for the query model. Two concept annotation models are developed to annotate queries with concepts. In the first model, the concepts in pseudo-relevance feedback documents are used as candidate concepts. The second model directly uses the concepts interactively selected by the user to generate the query model. Experiments are conducted on several standard retrieval test collections, including professional document collections and news collections. The results show that this method significantly improves retrieval performance relative to traditional retrieval models.

5) We propose a Chinese retrieval model based on semantic relatedness between concepts. In this model, various important features, including semantic relatedness between concepts and traditional features, are seamlessly integrated into a machine learning based retrieval framework. Experiments are conducted on several standard Chinese retrieval test collections. The results show that this method significantly improves retrieval performance relative to the BM25 model.
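
As an illustration of the concept-based smoothing of the document model described in contribution 3, the following is a minimal sketch and not the formulation used in the thesis, which the abstract does not give. It assumes a Dirichlet-smoothed unigram document model, a uniform distribution over the concepts annotated for a document, and a linear interpolation weight `lam`; the function names, the toy concept model, and all parameter values are hypothetical.

```python
from collections import Counter


def dirichlet_lm(doc_tokens, collection_prob, mu=2000.0):
    """Return p(w|d) for a Dirichlet-smoothed bag-of-words document model."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)

    def prob(word):
        return (counts[word] + mu * collection_prob.get(word, 0.0)) / (doc_len + mu)

    return prob


def concept_smoothed_lm(doc_tokens, doc_concepts, concept_models,
                        collection_prob, lam=0.3, mu=2000.0):
    """Interpolate the word-based model with word distributions of concepts.

    Assumption: p(c|d) is uniform over the annotated concepts, and
    concept_models[c] maps each word to p(w|c).
    """
    base = dirichlet_lm(doc_tokens, collection_prob, mu)

    def prob(word):
        if not doc_concepts:
            # No annotation available: fall back to the word-based model.
            return base(word)
        concept_part = sum(
            concept_models[c].get(word, 0.0) for c in doc_concepts
        ) / len(doc_concepts)
        return (1.0 - lam) * base(word) + lam * concept_part

    return prob


# Toy data (hypothetical): the document never mentions "cardiac",
# but its annotated concept does.
collection_prob = {"heart": 0.25, "attack": 0.25, "cardiac": 0.25, "infarction": 0.25}
concept_models = {"Myocardial infarction": {"cardiac": 0.5, "infarction": 0.5}}
doc_tokens = ["heart", "attack"]

base = dirichlet_lm(doc_tokens, collection_prob, mu=2.0)
smoothed = concept_smoothed_lm(
    doc_tokens, ["Myocardial infarction"], concept_models,
    collection_prob, lam=0.3, mu=2.0,
)
print(base("cardiac"), smoothed("cardiac"))
```

In this toy setting the concept annotation shifts probability mass toward semantically related words that the document does not contain, which is the kind of effect a plain bag-of-words model cannot provide. How the thesis estimates the concept distributions and the interpolation weight is not stated in the abstract.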
Keywords/Search Tags: Concept-based information retrieval, Language model, Document model, Query model, Learning to rank, Chinese index, Concept annotation