Using Statistical Language Modeling For Ad Hoc Information Retrieval

Posted on: 2007-07-30  Degree: Doctor  Type: Dissertation
Country: China  Candidate: G D Ding  Full Text: PDF
GTID: 1118360185995704  Subject: Computer software and theory
Abstract/Summary:
Building on powerful statistical inference theory, statistical language modeling (SLM) has gradually become one of the crucial techniques in language information processing. Since 1998, a large number of studies have applied language models to ad hoc information retrieval (IR). In recent years, work by many research groups has confirmed that language modeling approaches offer a theoretically attractive and potentially very effective probabilistic framework for IR.

In this dissertation we describe the language modeling approaches to IR in detail, focusing on the basic idea and principle of the query likelihood retrieval model. We also present some significant extensions and improvements to the query likelihood model, such as the KL-Divergence retrieval model. Building on this background, we focus our research on the problems of document modeling and query modeling in the language modeling approaches, including document language model estimation and smoothing, heuristic query expansion and its integration into the query likelihood retrieval model, and query language model re-estimation in the KL-Divergence retrieval model.

In the language modeling approaches, a core technique for document language model estimation is smoothing, which adjusts the maximum likelihood estimator to correct the inaccuracy caused by data sparseness. We study and compare several popular smoothing methods and their influence on retrieval performance by examining how sensitive retrieval performance is to the smoothing parameters on different collections. We propose a new linear-interpolated smoothing method, GJM-2, which uses the number of unique terms in the document to improve the accuracy of language model estimation. Furthermore, considering that most traditional smoothing methods neglect the differences between terms, we propose a Term-Risk based smoothing model that incorporates a risk component related to each term into traditional methods. Experimental results show that using GJM-2 or the Term-Risk based smoothing model in the language modeling approach achieves better retrieval performance than the existing popular smoothing methods on both short and long queries.

In information retrieval, heuristic query expansion (HQE) is an important technique for improving retrieval performance. We study how to integrate local-analysis based HQE into the query likelihood retrieval model. Considering the inherent deficiency of the query likelihood model for HQE, we propose the WQL (Weighted Query Likelihood) retrieval model by...
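
To make the query likelihood model and linear-interpolated smoothing described in the abstract concrete, the following is a minimal sketch. It uses standard Jelinek-Mercer smoothing, p(w|d) = (1 - λ) p_ml(w|d) + λ p(w|C), not GJM-2 or the Term-Risk model, whose formulas are not given in this abstract. The function name, the toy documents, and the value of `lam` are assumptions made for illustration only.

```python
import math
from collections import Counter


def jm_query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Return log p(query | doc) under Jelinek-Mercer smoothing.

    Hypothetical helper for illustration; `lam` is the weight given to
    the collection (background) language model.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        # Maximum likelihood estimate from the document alone.
        p_ml = doc_counts[term] / doc_len if doc_len else 0.0
        # Background estimate from the whole collection.
        p_coll = collection_counts[term] / collection_len
        # Linear interpolation of the two estimates.
        p = (1 - lam) * p_ml + lam * p_coll
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (assumed data, not from the dissertation).
docs = {
    "d1": "language model smoothing for retrieval".split(),
    "d2": "query expansion improves retrieval".split(),
}
collection_counts = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_counts.values())

query = "smoothing retrieval".split()
scores = {
    name: jm_query_log_likelihood(query, d, collection_counts, collection_len)
    for name, d in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup, `d2` does not contain "smoothing", so an unsmoothed maximum likelihood estimate (`lam=0`) would give it zero probability; interpolating with the collection model keeps its score finite, which is the data sparseness correction the abstract refers to.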
Keywords/Search Tags: Information Retrieval, Statistical Language Modeling, Query Likelihood Retrieval Model, Document Language Model, Smoothing, GJM-2, Query Expansion, WQL, LOCOOC, KL-Divergence Retrieval Model, Word Association Network, Query Language Model