Modeling Topic-based Semantics For Information Retrieval Models

Posted on: 2021-03-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F H Jian
Full Text: PDF
GTID: 1368330605958576
Subject: Education Technology
Abstract/Summary:
In the information age, search has become an important part of everyday life, and the information retrieval technology behind it is correspondingly important. Traditional information retrieval models are based on the bag-of-words assumption: they match queries and documents by their terms and return the related documents. One of their key components is explicit term frequency normalization, whose hyper-parameters generally need to be optimized. Most term association/dependency models and pseudo relevance feedback models also depend on explicit term features, which may lead to missed matches. To address this problem, researchers in information retrieval have started to use topic models to mine the implicit topic-based semantics of documents and terms. Topic models can be naturally integrated into the language model framework and achieve reasonably good performance in many cases. However, how to integrate the implicit topic-based semantics of documents and terms into other well-known traditional information retrieval models (e.g., the probabilistic model) and into pseudo relevance feedback models is still largely unexplored. In this dissertation, we study, respectively, a new term frequency normalization for the probabilistic model BM25, traditional retrieval models combined with the topic-based semantics of terms, and pseudo relevance feedback models with the topic-based semantics of documents. The main contributions are as follows.

Firstly, we propose a new term frequency normalization model for probabilistic information retrieval. In the probabilistic model BM25, term frequency normalization is one of the key components. It is controlled by the parameters k1 and b, which need to be optimized for each given dataset. We assume, and show empirically, that term frequency normalization should depend on query length in order to optimize retrieval performance. Following this intuition, we first propose a new term frequency normalization with query length for probabilistic information retrieval, namely BM25QL. BM25QL is then incorporated into the state-of-the-art model CRTER2, denoted CRTER2QL. A series of experiments on fourteen standard TREC datasets shows that the proposed approaches BM25QL and CRTER2QL are comparable to BM25 and CRTER2 with the optimal b setting in terms of MAP on almost all of the datasets.

Secondly, we propose a simple enhancement for ad-hoc information retrieval via topic modeling. Traditional information retrieval models, in which a document is normally represented as a bag of words and their frequencies, capture term-level and document-level information. Topic models, on the other hand, discover semantic topic-based information among words. We treat term-based information and semantic information as two features of query terms and combine them. In particular, four topic-based hybrid models are proposed: LDA-BM25, LDA-BM25QL, LDA-MATF and LDA-LM. A series of experiments on fourteen standard TREC datasets shows that the proposed models significantly outperform the corresponding strong baselines on all datasets in terms of MAP and on most datasets in terms of P@5 and P@20. A direct comparison on the fourteen standard datasets also indicates that the proposed models are at least comparable to the state-of-the-art approaches CRTER2 and LBDM.

Thirdly, we propose a simple reranking framework, TopRerank, that integrates topic similarity into traditional retrieval models. Within this framework, fitting the topic model on the top 1000 documents returned by the first-pass retrieval is an effective extension for topic-based retrieval models. The topic similarity between a document and the top 3 documents can be used as the topic-based relevance between the document and the query. We integrate this topic-based relevance into traditional retrieval models for reranking. Experimental results on fifteen standard TREC datasets show that the proposed topic-similarity-based reranking methods not only significantly outperform the corresponding baselines in terms of MAP and NDCG, but are also at least comparable to the state-of-the-art topic-based retrieval models.

Finally, we propose a pseudo relevance feedback framework that integrates the topic-based relevance of documents. Traditional pseudo relevance feedback via query expansion can improve retrieval performance. Expansion terms are generally selected based on their features in the feedback documents, but these methods treat each feedback document equally and do not take the relevance of the documents into account. Based on topic modeling, we propose a more universal topic-based relevance of documents for pseudo relevance feedback. In particular, we integrate the topic-based relevance of documents into two state-of-the-art models, Rocchio's model and the relevance model RM3, denoted TopRoc and TopRM3, and explore two formulas for calculating topic relevance. Experimental results on five representative TREC datasets show that the proposed TopRoc models significantly improve over the corresponding strong baselines on most datasets in terms of MAP, and that the TopRM3 models outperform the corresponding strong baselines on all datasets in terms of MAP. A direct comparison on the five standard TREC datasets also indicates that the proposed models are at least comparable to the state-of-the-art topic-based pseudo relevance feedback approaches.
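
To make the roles of k1 and b and the reranking idea more concrete, the following is a minimal Python sketch, not the dissertation's implementation. `bm25_term_score` is the standard BM25 term weight with a commonly used smoothed IDF; the query-length-dependent normalization of BM25QL is not specified in the abstract and is therefore not reproduced. `topic_rerank` is only one plausible reading of the TopRerank description: the use of cosine similarity, the averaging over the top documents, the linear interpolation, the `weight` parameter, and the assumption that first-pass scores are already scaled to [0, 1] are all illustrative assumptions, as are the function and parameter names.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Standard BM25 weight of one query term in one document."""
    # Document-length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term-frequency component; k1 controls how fast it saturates.
    tf_component = tf * (k1 + 1.0) / (tf + k1 * norm)
    # A commonly used smoothed IDF variant that stays non-negative.
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    return idf * tf_component


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_rerank(ranked, topic_vectors, top_n=3, weight=0.3):
    """Rerank first-pass results using topic similarity to the top documents.

    ranked: list of (doc_id, score), sorted by first-pass score, descending.
            Scores are ASSUMED to be pre-scaled to [0, 1].
    topic_vectors: dict mapping doc_id to its topic distribution
            (e.g. from a topic model fitted on the first-pass results).
    weight: hypothetical interpolation weight for the topic-based relevance.
    """
    top_ids = [doc_id for doc_id, _ in ranked[:top_n]]
    rescored = []
    for doc_id, score in ranked:
        # Topic-based relevance: mean similarity to the top-ranked documents.
        sims = [cosine(topic_vectors[doc_id], topic_vectors[t]) for t in top_ids]
        topic_rel = sum(sims) / len(sims)
        rescored.append((doc_id, (1.0 - weight) * score + weight * topic_rel))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a document whose topic distribution differs from that of the top-ranked documents is pushed down even if its term-based score is higher, which is the kind of effect a topic-similarity reranker aims for.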
Keywords/Search Tags: Term frequency normalization, Probabilistic model, Topic similarity, Reranking, Pseudo relevance feedback