
Text Retrieval Method For Microblog Based On Language Modeling

Posted on: 2013-10-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y Li  Full Text: PDF
GTID: 2268330392967965  Subject: Computer Science and Technology
Abstract/Summary:
Microblog provides a new platform for information dissemination. On Microblog, people can share their feelings, experiences and opinions freely and quickly in at most 140 characters. Microblog has grown rapidly since its appearance. Twitter is popular around the world, with over 100 million users generating a vast number of posts every day. The same is happening with Weibo in China.
Along with this growth, Microblog has become an important source of information. Unlike traditional webpages, information on Microblog is mostly about real-time news and hot topics. "Information is nothing without retrieval". As the amount of information on Microblog grows, the main purpose of Microblog retrieval is to find what users need efficiently and effectively. This paper investigates two characteristics of Microblog retrieval: (1) relevance, which means that the information retrieved should be relevant to the user's query; (2) recency, which means that the information retrieved should be as new as possible. Popular Microblog search engines today share a simple retrieval model: posts that contain all terms of the query are ranked in reverse chronological order. Although this method takes both properties into account, its overly strict relevance judgment loses many relevant results.
This paper investigates the relevance and recency of Microblog retrieval within the framework of the language modeling (LM) approach. The LM approach has two main parts: the relevance model and the document prior model. The document prior model represents the intrinsic importance of a document and is independent of the query. Generally speaking, the more recent a Microblog post is, the more important it is. Therefore, the importance of a post can be captured by its creation time. This intuition leads to a document prior based on the creation time of the document, called the recency prior.
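The combination of a relevance model and a recency prior can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption: a multinomial query-likelihood model with Dirichlet smoothing, an exponential decay prior over the post's age, and the hypothetical names `lm_score`, `mu` and `rate`.

```python
import math
from collections import Counter


def lm_score(query_terms, doc_terms, collection_tf, collection_len,
             age_days, mu=100.0, rate=0.01):
    """Return log P(D) + sum of log P(q | D) over the query terms.

    Assumptions (not taken from the thesis):
    - P(q | D) is a multinomial model with Dirichlet smoothing (parameter mu).
    - P(D) is an exponential recency prior: rate * exp(-rate * age_days).
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_likelihood = 0.0
    for q in query_terms:
        p_collection = collection_tf.get(q, 0) / collection_len
        p = (tf[q] + mu * p_collection) / (doc_len + mu)
        if p == 0.0:
            # The term never occurs in the collection: the query cannot be generated.
            return float("-inf")
        log_likelihood += math.log(p)
    log_prior = math.log(rate) - rate * age_days
    return log_likelihood + log_prior
```

With two posts of identical text, the prior alone separates them: the newer post receives the higher score, and the gap equals `rate` times the age difference.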
Experimental results show that incorporating the recency prior improves performance by about 4% to 5%. Although the original relevance modeling method is based on the multiple Bernoulli distribution, the dominant method today is based on the multinomial distribution and is considered more effective than the multiple Bernoulli model. However, recent work shows that the multiple Bernoulli model can outperform the multinomial model in sentence retrieval, which suggests that it is effective for retrieving short texts. Investigating the application of the multiple Bernoulli model to Microblog retrieval is therefore worthwhile. Experimental results show that the multiple Bernoulli model outperforms the multinomial model on Microblog retrieval. Moreover, the multiple Bernoulli model is more stable than the multinomial model as the smoothing parameter varies. To sum up, the retrieval model that combines the multiple Bernoulli model with the recency prior gives the best performance.
Besides the traditional ranking method that orders results by relevance, this paper investigates ranking Microblog retrieval results by creation time, by re-ranking the results produced by the language modeling approach. We mainly study how to select the re-ranking threshold, and discuss the score-distribution method, which computes the threshold by modeling the document scores generated by the retrieval model. The scores of relevant documents are modeled with a Gaussian distribution and those of non-relevant documents with an exponential distribution. The parameters of the two models are estimated with the Expectation Maximization (EM) algorithm when no relevance judgments are available. Experimental results show that the score-distribution threshold selection method outperforms a manually fixed threshold by about 9%.
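The score-distribution idea can be sketched as EM over a two-component mixture. This is an illustration under stated assumptions, not the thesis implementation: scores are shifted so the minimum is zero (the exponential needs non-negative support), the initial guesses and the decision rule (keep scores whose posterior probability of relevance is at least 0.5) are choices made here, and all function names are hypothetical. `demo_scores` produces synthetic data only.

```python
import math
import random


def _posterior(v, weight, mu, sigma, lam):
    """Posterior probability that shifted score v came from the Gaussian component."""
    g = weight * math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    e = (1 - weight) * lam * math.exp(-lam * v)
    total = g + e
    return g / total if total > 0 else 0.0


def fit_score_mixture(scores, iters=200):
    """EM fit: relevant ~ Normal(mu, sigma), non-relevant ~ Exponential(lam).

    Assumes the scores are not all equal. mu is reported in shifted units.
    """
    shift = min(scores)
    x = [s - shift for s in scores]
    n = len(x)
    top = sorted(x)[-max(1, n // 4):]
    mean_x = sum(x) / n
    # Initial guesses (assumptions): relevant scores sit in the top quarter.
    weight, mu = 0.25, sum(top) / len(top)
    sigma = max(math.sqrt(sum((v - mean_x) ** 2 for v in x) / n), 1e-3)
    lam = 1.0 / mean_x
    for _ in range(iters):
        # E-step: responsibility of the Gaussian component for each score.
        resp = [_posterior(v, weight, mu, sigma, lam) for v in x]
        # M-step: weighted maximum-likelihood updates.
        r_sum = max(sum(resp), 1e-9)
        weight = r_sum / n
        mu = sum(r * v for r, v in zip(resp, x)) / r_sum
        var = sum(r * (v - mu) ** 2 for r, v in zip(resp, x)) / r_sum
        sigma = max(math.sqrt(var), 1e-3)
        lam = max(n - r_sum, 1e-9) / max(sum((1 - r) * v for r, v in zip(resp, x)), 1e-9)
    return {"shift": shift, "weight": weight, "mu": mu, "sigma": sigma, "lam": lam}


def select_threshold(scores, model, min_posterior=0.5):
    """Lowest observed score judged more likely relevant than not (None if none)."""
    kept = [s for s in scores
            if _posterior(s - model["shift"], model["weight"], model["mu"],
                          model["sigma"], model["lam"]) >= min_posterior]
    return min(kept) if kept else None


def demo_scores(seed=0):
    """Synthetic scores for illustration only: 200 non-relevant, 50 relevant."""
    rng = random.Random(seed)
    return ([rng.expovariate(1.0) for _ in range(200)]
            + [rng.gauss(6.0, 0.7) for _ in range(50)])
```

Calling `select_threshold(scores, fit_score_mixture(scores))` on a result list returns a cut-off without any relevance judgments, which is the property that distinguishes this approach from a manually fixed value.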
Moreover, the automatic threshold selection method avoids the problem that a threshold set manually, without any heuristics, is usually not optimal.
Finally, this paper combines the language modeling approach with the automatic threshold selection method to produce results ranked by creation time, and compares its performance with the popular Microblog retrieval method that ranks posts containing all query terms in reverse chronological order. Experimental results show that our method greatly outperforms the popular method, by about 78.3%.
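The two ranking strategies compared above differ only in how they filter posts before sorting by time. A minimal sketch follows; the `Post` record, its field names and both function names are hypothetical, and the retrieval score and threshold are assumed to come from a retrieval model and a threshold selection step computed elsewhere.

```python
from typing import NamedTuple


class Post(NamedTuple):
    post_id: str
    terms: frozenset   # distinct terms in the post
    created_at: int    # creation time, e.g. a Unix timestamp
    score: float       # retrieval model score (assumed precomputed)


def boolean_and_baseline(posts, query_terms):
    """Keep posts containing every query term, newest first."""
    required = set(query_terms)
    hits = [p for p in posts if required <= p.terms]
    return sorted(hits, key=lambda p: p.created_at, reverse=True)


def threshold_then_time(posts, threshold):
    """Keep posts scoring at or above the threshold, newest first."""
    kept = [p for p in posts if p.score >= threshold]
    return sorted(kept, key=lambda p: p.created_at, reverse=True)
```

A post that matches only some query terms is dropped by the baseline, but it can survive the threshold filter if its model score is high enough, which is how the strict relevance judgment is relaxed.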
Keywords/Search Tags: Microblog Retrieval, Language Modeling Approach, Multiple Bernoulli Model, Multinomial Model, Automatic Threshold Selection