
Research On Key Technologies Of Short Microblog Retrieval

Posted on: 2014-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: X W Li
Full Text: PDF
GTID: 2268330422451698
Subject: Computer Science and Technology
Abstract/Summary:
A microblog is a short text message of no more than 140 characters shared through social media such as Twitter or Sina Weibo. Microblogs, used by hundreds of millions of users, are attractive insofar as they promise access to timely information written by people whom users have chosen to follow. Within the past five years, microblogs have developed rapidly and have become a typical representative of social media, as well as an indispensable source of timely information. In this work, we use "microblog" to denote microblogs shared through Twitter.

Microblog data has grown explosively. Helping users find exactly the microblog posts they are interested in has therefore become an important task of microblog information retrieval. Owing to the limited text length, informal writing, noise, and real-time nature of microblogs, traditional IR models run into difficulty. To address this issue, this thesis studies key technologies for resolving the pressing problems faced in microblog retrieval. In detail, the thesis covers the following four aspects:

1. Hot-time-based language modeling approach. In this section, we first investigate two time-based language models, both of which rest on the hypothesis that "the newer a document is, the more important it is". We then demonstrate, by analyzing the temporal distribution of relevant documents for specific queries, that this assumption does not always hold. Finally, we define a query's hot time and propose a hot-time-based language model approach to retrieving microblog posts. We also compare it experimentally with prior approaches.

2. Query modeling integrated with temporal information. In this section, we utilize temporal properties (e.g. recency and temporal variation) to enrich a user query and improve retrieval performance. In detail, we explore three query expansion (QE) methods. The first QE model, based on recency, suggests candidate terms for the query, favoring more recent documents. The second QE model can handle temporal variations consisting of an old peak far from the query time or a multimodal temporal variation. This model selects good expansion terms based on the minimum KL-divergence between the temporal profiles of the original query and the expanded query. The third QE model combines the first two time-aware QE methods through adaptive weighting, under the assumption that document ages are generated from a Gaussian distribution.

3. Microblog retrieval based on the reference document model (RDM). In this section, we introduce RDM to estimate the document model more accurately, building on our analysis of the potential difficulties in estimating the document model of a single microblog. We also study the impact of document expansion (DE) on performance. Furthermore, we build both a query model and a document model through a pseudo feedback process over reference documents. The experimental results show that RDM improves microblog retrieval performance dramatically compared with traditional methods. The thesis also finds that leveraging the content of URLs included in short microblog messages can greatly enhance retrieval performance.

4. Ranking tweets via learning to rank. In this section, we propose a new ranking strategy that considers not only the content relevance of a tweet but also account authority and tweet-specific features, such as whether a URL link exists in the tweet. We employ learning-to-rank algorithms to fuse these different features into a better tweet ranking function. Through a series of experiments, we try to determine the best set of features by analyzing the effect of each individual feature and applying various feature subset selection methods.
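
To illustrate the second QE model in aspect 2, the following is a minimal Python sketch of selecting expansion terms by minimum KL-divergence between temporal profiles. It is not the thesis's implementation. The histogram binning, the additive smoothing, the direction of the divergence, and all function and parameter names (temporal_profile, kl_divergence, select_expansion_terms, num_bins, smoothing) are assumptions made for illustration. The sketch also assumes that, for each candidate term, the timestamps of the documents retrieved by the expanded query are already available.

```python
import math


def temporal_profile(timestamps, t_min, t_max, num_bins=10, smoothing=1e-3):
    """Turn document timestamps into a smoothed probability histogram.

    Assumption: a temporal profile is approximated by equal-width bins
    over [t_min, t_max] with additive smoothing so no bin has zero mass.
    """
    if t_max <= t_min or num_bins < 1:
        raise ValueError("need t_max > t_min and num_bins >= 1")
    width = (t_max - t_min) / num_bins
    counts = [0.0] * num_bins
    for t in timestamps:
        idx = int((t - t_min) / width)
        idx = max(0, min(idx, num_bins - 1))  # clamp out-of-range values
        counts[idx] += 1.0
    total = sum(counts) + smoothing * num_bins
    return [(c + smoothing) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def select_expansion_terms(query_times, candidate_times, k, t_min, t_max,
                           num_bins=10):
    """Return the k candidate terms whose expanded-query temporal profile
    is closest (minimum KL-divergence) to the original query's profile.

    query_times: timestamps of documents retrieved by the original query.
    candidate_times: dict mapping a candidate term to the timestamps of
        documents retrieved by the query expanded with that term
        (hypothetical input format).
    """
    query_profile = temporal_profile(query_times, t_min, t_max, num_bins)
    scored = []
    for term, times in candidate_times.items():
        expanded = temporal_profile(times, t_min, t_max, num_bins)
        scored.append((kl_divergence(query_profile, expanded), term))
    scored.sort()  # smallest divergence first
    return [term for _, term in scored[:k]]
```

Calling select_expansion_terms with a query whose retrieved documents cluster around two time periods would rank a candidate term with a similar two-peak profile ahead of one whose documents fall in an unrelated period, which is how a multimodal temporal variation can be respected under these assumptions.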
Keywords/Search Tags: microblog retrieval, language model, temporal information retrieval, reference document model, learning to rank