Inverted Index Based Micro-blog Topic Detection

Posted on: 2014-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Y Liu  Full Text: PDF
GTID: 2268330392969573  Subject: Computer technology
Abstract/Summary:
With the rapid development of micro-blogging, an emerging form of social network, more and more users rely on it to post news. Because micro-blog posts propagate quickly and the spread of news has great significance, there is strong demand for detecting hot events in micro-blog data. Topic detection and tracking on this data is very challenging, however, because the data is large in scale, noisy, and made up of short texts.

After analyzing the shortcomings of traditional topic detection and tracking algorithms, this thesis proposes an inverted-index-based method that speeds up processing without harming accuracy. Hand-crafted rules, derived from an analysis of the micro-blog data, are used to remove noise. Topic detection and tracking is carried out after the inverted index is built. Newly detected events are sorted by the entropy of the event and the number of users involved. The top 20 events in the list are merged with the old events, and old events are reclaimed according to aging theory. This thesis also analyzes the results of the AP clustering algorithm on micro-blog data.

To verify the efficiency gain, the processing times of the inverted-index-based algorithm and the traditional SINGLE-PASS algorithm are compared on several data sets of different scales. The inverted-index algorithm is six times faster than the SINGLE-PASS algorithm. Because no standard corpus exists, a test set was built by manual labeling. The test set includes 26 events and a total of 2,817 documents. Analysis of the experimental results shows that the algorithm achieves better results. The effectiveness of different weight computation methods is also compared, as are the advantages and disadvantages of the AP clustering algorithm and the SINGLE-PASS algorithm.

Because of the large scale of micro-blog data, some of the events the algorithm detects are of little interest to users. After the events are sorted by entropy and number of users, the events that users are interested in appear at the front of the list.

A micro-blog topic detection and tracking system was designed and implemented using the proposed algorithm.
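
The abstract does not give the details of the inverted-index method, so the following is only a minimal sketch of one way an inverted index can speed up single-pass clustering, not the thesis's actual implementation. It assumes raw term-frequency weights, cosine similarity, and a fixed threshold of 0.5. The function names (cosine, single_pass_inverted) and all parameters are hypothetical. The idea shown is that each incoming post is compared only with clusters that share at least one term with it, rather than with every existing cluster as in plain single-pass clustering.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_inverted(docs, threshold=0.5):
    """Assign each tokenized document to a cluster id in one pass.

    Assumptions (not from the thesis): term-frequency weights,
    cosine similarity, and a fixed similarity threshold.
    """
    clusters = []               # cluster id -> summed term weights
    index = defaultdict(set)    # inverted index: term -> cluster ids
    labels = []

    for tokens in docs:
        vec = defaultdict(float)
        for t in tokens:
            vec[t] += 1.0

        # Candidate clusters are only those sharing a term with the post.
        candidates = set()
        for t in vec:
            candidates |= index.get(t, set())

        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            sim = cosine(vec, clusters[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        # Start a new cluster when nothing is similar enough.
        if best_id is None or best_sim < threshold:
            best_id = len(clusters)
            clusters.append(defaultdict(float))

        # Fold the post into the cluster and update the inverted index.
        for t, w in vec.items():
            clusters[best_id][t] += w
            index[t].add(best_id)
        labels.append(best_id)

    return labels
```

Under these assumptions, a call such as single_pass_inverted([["earthquake", "rescue"], ["football", "goal"]]) never compares the second post with the first cluster, because the two share no terms. The saving grows with the number of clusters that have no vocabulary in common with the incoming post.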
Keywords/Search Tags: topic detection and tracking, inverted index, AP algorithm, dynamic window