Blog Oriented Real-time Hot Event Detection And Tracking Approach

Posted on: 2011-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: J C Weng
Full Text: PDF
GTID: 2178360332458125
Subject: Computer Science and Technology
Abstract/Summary:
Topic detection and tracking (TDT) has been an active research area both in China and abroad, with broad application prospects such as public opinion monitoring. Our research goal is to detect and track domestic and international hot events through blogs and to report them to users in real time. Most traditional clustering algorithms do not solve TDT problems well because they do not cluster by topic.

In this paper, we propose a novel algorithm for automatic online hot event detection and tracking. The algorithm has three steps. First, we propose a novel similarity function that clusters stories according to their title words, which highlights the importance of blog title keywords and improves the effectiveness of topic detection. Second, we identify the valuable title clusters among the candidate title clusters and perform topic detection on them, removing unrelated stories using the generated topic template vector and our similarity function. Finally, we refresh the topic template vector after cleaning each valuable title cluster, and use the refreshed template vector to recall the remaining stories and newly arriving stories related to the event.

For our experiments, we constructed two datasets. Dataset1 contains 13252 news pages covering 28 events. Dataset2 contains 1589 blog pages covering 40 events. Our algorithm achieves 95.04% precision and a 90.17% F-measure on Dataset1, and 92.18% precision and an 84.43% F-measure on Dataset2.

We also implemented a real-time system for hot blog event detection and tracking based on the proposed algorithm. It has run for approximately 120 days and detected thousands of hot events, with more than 70 thousand related blog stories in total. We randomly selected 648 events and labeled them manually; the precision of the labeled output reached 84%. This lays a solid foundation for replacing manual editing, saving human resources, and keeping hot-event reporting timely, comprehensive, and accurate.
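
The three steps described in the abstract (title clustering, cleaning with a topic template vector, then refreshing the template and recalling stories) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual similarity function, so this sketch substitutes a title-weighted cosine similarity. The names `TITLE_WEIGHT`, `vectorize`, `title_clusters`, `template`, and `detect_events`, the thresholds, the use of cluster size as the test for a "valuable" cluster, and the sample stories are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

TITLE_WEIGHT = 3.0  # assumed boost for title words; not a value from the thesis


def vectorize(title, body, title_weight=TITLE_WEIGHT):
    """Bag-of-words vector in which title words count extra (an assumption)."""
    vec = Counter()
    for word in body.lower().split():
        vec[word] += 1.0
    for word in title.lower().split():
        vec[word] += title_weight
    return vec


def cosine(a, b):
    """Plain cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def title_clusters(stories, threshold):
    """Step 1: single-pass clustering that compares title words only."""
    clusters = []  # each cluster is a list of story indices
    for i, (title, _body) in enumerate(stories):
        words = Counter(title.lower().split())
        for cluster in clusters:
            seed_title = stories[cluster[0]][0]
            if cosine(words, Counter(seed_title.lower().split())) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def template(vectors):
    """Topic template vector, taken here as the centroid of member vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds counts
    return Counter({k: v / len(vectors) for k, v in total.items()})


def detect_events(stories, title_thr=0.5, topic_thr=0.25, min_size=2):
    vecs = [vectorize(t, b) for t, b in stories]
    events, assigned = [], set()
    for cluster in title_clusters(stories, title_thr):
        if len(cluster) < min_size:  # Step 2: size stands in for "valuable"
            continue
        tmpl = template([vecs[i] for i in cluster])
        kept = [i for i in cluster if cosine(vecs[i], tmpl) >= topic_thr]
        if not kept:
            continue
        tmpl = template([vecs[i] for i in kept])  # Step 3: refresh the template
        recalled = [i for i in range(len(stories))
                    if i not in kept and i not in assigned
                    and cosine(vecs[i], tmpl) >= topic_thr]
        members = sorted(kept + recalled)
        assigned.update(members)
        events.append(members)
    return events


# Hypothetical sample stories as (title, body) pairs.
stories = [
    ("earthquake hits city", "rescue teams search rubble after earthquake"),
    ("earthquake hits city again", "aftershock earthquake rescue continues"),
    ("my cat photos", "cute cat sleeping on sofa"),
    ("rescue update", "earthquake rescue teams find survivors in city rubble"),
]
events = detect_events(stories)
print(events)
```

In this toy run the two stories with near-identical titles form the seed cluster, and the fourth story, whose title shares no words with them, is recalled only after the template vector has been built from the cleaned cluster. That is the role the abstract assigns to the final recall step.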
Keywords/Search Tags: Event Detection and Tracking, Web Mining, Topic Cluster, Blog