Hot Blog Topic Mining Approach

Posted on: 2011-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
GTID: 2178330338489584
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the volume of online data is growing explosively, and obtaining useful information from it quickly and accurately has become a major concern. Topic detection and tracking is an active research area with broad application prospects, such as public opinion monitoring. Our research goal is to mine domestic and international hot events from blogs and report them to users in real time. Most traditional clustering algorithms do not handle the problems of topic mining well and are therefore hard to apply directly. In this thesis, we propose a novel algorithm for hot event mining. It resembles partition-based clustering algorithms, but the bucket size is not fixed in advance; it is determined by the title keywords. Within each bucket, we cluster documents into micro-clusters with a single-pass algorithm, and we then merge similar micro-clusters from different buckets by hierarchical clustering. To improve the performance of the algorithm, we introduce event templates, seed documents, a time window, feature selection, and a modified similarity function.

In our experiments, we construct three datasets. Dataset1 contains 13,252 news pages covering 28 events. Dataset2 contains 1,589 blog pages covering 40 events. Dataset3 is the TDT4 corpus. Our algorithm achieves 93.04% precision and 91.73% recall on Dataset1, and 92.18% precision and 82.37% recall on Dataset2. The cost on TDT4 is 0.48.

Based on the proposed algorithm, this thesis also revises and improves a system for mining hot blog events. The system has run for approximately 15 months and has detected thousands of hot events, with more than 226,373 related blog stories in total. We randomly selected 648 events and labeled them manually; the precision on this labeled sample reaches 83%, which provides strong support for detecting events automatically.
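The three stages named in the abstract (bucketing by title keyword, single-pass micro-clustering inside each bucket, and hierarchical merging across buckets) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the raw term-count vectors, the cosine similarity, both thresholds, and the rule of using the longest title token as the bucket key are all assumptions made for illustration. The event templates, seed documents, time window, feature selection, and modified similarity function are left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization; real blog text needs a proper tokenizer.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    # Each document joins its most similar existing micro-cluster,
    # or starts a new one if no cluster reaches the threshold.
    clusters = []
    for doc_id, vec in docs:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # centroid kept as summed counts
            best["members"].append(doc_id)
        else:
            clusters.append({"centroid": Counter(vec), "members": [doc_id]})
    return clusters


def merge_clusters(clusters, threshold):
    # Greedy agglomerative merging: repeatedly merge the most similar pair
    # of micro-clusters until no pair reaches the threshold.
    clusters = [{"centroid": Counter(c["centroid"]), "members": list(c["members"])}
                for c in clusters]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i]["centroid"], clusters[j]["centroid"])
                if sim >= best_sim:
                    best_pair, best_sim = (i, j), sim
        if best_pair is None:
            break
        i, j = best_pair
        clusters[i]["centroid"].update(clusters[j]["centroid"])
        clusters[i]["members"].extend(clusters[j]["members"])
        del clusters[j]
    return clusters


def mine_events(posts, bucket_threshold=0.3, merge_threshold=0.5):
    # posts: iterable of (doc_id, title, body) tuples.
    buckets = defaultdict(list)
    for doc_id, title, body in posts:
        title_tokens = tokenize(title)
        # Hypothetical bucket key: the longest title token stands in for
        # the thesis's title keyword.
        key = max(title_tokens, key=len) if title_tokens else ""
        buckets[key].append((doc_id, Counter(tokenize(title + " " + body))))
    micro = []
    for docs in buckets.values():
        micro.extend(single_pass(docs, bucket_threshold))
    return merge_clusters(micro, merge_threshold)
```

Calling `mine_events` on a list of `(doc_id, title, body)` tuples returns a list of clusters, each holding a `members` list of document ids. The merge step in this sketch compares every pair of micro-clusters, including pairs from the same bucket, which is a simplification of the cross-bucket merge the abstract describes.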
Keywords/Search Tags: Topic Mining, Text Mining, Text Clustering